pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.18k stars 17.77k forks

ENH: Extend to_numeric to Convert Hexadecimal, Octal, and Binary Strings with Prefixes #59207

Open beci opened 1 month ago

beci commented 1 month ago

Feature Type

Problem Description

Extend the pandas.to_numeric function to support the conversion of strings representing hexadecimal, octal, and binary numbers when they start with the corresponding prefixes (0x, 0o, 0b).

s = pd.Series(["1.0", "2", -3, "0x32"])
pd.to_numeric(s)  # , errors="coerce")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File lib.pyx:2391, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "0x32"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[39], line 2
      1 s = pd.Series(["1.0", "2", -3, "0x32"])
----> 2 pd.to_numeric(s)  # , errors="coerce")

File oSDH5rfs-py3.11\Lib\site-packages\pandas\core\tools\numeric.py:232, in to_numeric(arg, errors, downcast, dtype_backend)
    230 coerce_numeric = errors not in ("ignore", "raise")
    231 try:
--> 232     values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload]
    233         values,
    234         set(),
    235         coerce_numeric=coerce_numeric,
    236         convert_to_masked_nullable=dtype_backend is not lib.no_default
    237         or isinstance(values_dtype, StringDtype)
    238         and not values_dtype.storage == "pyarrow_numpy",
    239     )
    240 except (ValueError, TypeError):
    241     if errors == "raise":

File lib.pyx:2433, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "0x32" at position 3

Feature Description

pandas.to_numeric is a versatile function for converting various data types to numeric values. However, it currently cannot convert strings that represent numbers in other bases (hexadecimal, octal, and binary), even when they carry the standard 0x, 0o, or 0b prefix. Supporting these prefixes would make the function more useful and bring it in line with Python's built-in int(), which accepts them when called with base 0, and with Python's own integer literal syntax.

Modify the pandas.to_numeric function to detect strings starting with 0x (hexadecimal), 0o (octal), and 0b (binary) and convert them to their corresponding integer values.
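
For reference, the prefix detection described above can be sketched in plain Python. This is only an illustration of the requested behavior, not how pandas would necessarily implement it: `parse_prefixed_int` is a hypothetical helper name, and it relies on the built-in `int()` inferring the base from the prefix when called with base 0.

```python
def parse_prefixed_int(text: str) -> int:
    """Hypothetical helper: parse a 0x/0o/0b-prefixed string with base-0 int()."""
    stripped = text.strip()
    # Ignore an optional leading sign when looking for the prefix.
    unsigned = stripped.lstrip("+-")
    if unsigned[:2].lower() not in ("0x", "0o", "0b"):
        raise ValueError(f"no base prefix in {text!r}")
    # Base 0 tells int() to infer the base from the prefix.
    return int(stripped, 0)


for sample in ["0x32", "0o32", "0b11010", "-0X1a"]:
    print(sample, "->", parse_prefixed_int(sample))
```

Using base 0 avoids a separate branch per prefix and also handles upper-case prefixes and a leading sign.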

Alternative Solutions

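import pandas as pd

def extended_to_numeric(series):
    """Convert a Series to numbers, also accepting 0x/0o/0b-prefixed strings."""
    def convert_value(value):
        if isinstance(value, str):
            # Choose the base from the prefix (either letter case).
            # Note: a malformed prefixed string such as '0xZZ' raises ValueError here.
            if value.startswith(('0x', '0X')):
                return int(value, 16)
            elif value.startswith(('0o', '0O')):
                return int(value, 8)
            elif value.startswith(('0b', '0B')):
                return int(value, 2)
        # Everything else goes through pandas; unparseable values become NaN.
        return pd.to_numeric(value, errors='coerce')

    return series.apply(convert_value)

# Example usage
data = pd.Series(['0x1A', '0o32', '0b11010', '42', 'invalid'])
numeric_data = extended_to_numeric(data)
print(numeric_data)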
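
The workaround above calls pd.to_numeric once per element and raises on a malformed prefixed string such as '0xZZ'. A possible variant is sketched below, under stated assumptions: `to_numeric_with_prefixes` and `_parse_or_nan` are hypothetical names, the result is always float64 for simplicity (so very large integers lose precision), and every unparseable value is coerced to NaN. It converts the whole Series with pd.to_numeric first and then fills in only the prefixed entries.

```python
import re

import numpy as np
import pandas as pd

# Optional sign followed by a 0x / 0o / 0b prefix, in either letter case.
PREFIX_RE = re.compile(r"[+-]?0[xXoObB]")


def _parse_or_nan(text: str) -> float:
    """Parse a prefixed string with base-0 int(); return NaN if it is malformed."""
    try:
        return float(int(text.strip(), 0))
    except ValueError:
        return float("nan")


def to_numeric_with_prefixes(series: pd.Series) -> pd.Series:
    """Hypothetical wrapper: pd.to_numeric plus 0x/0o/0b-prefixed strings."""
    # One pass through pandas; anything it cannot parse becomes NaN.
    base = pd.to_numeric(series, errors="coerce").astype("float64")
    values = base.to_numpy(copy=True)
    # Find the string entries that carry a base prefix.
    mask = np.array(
        [isinstance(v, str) and PREFIX_RE.match(v.strip()) is not None for v in series],
        dtype=bool,
    )
    # Overwrite only those positions with the base-aware parse.
    values[mask] = np.array([_parse_or_nan(v) for v in series[mask]], dtype="float64")
    return pd.Series(values, index=series.index, name=series.name)


print(to_numeric_with_prefixes(pd.Series(["1.0", "2", -3, "0x32", "0xZZ"])))
```

Coercing to NaN mirrors errors="coerce"; a fuller version would need to honor the other `errors` modes and preserve integer dtypes, which this sketch does not attempt.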

Additional Context

No response

AnaDenisa commented 1 month ago

take