pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Incorrect parsing of scientific notation when converting `pl.String` to `pl.Decimal` #19367

Open plaflamme opened 2 hours ago

plaflamme commented 2 hours ago

### Reproducible example

import io

import polars as pl

pl.read_csv(io.StringIO("f\n6.4E2"), schema_overrides={"f": pl.Decimal(scale=2)})

### Log output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/polars/io/csv/functions.py", line 506, in read_csv
    df = _read_csv_impl(
         ^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/polars/io/csv/functions.py", line 651, in _read_csv_impl
    pydf = PyDataFrame.read_csv(
           ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: conversion from `str` to `decimal[*,2]` failed in column 'f' for 1 out of 1 values: ["6.4E2"]

### Issue description

Note that setting scale to 0 or 1 successfully parses the number, but the result is wrong:

```python
pl.read_csv(io.StringIO("f\n6.4E2"), schema_overrides={"f": pl.Decimal(scale=0)})
shape: (1, 1)
┌──────────────┐
│ f            │
│ ---          │
│ decimal[*,0] │
╞══════════════╡
│ 6            │
└──────────────┘
```

### Expected behavior

The number should be parsed as 640 as a `pl.Decimal`.

### Installed versions

```
--------Version info---------
Polars:              1.10.0
Index type:          UInt32
Platform:            macOS-14.7-arm64-arm-64bit
Python:              3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 16.0.6 ]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec               2024.10.0
gevent
great_tables
matplotlib
nest_asyncio
numpy                2.1.2
openpyxl
pandas
pyarrow              17.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

plaflamme commented 2 hours ago

Note that `str.to_decimal` also fails to parse these numbers:

pl.Series("f", ["6.4", "6.4E2"]).str.to_decimal()
shape: (2,)
Series: 'f' [decimal[*,3]]
[
    6.400
    null
]
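
Until this is fixed, one possible workaround is to expand the scientific notation into plain fixed-point strings before Polars sees them. The sketch below uses only Python's standard `decimal` module. The helper name `to_plain_decimal_string` is hypothetical, not part of Polars, and it assumes every input is a valid decimal literal. Note that `quantize` rounds if the input has more fractional digits than `scale`.

```python
from decimal import Decimal


def to_plain_decimal_string(text: str, scale: int) -> str:
    """Expand scientific notation into a fixed-point string with `scale` fractional digits.

    Hypothetical helper for illustration; not part of the Polars API.
    """
    value = Decimal(text)  # Decimal accepts exponent notation such as "6.4E2"
    quantum = Decimal(1).scaleb(-scale)  # e.g. scale=2 gives Decimal("0.01")
    # The "f" format forces plain notation, so no exponent appears in the result.
    return format(value.quantize(quantum), "f")


raw = ["6.4", "6.4E2"]
normalized = [to_plain_decimal_string(s, scale=2) for s in raw]
print(normalized)  # ['6.40', '640.00']
```

The normalized strings contain no exponent, so they have the same plain form as the `"6.4"` value that `str.to_decimal` parsed in the example above. This only sidesteps the parser; it is not a fix for the underlying bug.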