pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`Expr.replace` silently fails when `return_dtype=pl.Decimal(scale=BIG)` #15037

Open scur-iolus opened 7 months ago

scur-iolus commented 7 months ago

Checks

Reproducible example

Ideally, the following test should pass, or at least an explicit exception should be raised:

from decimal import Decimal as Dec

import polars as pl

def test_replace_return_dec():
    col = pl.Series(name="my_data", values=["a", "b", "c"])
    mapping = {
        "a": Dec("4.0"),
        "b": Dec("5.0"),
        "c": Dec("6.0"),
    }
    col1 = col.replace(mapping, return_dtype=pl.Decimal(scale=37))
    col2 = col.replace(mapping, return_dtype=pl.Decimal(scale=38))
    assert tuple(col1) == (Dec("4.0"), Dec("5.0"), Dec("6.0"))  # ✅ OK
    # for some reason, the following assertion fails because values have become null
    # no error has been raised, it happened silently!
    assert tuple(col2) == (Dec("4.0"), Dec("5.0"), Dec("6.0"))  # ❌ KO with 1.0.0-beta.1
    # next line explicitly raises an InvalidOperationError since "conversion failed"
    _ = pl.Series(col1, dtype=pl.Decimal(scale=38))

Issue description

I'm aware that pl.Decimal is an "experimental work-in-progress feature and may not work as expected" (doc), but I've noticed unexpected behavior that should, for now, at least raise an error.

When using .replace (doc) with return_dtype=pl.Decimal and a large scale such as scale=100, values are silently converted to null / None: no error is raised and no warning is issued.
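
For illustration, a minimal variant of the example above using scale=100 (the inline comment describes the behavior I observe with 0.20.15):

```
from decimal import Decimal as Dec

import polars as pl

col = pl.Series(name="my_data", values=["a", "b", "c"])
mapping = {"a": Dec("4.0"), "b": Dec("5.0"), "c": Dec("6.0")}

out = col.replace(mapping, return_dtype=pl.Decimal(scale=100))
print(out.to_list())  # [None, None, None] -- every value silently became null
```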

Expected behavior

I assume that an overflow silently occurs... An OverflowError, a ComputeError, a RuntimeError, or a NotImplementedError should probably be thrown instead?
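
In the meantime, here is a sketch of a guard one could use on the user side (a hypothetical helper, not part of polars): the only visible symptom today is that nulls appear, so one can compare null counts before and after the replace.

```
from decimal import Decimal as Dec

import polars as pl

def replace_checked(s: pl.Series, mapping: dict, dtype: pl.DataType) -> pl.Series:
    # Hypothetical helper: fail loudly if the replace silently produced nulls.
    out = s.replace(mapping, return_dtype=dtype)
    if out.null_count() > s.null_count():
        raise OverflowError(f"replace to {dtype} silently produced nulls")
    return out

col = pl.Series(["a", "b", "c"])
mapping = {"a": Dec("4.0"), "b": Dec("5.0"), "c": Dec("6.0")}
replace_checked(col, mapping, pl.Decimal(scale=38))
# raises OverflowError, given the silent-null behavior described above
```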

Installed versions

```
--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             Linux-5.15.0-100-generic-x86_64-with-glibc2.31
Python:               3.11.8 (main, Feb 25 2024, 16:41:26) [GCC 9.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
mcrumiller commented 7 months ago

The reason is not obvious, but I think I have an "almost why" answer, and it involves the limits of the polars Decimal type.

The precision parameter must be at least one more than the scale parameter, because any value less than 1 must be represented by 0.xxxx and the 0 takes up a digit.

The maximum polars decimal value is 10^38-1, which you can see for yourself:

import polars as pl
from decimal import Decimal

pl.Series([Decimal(10**38-1)])  # ok
pl.Series([Decimal(10**38)])    # error
# RuntimeError: BindingsError: "Decimal is too large to fit in Decimal128"
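
To make the precision/scale bookkeeping concrete, here is a quick sketch in plain Python of my understanding (not polars API): with precision total digits and scale fractional digits, the largest representable value is (10**precision - 1) / 10**scale.

```
from decimal import Decimal

def max_value(precision: int, scale: int) -> Decimal:
    # Largest magnitude a decimal(precision, scale) can hold, per the reasoning above.
    return (Decimal(10) ** precision - 1) / Decimal(10) ** scale

print(max_value(38, 37))  # ~9.99... -> values like 1, 4.0, 5.0, 6.0 still fit
print(max_value(38, 38))  # ~0.99... -> anything >= 1 no longer fits
```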

So when you specify scale=37, precision must be at least 38, which means 38 digits. Now, we can cast a value of 1 to a scale of 37:

pl.Series([1]).cast(pl.Decimal(scale=37))
# shape: (1,)
# Series: '' [decimal[38,37]]
# [
#         1
# ]

But a scale of 38 fails:

pl.Series([1]).cast(pl.Decimal(scale=38)) 
# polars.exceptions.ComputeError: conversion from `i64` to `decimal[38,38]` failed in column '' for 1 out of 1 values: [1]

And likewise, a value of 2 fails at precision 37.

Going further: we are most likely hitting the upper limit when we shouldn't, since we can indeed specify these with precision instead of scale and it works:

pl.Series([1], dtype=pl.Decimal).cast(pl.Decimal(precision=100000, scale=37))
# shape: (1,)
# Series: '' [decimal[100000,37]]
# [
#         1
# ]

So, tl;dr: I believe the issue involves the checks on precision/scale bounds, and it sometimes yells and sometimes doesn't.
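
Roughly, the inconsistency looks like this (just restating the snippets above in one place, behavior as observed on 0.20.15):

```
from decimal import Decimal as Dec

import polars as pl

# Quiet: an out-of-range target dtype through `replace` just yields nulls.
col = pl.Series(["a", "b", "c"])
mapping = {"a": Dec("4.0"), "b": Dec("5.0"), "c": Dec("6.0")}
print(col.replace(mapping, return_dtype=pl.Decimal(scale=38)).to_list())
# [None, None, None]

# Loud: the equivalent plain cast raises.
pl.Series([1]).cast(pl.Decimal(scale=38))
# polars.exceptions.ComputeError: conversion from `i64` to `decimal[38,38]` failed ...
```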