pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.55k stars 1.98k forks source link

Roadmap for `Decimal` Type? (missing features like `round`; currently duckdb/spark) #19630

Open Julian-J-S opened 3 weeks ago

Julian-J-S commented 3 weeks ago

Description

Problem

Many (financial) applications/transformations require "correctness" that cannot be achieved using float but only decimal. One big missing feature is round on decimal (currently "InvalidOperationError: round can only be used on numeric types"). But there are a few other quirks that show that Decimal is not "production ready" yet. (also documented that it is experimental) This is unfortunate because otherwise polars is just soooooo good ❤️ Currently I am either using pyspark or when I am in-process use duckdb (requires duckdb+pyarrow which is quite big) for rounding and going back to polars.

Question

Is there any roadmap for the Decimal type? I remember long ago that someone said "once 1.0 is there we focus on those things" 😃 Not meaning to put on any pressure but if there is no timeline for Decimal probably many people (including me) might use a different engine like duckdb that does the whole transformation/calculation. Would love to look into implementing this myself but not sure if I find time and dont have much experience with decimal details. 😇

some issues:

19756 (null or incorrect results)

15151 (round support)

19889 (fill_null not supported)

15154

15153

coastalwhite commented 3 weeks ago

We have quite some interest in pushing this forward. I have some stuff on my stack still, but I might take a look after that.

coastalwhite commented 2 weeks ago

Apart from round are there specific things that need to be addressed surrounding the Decimal type at the moment?

cmdlineluser commented 2 weeks ago

Some aggregations don't seem to be implemented in a groupby context and return null:

Julian-J-S commented 2 weeks ago

Apart from round are there specific things that need to be addressed surrounding the Decimal type at the moment?

Just checked my "decimal_problems-Notebook" and almost everything is actually fixed, awesome! 👍 👏

Short overview:

round not imlemented

pl.DataFrame({"x": ["1.23"]}, schema={"x": pl.Decimal}).with_columns(r=pl.col("x").round(decimals=2))

# InvalidOperationError: round can only be used on numeric types

"Incorrect" / No rounding from str to Decimal

pl.DataFrame({"x": ["0.04", "0.05", "0.06"]}).with_columns(
    d=pl.col("x").cast(pl.Decimal(scale=1)),
)

┌──────┬──────────────┐
│ x    ┆ d            │
│ ---  ┆ ---          │
│ str  ┆ decimal[*,1] │
╞══════╪══════════════╡
│ 0.04 ┆ 0.0          │
│ 0.05 ┆ 0.0          │
│ 0.06 ┆ 0.0          │
└──────┴──────────────┘

Other libararies like duckdb round here and not trucate

duckdb.sql("""select x, x::decimal(18,1) as d from df""")

┌─────────┬───────────────┐
│    x    │       d       │
│ varchar │ decimal(18,1) │
├─────────┼───────────────┤
│ 0.04    │           0.0 │
│ 0.05    │           0.1 │
│ 0.06    │           0.1 │
└─────────┴───────────────┘

Multiply Decimal col by int should not increase scale

(
    pl.DataFrame(
        {"d": ["0.1"], "i": [2]},
        schema_overrides={"d": pl.Decimal},
    ).with_columns(
        mult1=pl.col("d") * 2,
        mult2=pl.col("d") * pl.col("i"),
    )
)

┌──────────────┬─────┬──────────────┬──────────────┐
│ d            ┆ i   ┆ mult1        ┆ mult2        │
│ ---          ┆ --- ┆ ---          ┆ ---          │
│ decimal[*,1] ┆ i64 ┆ decimal[*,2] ┆ decimal[*,2] │
╞══════════════╪═════╪══════════════╪══════════════╡
│ 0.1          ┆ 2   ┆ 0.20         ┆ 0.20         │
└──────────────┴─────┴──────────────┴──────────────┘

Expectation: scale after multiplication is still 1 (only for INT)

Document that only str->Decimal has prec/scale inference but float->Decimal gets always truncated (38,0)

pl.DataFrame(
    {
        "float": [3.45],
        "str": ["3.45"],
    },
).with_columns(
    pl.all().cast(pl.Decimal).name.suffix("_decimal"),
)

┌───────┬──────┬───────────────┬──────────────┐
│ float ┆ str  ┆ float_decimal ┆ str_decimal  │
│ ---   ┆ ---  ┆ ---           ┆ ---          │
│ f64   ┆ str  ┆ decimal[38,0] ┆ decimal[*,2] │
╞═══════╪══════╪═══════════════╪══════════════╡
│ 3.45  ┆ 3.45 ┆ 3             ┆ 3.45         │
└───────┴──────┴───────────────┴──────────────┘
max-muoto commented 2 weeks ago

cum_sum not being supported has always been an issue: https://github.com/pola-rs/polars/issues/14745

Would echo the core point, this will be a huge unlock in those building financial applications on top of Polars.

Julian-J-S commented 1 week ago

Apart from round are there specific things that need to be addressed surrounding the Decimal type at the moment?

also added #19756 to the list where Decimal calculation results in null or incorrect values instead of raising