pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Decimals: string to decimal fails if more decimal places than "scale" #13987

Open Julian-J-S opened 8 months ago

Julian-J-S commented 8 months ago

Reproducible example

pl.DataFrame({
    'x': ['1', '1.23', '1.2345'],
}).with_columns(
    d=pl.col("x").cast(pl.Decimal(scale=2)),
)

Log output

ComputeError: conversion from `str` to `decimal[*,2]` failed in column 'x' for 1 out of 3 values: ["1.2345"]

Issue description

Converting from str to decimal fails if the str values contain more decimal places than the specified "scale" of the Decimal.

Expected behavior

This works in other libraries (e.g. pyspark) and should also work in polars. Polars also already supports casting the f64 value 1.2345 to the Decimal 1.23.

pyspark: image


Note: I am opening a lot of Decimal issues (Decimals are required for many financial calculations, for which I currently use pyspark). I heard a bigger Decimal update is coming soon? Not sure about the roadmap, but happy to help.

Installed versions

0.20.5

petrosbar commented 7 months ago

Not all libraries handle it the same. For example, in pyarrow this would raise:

import pyarrow as pa
import pyarrow.compute as pc

pc.cast(pa.array(["1.2345"]), pa.decimal128(5, 2))
# ...
# ArrowInvalid: Rescaling Decimal128 value would cause data loss

In any case, it seems that this has been fixed since polars==0.20.6.

import polars as pl

pl.__version__
# '0.20.6'

pl.DataFrame({
    'x': ['1', '1.23', '1.2345'],
}).with_columns(
    d=pl.col("x").cast(pl.Decimal(scale=2)),
)

Output:

  shape: (3, 2)
┌────────┬──────────────┐
│ x      ┆ d            │
│ ---    ┆ ---          │
│ str    ┆ decimal[*,2] │
╞════════╪══════════════╡
│ 1      ┆ 1            │
│ 1.23   ┆ 1.23         │
│ 1.2345 ┆ 1.23         │
└────────┴──────────────┘

Julian-J-S commented 7 months ago

@petrosbar thanks for the info! You are right, it works now.

Maybe we should add a dedicated to_decimal with an allow_data_loss / allow_decimal_truncation parameter that lets users explicitly opt in or out of this behavior 🤔
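
To make the opt-in idea concrete, here is a minimal sketch in plain Python using the standard-library decimal module. It is not polars code and not an existing polars API: the standalone to_decimal function and its allow_data_loss flag are hypothetical names borrowed from the suggestion above, and truncation (ROUND_DOWN) is an assumption, since the thread does not say whether extra digits should be truncated or rounded.

```python
from decimal import Decimal, ROUND_DOWN


def to_decimal(text: str, scale: int, allow_data_loss: bool = False) -> Decimal:
    """Parse `text` into a Decimal with exactly `scale` fractional digits.

    Hypothetical helper: raises unless the caller opts in to losing digits.
    """
    value = Decimal(text)
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal("0.01")
    # Assumption: extra digits are truncated rather than rounded.
    truncated = value.quantize(quantum, rounding=ROUND_DOWN)
    if truncated != value and not allow_data_loss:
        raise ValueError(f"{text!r} has more than {scale} decimal places")
    return truncated


print(to_decimal("1", 2))                             # 1.00
print(to_decimal("1.2345", 2, allow_data_loss=True))  # 1.23
```

With the default allow_data_loss=False, to_decimal("1.2345", 2) raises a ValueError, which mirrors the strict pyarrow behavior shown earlier in the thread.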

matquant14 commented 3 weeks ago

Any update on this? I'm trying to cast a string value to a decimal, but it returns NULL values.

image

I'm on polars 1.81 image

cmdlineluser commented 3 weeks ago

@matquant14 It's a bit easier to help if you post an example as runnable code.

It seems commas need to be stripped from strings in order for .cast / .str.to_decimal to work properly.

>>> pl.select(pl.lit("1,000.12").str.to_decimal())
shape: (1, 1)
┌──────────────┐
│ literal      │
│ ---          │
│ decimal[*,2] │
╞══════════════╡
│ null         │
└──────────────┘
>>> pl.select(pl.lit("1,000.12").str.replace_all(r",", "").str.to_decimal())
shape: (1, 1)
┌──────────────┐
│ literal      │
│ ---          │
│ decimal[*,2] │
╞══════════════╡
│ 1000.12      │
└──────────────┘

(It may be useful for the docs to explain this?)

The original example in this issue also seems to work now.

pl.DataFrame({
    'x': ['1', '1.23', '1.2345'],
}).with_columns(
    d=pl.col("x").cast(pl.Decimal(scale=2)),
)

# shape: (3, 2)
# ┌────────┬──────────────┐
# │ x      ┆ d            │
# │ ---    ┆ ---          │
# │ str    ┆ decimal[*,2] │
# ╞════════╪══════════════╡
# │ 1      ┆ 1.00         │
# │ 1.23   ┆ 1.23         │
# │ 1.2345 ┆ 1.23         │
# └────────┴──────────────┘

matquant14 commented 3 weeks ago

Ahh, thanks @cmdlineluser. I thought that was being done under the hood, but that was my oversight. I got it to work now. Appreciate the feedback.