pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Struct with decimals not read properly in parquet #16692

Open theelderbeever opened 4 months ago

theelderbeever commented 4 months ago

Checks

I marked this as a Python bug because that is where I encountered it; however, I would expect the same bug to exist in Rust.

Reproducible example

This is the smallest reproducible example I could find. Removing ANY row or field, or unnesting the top-level struct, makes the write succeed.

import polars as pl

data = [
    {
        "plan": {
            "metadata": {"a": 1},
            "tiers": [
                {
                    "unit_amount_decimal": "0.0001",
                }
            ],
        }
    },
    {
        "plan": {
            "metadata": {"a": 1},
            "tiers": [
                # {
                #     "unit_amount_decimal": "0",
                # },
                {
                    "unit_amount_decimal": "0.0001",
                },
            ],
        }
    },
]

pl.DataFrame(data).write_parquet("test.parquet")

Table

plan { struct[2] }
{{1},[{"0.0001"}]}
{{1},[{"0.0001"}]}

Log output

❯ RUST_BACKTRACE=1 POLARS_VERBOSE=1 python notebooks/test.py
thread '<unnamed>' panicked at crates/polars-arrow/src/array/struct_/mod.rs:117:52:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("The children must have an equal number of values.\n                         However, the values at index 1 have a length of 3, which is different from values at index 0, 0."))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/taylorbeever/git/quiknode-labs/billing/billing-platform-pipelines/notebooks/test.py", line 35, in <module>
    pl.DataFrame(data).write_parquet("test.parquet")
  File "/Users/taylorbeever/.pyenv/versions/billing-platform-pipelines/lib/python3.11/site-packages/polars/dataframe/frame.py", line 3292, in write_parquet
    self._df.write_parquet(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("The children must have an equal number of values.\n                         However, the values at index 1 have a length of 3, which is different from values at index 0, 0."))

Issue description

I am attempting to write a parquet file of data that I fetched from the Stripe API. The API's JSON response is extremely nested. Writing the data structure in the example fails because the struct's children have differing numbers of values. If use_pyarrow=True is set, the write succeeds.
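For readers unfamiliar with the error, here is a minimal pure-Python sketch of the invariant the panic message describes: every child of a struct column must hold the same number of values. The helper `check_struct_children` is hypothetical and written only for illustration; it is not the polars-arrow implementation, and it only mimics the wording of the message in the log.

```python
from typing import Sequence


def check_struct_children(children: Sequence[Sequence[object]]) -> None:
    """Raise ValueError unless every child column has the same length.

    Hypothetical helper for illustration only; it is not part of Polars.
    """
    if not children:
        return
    expected = len(children[0])
    for index, child in enumerate(children[1:], start=1):
        if len(child) != expected:
            raise ValueError(
                "The children must have an equal number of values. "
                f"However, the values at index {index} have a length of "
                f"{len(child)}, which is different from values at index 0, "
                f"{expected}."
            )


# Two children of equal length satisfy the invariant.
check_struct_children([[1, 1], [["0.0001"], ["0.0001"]]])

# Children of length 0 and 3 mirror the mismatch reported in the log.
try:
    check_struct_children([[], [0, 1, 2]])
except ValueError as exc:
    print(exc)
```

In the failing write, the mismatch presumably arises somewhere while the nested column is being converted for parquet, since the input data itself is well formed.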

From trial and error, it seems to require very specifically a column that is a struct containing a struct field and a list field. Any values deeper than col.struct.{struct,list} don't appear to affect the outcome, and the list can in fact be empty and the write will still fail.
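To make the triggering shape concrete, here is a small sketch that prints the nesting of one row from the example. `describe_shape` is a hypothetical helper written for this illustration; the labels it prints are informal and are not Polars dtypes.

```python
def describe_shape(value: object) -> str:
    """Return a compact description of how a Python value is nested.

    Hypothetical helper for illustration only; it is not part of Polars.
    """
    if isinstance(value, dict):
        fields = ", ".join(
            f"{key}: {describe_shape(item)}" for key, item in value.items()
        )
        return f"struct{{{fields}}}"
    if isinstance(value, list):
        # An empty list gives no information about its element type.
        inner = describe_shape(value[0]) if value else "?"
        return f"list[{inner}]"
    return type(value).__name__


row = {
    "plan": {
        "metadata": {"a": 1},
        "tiers": [{"unit_amount_decimal": "0.0001"}],
    }
}

# The struct holds one struct field and one list field.
print(describe_shape(row["plan"]))
# struct{metadata: struct{a: int}, tiers: list[struct{unit_amount_decimal: str}]}
```

With `"tiers": []` the same call prints `struct{metadata: struct{a: int}, tiers: list[?]}`, which is the empty-list variant mentioned above.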

Expected behavior

Dataframe should write to parquet successfully.

Installed versions

```
--------Version info---------
Polars:               0.20.30
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.11.8 (main, Apr 27 2024, 07:50:56) [Clang 15.0.0 (clang-1500.3.9.4)]
----Optional dependencies----
adbc_driver_manager:  1.0.0
cloudpickle:          2.2.1
connectorx:           0.3.3
deltalake:
fastexcel:
fsspec:               2023.12.2
gevent:
hvplot:               0.9.2
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.5.3
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.29
torch:
xlsx2csv:
xlsxwriter:
```
theelderbeever commented 4 months ago

Also, a write/read round trip with use_pyarrow=True returns data equal to the original:

df = pl.DataFrame(data)
df.write_parquet("test.parquet", use_pyarrow=True)
df == pl.read_parquet("test.parquet")
plan { bool }
true
true
cmdlineluser commented 4 months ago

This is fixed in 0.20.31

theelderbeever commented 4 months ago

@cmdlineluser I completely didn't catch that there was a release just 2 days ago... Just upgraded.

theelderbeever commented 4 months ago

@cmdlineluser Still broken for read operations when the internal values are Decimals AND some other type.

from decimal import Decimal

import polars as pl

print(pl.__version__)

pl.Config.activate_decimals(True)

df = pl.DataFrame(
    [
        {
            "tiers": [
                {
                    "in_tier": 10.0,
                    "overage_cents": Decimal("0E-12"),
                },
                {
                    "in_tier": 0.0,
                    "overage_cents": Decimal("0E-12"),
                },
            ]
        },
        {
            "tiers": [
                {
                    "in_tier": 10.0,
                    "overage_cents": Decimal("0.001000000000"),
                }
            ]
        },
    ]
)

print(df.schema)

df.write_parquet("tiers.parquet")
pl.read_parquet("tiers.parquet")
theelderbeever commented 4 months ago

Additionally, the decimal values inside the struct aren't being written to or read back from the file. Passing use_pyarrow=True during the write correctly writes the decimal values.

from decimal import Decimal

import polars as pl

print(pl.__version__)

pl.Config.activate_decimals(True)

df = pl.DataFrame(
    [
        {
            "tiers": [
                {
                    # "in_tier": 10.0,
                    "overage_cents": Decimal("0E-12"),
                },
                {
                    # "in_tier": 0.0,
                    "overage_cents": Decimal("0E-12"),
                },
            ]
        },
        {
            "tiers": [
                {
                    # "in_tier": 10.0,
                    "overage_cents": Decimal("0.001000000000"),
                }
            ]
        },
    ]
)

print(df.schema)

print(df)

df.write_parquet("tiers.parquet")
print(pl.read_parquet("tiers.parquet"))

"""
0.20.31
OrderedDict([('tiers', List(Struct({'overage_cents': Decimal(precision=None, scale=12)})))])
| tiers                                |
| ---                                  |
| list[struct[1]]                      |
|--------------------------------------|
| [{0.000000000000}, {0.000000000000}] |
| [{0.001000000000}]                   |

| tiers           |
| ---             |
| list[struct[1]] |
|-----------------|
"""
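
As a side note on the inputs, the following standard-library sketch shows why the schema above reports scale=12 for both values: `Decimal("0E-12")` is a zero that carries an exponent of -12, the same exponent as `Decimal("0.001000000000")`. It uses only Python's `decimal` module and says nothing about how Polars stores the values internally.

```python
from decimal import Decimal

zero = Decimal("0E-12")
small = Decimal("0.001000000000")

# The exponent in the decimal tuple is the negated scale.
print(zero.as_tuple().exponent)   # -12
print(small.as_tuple().exponent)  # -12

# str() keeps the scientific form for the zero; "f" formatting expands it.
print(str(zero))          # 0E-12
print(format(zero, "f"))  # 0.000000000000
print(str(small))         # 0.001000000000
```
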
cmdlineluser commented 4 months ago

D'oh - apologies.

Just for reference, the previous report was

(But wasn't decimal related.)

theelderbeever commented 4 months ago

@cmdlineluser no worries. Want me to open a separate issue for decimals specifically?