pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`File out of specification: The max_value of statistics MUST be plain encoded` when writing nested parquet with Rust engine #17948

Closed · theelderbeever closed this 1 month ago

theelderbeever commented 1 month ago

Checks

Reproducible example

import polars as pl

pl.DataFrame(
    [
        {
            "struct": {
                "struct": {
                    "struct": {"a": None},
                    # Some field following the struct field is necessary; its type seems irrelevant.
                    "str": "hello",
                    # "i64": 123456789,
                    # "bool": False,
                },
            }
        },
    ]
).write_parquet("womp.parquet")

Log output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
/var/folders/72/8wtgnd0963gfwzpb16q229340000gn/T/ipykernel_32301/1738335881.py in ?()
     10                 },
     11             }
     12         },
     13     ]
---> 14 ).write_parquet("discount.parquet")

~/.pyenv/versions/3.11.8/envs/billing-platform-pipelines/lib/python3.11/site-packages/polars/dataframe/frame.py in ?(self, file, compression, compression_level, statistics, row_group_size, data_page_size, use_pyarrow, pyarrow_options, partition_by, partition_chunk_size_bytes)
   3626 
   3627             if isinstance(partition_by, str):
   3628                 partition_by = [partition_by]
   3629 
-> 3630             self._df.write_parquet(
   3631                 file,
   3632                 compression,
   3633                 compression_level,

ComputeError: parquet: File out of specification: The max_value of statistics MUST be plain encoded

Issue description

Polars' default (Rust) parquet engine fails with a metadata-statistics error that does not occur with `use_pyarrow=True`.

Expected behavior

Both of Polars' parquet writers (the native Rust writer and the pyarrow writer) should be able to write the same dataframe.

Installed versions

```
--------Version info---------
Polars:              1.3.0
Index type:          UInt32
Platform:            macOS-14.5-arm64-arm-64bit
Python:              3.11.8 (main, Apr 27 2024, 07:50:56) [Clang 15.0.0 (clang-1500.3.9.4)]
----Optional dependencies----
adbc_driver_manager: 1.1.0
cloudpickle:         2.2.1
connectorx:          0.3.3
deltalake:
fastexcel:
fsspec:              2023.12.2
gevent:
great_tables:
hvplot:
matplotlib:          3.8.4
nest_asyncio:        1.6.0
numpy:               1.26.4
openpyxl:
pandas:              2.2.2
pyarrow:             17.0.0
pydantic:            2.5.3
pyiceberg:
sqlalchemy:          2.0.31
torch:
xlsx2csv:
xlsxwriter:
```
theelderbeever commented 1 month ago

Additionally, the following structure can be toggled between two separate errors by commenting out a single line:

  1. With `"has_more": False,` present:

     `PanicException: the offset of the new Buffer cannot exceed the existing length`

  2. With `# "has_more": False,` commented out:

     `ComputeError: parquet: File out of specification: The max_value of statistics MUST be plain encoded`

import polars as pl

pl.DataFrame(
    [
        {
            "items": {
                "data": [
                    {
                        "plan": {
                            "tiers": [
                                {
                                    "up_to": None,
                                }
                            ],
                            "tiers_mode": "volume",
                        },
                    },
                    {
                        "plan": {
                            "tiers": [
                                {
                                    "up_to": None,
                                }
                            ],
                            "tiers_mode": "volume",
                        },
                    },
                ],
                "has_more": False,  # comment this line to get a buffer size error
            }
        }
    ]
).write_parquet("items.parquet")
fzyzcjy commented 1 month ago

I'm seeing the same error; here is a reproduction:

import polars as pl

print(pl.__version__)

df = pl.DataFrame([
    {
        'a': {
            'b': [{'c': 'x'}],
            'd': 10,
        }
    }
])
print(df.dtypes)
df.write_parquet('/tmp/a.parquet')