pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Parquet does not support writing empty structs for Polars 1.10.0 #19352

Open GibbonJojo opened 1 day ago

GibbonJojo commented 1 day ago

Reproducible example

>>> import polars as pl
>>> data = [
...     {"name": "one", "attr": {}},
...     {"name": "two", "attr": {}}
... ]
>>> df = pl.DataFrame(data)
>>> df
shape: (2, 2)
┌──────┬───────────┐
│ name ┆ attr      │
│ ---  ┆ ---       │
│ str  ┆ struct[0] │
╞══════╪═══════════╡
│ one  ┆ {}        │
│ two  ┆ {}        │
└──────┴───────────┘
>>> df.write_parquet("test.pq")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jo/miniconda3/envs/test/lib/python3.9/site-packages/polars/dataframe/frame.py", line 3847, in write_parquet
    self._df.write_parquet(
polars.exceptions.InvalidOperationError: Parquet does not support writing empty structs

Log output

No response

Issue description

This worked in 1.9:

# polars 1.9.0
>>> data = [
...     {"name": "one", "attr": {}},
...     {"name": "two", "attr": {}}
... ]
>>> df = pl.DataFrame(data)
>>> df
shape: (2, 2)
┌──────┬───────────┐
│ name ┆ attr      │
│ ---  ┆ ---       │
│ str  ┆ struct[1] │
╞══════╪═══════════╡
│ one  ┆ null      │
│ two  ┆ null      │
└──────┴───────────┘
>>> df.write_parquet("test.pq")

The dtype for all-empty structs apparently also changed, from struct[1] to struct[0].

There's also a difference in fetching the structs:

# polars 1.10.0
>>> df.row(0)
('one', {})

# polars 1.9.0
>>> df.row(0)
('one', None)

Writing also works when at least one struct/dict is not empty:

# polars 1.10.0
>>> data = [
...     {"name": "one", "attr": {}},
...     {"name": "two", "attr": {"foo": "bar"}}
... ]
>>> df = pl.DataFrame(data)
>>> df
shape: (2, 2)
┌──────┬───────────┐
│ name ┆ attr      │
│ ---  ┆ ---       │
│ str  ┆ struct[1] │
╞══════╪═══════════╡
│ one  ┆ {null}    │
│ two  ┆ {"bar"}   │
└──────┴───────────┘
>>> df.write_parquet("test.pq")

Expected behavior

Polars 1.10 should write the empty structs, just as 1.9 did.

Installed versions

```
>>> pl.show_versions()
--------Version info---------
Polars:              1.10.0
Index type:          UInt32
Platform:            Linux-6.8.0-45-generic-x86_64-with-glibc2.35
Python:              3.9.20 | packaged by conda-forge | (main, Sep 30 2024, 17:49:10) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio
numpy                2.0.2
openpyxl
pandas               2.2.3
pyarrow              17.0.0
pydantic             2.9.2
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

ritchie46 commented 1 day ago

@coastalwhite

coastalwhite commented 1 day ago

I would not call this a regression. Parquet cannot store zero-field structs.

Before, we didn't support structs with zero fields in Polars, so they would automatically be cast to structs with 1 field.
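
Since the write only fails when every value in the column is an empty dict, one possible caller-side workaround is to drop such keys before building the DataFrame. The following is a minimal sketch in plain Python, not something proposed by the maintainers in this thread; the helper name `drop_empty_struct_keys` is hypothetical, and it assumes the input is a list of row dicts like the one in the reproducible example.

```python
def drop_empty_struct_keys(rows):
    """Return (cleaned_rows, dropped_keys) for a list of row dicts.

    A key is dropped when at least one row holds an empty dict for it and
    every other row holds either an empty dict or None / no value at all.
    Hypothetical helper: it only reshapes the input and never calls Polars.
    """
    # Collect keys in first-seen order.
    keys = list(dict.fromkeys(k for row in rows for k in row))

    def is_empty_struct_key(key):
        values = [row.get(key) for row in rows]
        has_empty_dict = any(isinstance(v, dict) and not v for v in values)
        only_empty = all(
            v is None or (isinstance(v, dict) and not v) for v in values
        )
        return has_empty_dict and only_empty

    dropped = [k for k in keys if is_empty_struct_key(k)]
    cleaned = [
        {k: v for k, v in row.items() if k not in dropped} for row in rows
    ]
    return cleaned, dropped


data = [
    {"name": "one", "attr": {}},
    {"name": "two", "attr": {}},
]
cleaned, dropped = drop_empty_struct_keys(data)
print(cleaned)
print(dropped)
```

The cleaned rows can then be passed to `pl.DataFrame(...)` as in the example above. Note that this removes the column entirely, so readers of the Parquet file will not see `attr` at all; whether that is acceptable depends on the use case.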

GibbonJojo commented 1 day ago

I don't think breaking changes should be introduced in minor versions.

Generally, I think it's fine to disallow zero-field structs if Parquet doesn't allow them, although I would personally prefer them to be supported, since they occur regularly in my use case. Either way, I should not expect my code to break when updating minor versions.

Edit: Sorry, sometimes my phone's keyboard randomly closes, so I accidentally clicked on close.

ritchie46 commented 13 hours ago

> I don't think breaking changes should be introduced in minor versions.

You don't have to state the obvious. We understand that. This is due to a bug fix and those have side effects.