pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Parquet file writer uses non-compliant list element field name #17100

Open cgbur opened 1 week ago

cgbur commented 1 week ago

Reproducible example

```python
import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame(
    {
        "a": [[1, 2], [1, 1, 1]],
    }
)

df.write_parquet("example.parquet", use_pyarrow=False)
print("with polars")
print(pq.read_schema("example.parquet"))
print()
df.write_parquet("example.parquet", use_pyarrow=True)
print("with pyarrow")
print(pq.read_schema("example.parquet"))
```

Output:

```
with polars
a: large_list<item: int64>
  child 0, item: int64

with pyarrow
a: large_list<element: int64>
  child 0, element: int64
```

Log output

No response

Issue description

When generating Parquet files with `use_pyarrow=False` (i.e. using the native Polars Parquet writer), the list element field name is set to `item` instead of `element`. This appears to be non-compliant with the Parquet specification for nested types.

According to the Parquet specification, the correct field name for the single item in a `LIST` should be `element`.
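
For reference, this is the 3-level `LIST` structure from the Parquet LogicalTypes specification; the inner field is required to be named `element`:

```
// Parquet spec: 3-level LIST encoding.
// <name> is the list column; the innermost field must be named "element".
<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
```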

This can cause issues when working with other libraries or tools that expect Parquet files to follow the specification. For example, when trying to add these files to an Apache Iceberg table using pyiceberg, it results in errors due to the unexpected field name.

Expected behavior

The issue arises because when writing out Parquet files, the schema data types are converted to arrow format here:

https://github.com/pola-rs/polars/blob/main/crates/polars-core/src/datatypes/dtype.rs#L575

The confusion likely arises because in Arrow, the single child field of a `List` is conventionally named `item`, not `element`:

https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message

```python
import pyarrow as pa

py_list = pa.array([[1, 2, 3], [1, 2]])
print(py_list.type)
```

Output:

```
list<item: int64>
```

I do not think we want to change the default for all Arrow conversions. Perhaps we could add another flag similar to `pl_flavor` (e.g. an `is_parquet` flag) and, when it is set, rename the `item` fields to `element` so the resulting Parquet file matches the spec.

Installed versions

```
--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             Linux-5.10.218-186.862.amzn2int.x86_64-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr  9 2024, 08:09:14) [GCC 13.2.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:           3.8.4
numpy:                1.26.4
openpyxl:
pandas:               2.2.1
pyarrow:              16.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
ritchie46 commented 1 week ago

I believe we fixed this recently. @coastalwhite can confirm.

coastalwhite commented 1 week ago

No, this is not yet resolved. It requires renaming the item field when we convert to a Parquet schema. Since blindly mapping every field named `item` to `element` seems quite naive, I did not immediately solve this.

whichwit commented 2 days ago

Following. I'm experiencing the same issue, where list-type columns in Polars cannot be used by PyIceberg (via PyArrow).

Will this be resolved soon (even if the solution is somewhat naive), or is there a workaround in the meantime?