Open cgbur opened 1 week ago
I believe we fixed this recently. @coastalwhite can confirm.
No, this is not yet resolved. It requires filtering the item name when we convert to a Parquet schema. Since just mapping all items named `item` to `element` seems quite naive, I did not immediately solve this.
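To make the concern about a blanket rename concrete, here is a minimal sketch in plain Python. The `Field` class and both functions are hypothetical stand-ins invented for illustration; they are not Polars or Arrow APIs, and this is not how the Polars writer is implemented. The sketch assumes the rename can be decided from the parent type: only the single child of a list is renamed, so a user column that happens to be called `item` inside a struct is left alone.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical, simplified schema node; NOT a Polars or Arrow type.
@dataclass(frozen=True)
class Field:
    name: str
    kind: str  # "primitive", "list", or "struct"
    children: Tuple["Field", ...] = ()


def naive_rename(f: Field) -> Field:
    """Rename every field called "item", wherever it appears."""
    name = "element" if f.name == "item" else f.name
    return Field(name, f.kind, tuple(naive_rename(c) for c in f.children))


def list_aware_rename(f: Field, parent_is_list: bool = False) -> Field:
    """Name the single child of a list "element"; leave all other names alone."""
    name = "element" if parent_is_list else f.name
    children = tuple(list_aware_rename(c, f.kind == "list") for c in f.children)
    return Field(name, f.kind, children)


# A list of structs in which the struct has a user column named "item".
schema = Field("orders", "list", (
    Field("item", "struct", (
        Field("item", "primitive"),
        Field("qty", "primitive"),
    )),
))

naive = naive_rename(schema)
aware = list_aware_rename(schema)

# The naive version also renames the user's struct column.
print(naive.children[0].children[0].name)
# The list-aware version renames only the list child.
print(aware.children[0].name, aware.children[0].children[0].name)
```

The design choice here is to key the rename on the parent being a list rather than on the string `item`, which is one way to avoid touching user-defined field names.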
Following. Experiencing the same issue where list type columns in Polars cannot be used by PyIceberg (via PyArrow).
Will this be resolved soon (even if the solution is potentially naive), or is there a workaround?
Checks
Reproducible example
Log output
No response
Issue description
When generating Parquet files using Polars with `use_pyarrow=False` (using the Polars Parquet writer), the list element field name is set to `item` instead of `element`. This appears to be non-compliant with the Parquet specification for nested types.
According to the Parquet specification, the correct field name for the single item in a LIST should be `element`.
This can cause issues when working with other libraries or tools that expect Parquet files to follow the specification. For example, when trying to add these files to an Apache Iceberg table using pyiceberg, it results in errors due to the unexpected field name.
Expected behavior
The issue arises because when writing out Parquet files, the schema data types are converted to Arrow format here:
https://github.com/pola-rs/polars/blob/main/crates/polars-core/src/datatypes/dtype.rs#L575
However, perhaps the confusion arises because in Arrow, the List single element name is often `item`, not `element`.
https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message
I do not think we want to change the default for all Arrow conversions. Perhaps we could add another flag, similar to `pl_flavor`, for `is_parquet`, and then accordingly set the `item` strings to `element` so the resulting Parquet file matches the spec.
Installed versions