Open AlJohri opened 1 year ago
Thanks for the feedback!
I've seen this issue pop up in the past (https://github.com/datafusion-contrib/datafusion-catalogprovider-glue/issues/4#issuecomment-1151236162), but it fell off my radar. It seems this could use a bit more investigation.
When I find some time, this could help in tracing back the mismatch: https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/
@timvw I found the documentation for `use_compliant_nested_type` for PyArrow helpful for understanding this issue: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html

use_compliant_nested_type : bool, default False
Whether to write compliant Parquet nested type (lists) as defined here, defaults to `False`. For `use_compliant_nested_type=True`, this will write into a list with 3-level structure where the middle level, named `list`, is a repeated group with a single field named `element`:

    <list-repetition> group <name> (LIST) {
      repeated group list {
        <element-repetition> <element-type> element;
      }
    }

For `use_compliant_nested_type=False`, this will also write into a list with 3-level structure, where the name of the single field of the middle level `list` is taken from the element name for nested columns in Arrow, which defaults to `item`:

    <list-repetition> group <name> (LIST) {
      repeated group list {
        <element-repetition> <element-type> item;
      }
    }
I have a table that reads correctly using Spark + Delta Lake libraries, but I'm having trouble reading it via `pv`. Do you know which downstream dependency could be giving me this error?
I checked the schema from the delta transaction log and didn't see a hardcoded `item` or `element`.

When I look at the schema of a sample parquet file on s3, I do indeed see that the item in the list is called `element`.
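Not finding either name in the transaction log is what I'd expect: as far as I understand the Delta schema serialization, an array type records only `elementType` and `containsNull`, so the inner Parquet field name is left to whichever library writes or reads the files. Here is a rough sketch of that check using only the standard library. The log line, table id, and column name `tags` are hypothetical stand-ins for a real `_delta_log/*.json` entry.

```python
import json

# Hypothetical metaData action, shaped like one line of a Delta log file.
log_line = json.dumps({
    "metaData": {
        "id": "hypothetical-table-id",
        "schemaString": json.dumps({
            "type": "struct",
            "fields": [{
                "name": "tags",
                "type": {
                    "type": "array",
                    "elementType": "string",
                    "containsNull": True,
                },
                "nullable": True,
                "metadata": {},
            }],
        }),
    }
})


def array_type_keys(line: str) -> dict:
    """Map each top-level array column to the keys present in its type."""
    action = json.loads(line)
    # schemaString is itself a JSON document embedded as a string.
    schema = json.loads(action["metaData"]["schemaString"])
    return {
        field["name"]: sorted(field["type"].keys())
        for field in schema["fields"]
        if isinstance(field["type"], dict)
        and field["type"].get("type") == "array"
    }


print(array_type_keys(log_line))
```

For the sample above this prints `{'tags': ['containsNull', 'elementType', 'type']}`: no key in the array type carries an `item` or `element` name.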
:I see this exact error is from here: https://github.com/apache/arrow-datafusion/blob/aad82fbb32dc1bb4d03e8b36297f8c9a3148df89/datafusion/core/src/physical_plan/file_format/mod.rs#L253
And I also see that `element` is hardcoded in delta-rs here: https://github.com/delta-io/delta-rs/blob/83b8296fa5d55ebe050b022ed583dc57152221fe/rust/src/delta_arrow.rs#L38-L48 (PR: https://github.com/delta-io/delta-rs/pull/228)
But I can't seem to find where the schema mismatch is coming from.