timvw / qv

Quickly view your data

Failed to map column projection: incompatible data types, list field element vs item #31

Open AlJohri opened 1 year ago

AlJohri commented 1 year ago

I have a table that reads correctly with Spark and the Delta Lake libraries, but I'm having trouble reading it via qv.

Do you know which downstream dependency could be giving me this error?

Error: ArrowError(ExternalError(Execution("Failed to map column projection for field mycolumn. Incompatible data types List(Field { name: \"element\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) and List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })")))

I checked the schema from the delta transaction log and didn't see a hardcoded item or element:

❯ aws s3 cp s3://mybucket/year=2022/month=6/day=9/myprefix/_delta_log/00000000000000000000.json - | head -n 3 | tail -n 1 | jq '.metaData.schemaString | fromjson | .fields[] | select(.name == "mycolumn")'
{
  "name": "mycolumn",
  "type": {
    "type": "array",
    "elementType": "string",
    "containsNull": true
  },
  "nullable": true,
  "metadata": {}
}
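For readers without jq, the same lookup can be sketched with Python's standard library. This is a minimal sketch, not part of the original report: the function name `find_schema_field` and the trimmed sample commit are made up for illustration, and it assumes the commit file is newline-delimited JSON with a `metaData` action whose `schemaString` is itself a JSON string, as the jq filter above implies.

```python
import json


def find_schema_field(log_text, column):
    """Return the schema entry for `column` from a Delta commit file, or None."""
    for line in log_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        meta = action.get("metaData")
        if meta is None:
            continue
        # schemaString is a JSON document embedded as a string (jq's `fromjson`).
        schema = json.loads(meta["schemaString"])
        for field in schema["fields"]:
            if field["name"] == column:
                return field
    return None


# Hypothetical, heavily trimmed commit file used only for illustration.
sample_schema = {
    "type": "struct",
    "fields": [
        {
            "name": "mycolumn",
            "type": {"type": "array", "elementType": "string", "containsNull": True},
            "nullable": True,
            "metadata": {},
        }
    ],
}
sample_log = "\n".join(
    [
        json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
        json.dumps({"metaData": {"id": "example", "schemaString": json.dumps(sample_schema)}}),
    ]
)

field = find_schema_field(sample_log, "mycolumn")
print(json.dumps(field, indent=2))
```

The array type carries only `elementType` and `containsNull`, so the Delta schema itself gives the list's child field no name at all.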

When I look at the schema of a sample Parquet file on S3, I do indeed see that the list's child field is named element:

pqrs schema =(s5cmd cat s3://mybucket/year=2022/month=6/day=9/myprefix/_partition=00001/part-00037-cb2e71c3-4f26-4de0-9e9a-18298489ccdc.c000.snappy.parquet)

...
message spark_schema {
  ...
  OPTIONAL group mycolumn (LIST) {
    REPEATED group list {
      OPTIONAL BYTE_ARRAY element (UTF8);
    }
  }
  ...
}

This exact error is raised here: https://github.com/apache/arrow-datafusion/blob/aad82fbb32dc1bb4d03e8b36297f8c9a3148df89/datafusion/core/src/physical_plan/file_format/mod.rs#L253

And I also see that element is hardcoded in delta-rs here:

https://github.com/delta-io/delta-rs/blob/83b8296fa5d55ebe050b022ed583dc57152221fe/rust/src/delta_arrow.rs#L38-L48 (pr: https://github.com/delta-io/delta-rs/pull/228)

But I can't seem to find where the schema mismatch is coming from.
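To make the failure mode concrete, here is a small model of the two types from the error message. It is only an illustrative sketch: `Field`, `ListType`, and `normalize` are hypothetical names, not DataFusion or delta-rs code, and it assumes the comparison is a strict equality check that includes the child field's name. It does not say which component produced which name.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    name: str
    data_type: object
    nullable: bool = True


@dataclass(frozen=True)
class ListType:
    child: Field


def normalize(data_type, child_name="item"):
    """Rename every list child field to `child_name`, recursing into nested lists."""
    if isinstance(data_type, ListType):
        child = data_type.child
        inner = normalize(child.data_type, child_name)
        return ListType(replace(child, name=child_name, data_type=inner))
    return data_type


# The two types reported in the error, in the order they appear there.
type_a = ListType(Field("element", "Utf8"))
type_b = ListType(Field("item", "Utf8"))

strict_match = type_a == type_b                         # child names are compared
relaxed_match = normalize(type_a) == normalize(type_b)  # child names are ignored
print(strict_match, relaxed_match)
```

Under this model the two types differ only in the child field's name, so any fix has to either make both sides agree on that name or compare the types without it.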

timvw commented 1 year ago

Thanks for the feedback!

I've seen this issue pop up in the past (https://github.com/datafusion-contrib/datafusion-catalogprovider-glue/issues/4#issuecomment-1151236162), but it fell off my radar. It seems this could use a bit more investigation.

timvw commented 1 year ago

When I find some time, this post could help in tracing back the mismatch: https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/

AlJohri commented 1 year ago

@timvw I found the documentation for use_compliant_nested_type for PyArrow helpful for understanding this issue: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html

use_compliant_nested_type : bool, default False

Whether to write compliant Parquet nested type (lists) as defined here, defaults to False. For use_compliant_nested_type=True, this will write into a list with 3-level structure where the middle level, named list, is a repeated group with a single field named element:

   <list-repetition> group <name> (LIST) {
       repeated group list {
           <element-repetition> <element-type> element;
       }
   }

For use_compliant_nested_type=False, this will also write into a list with 3-level structure, where the name of the single field of the middle level list is taken from the element name for nested columns in Arrow, which defaults to item:

   <list-repetition> group <name> (LIST) {
       repeated group list {
           <element-repetition> <element-type> item;
       }
   }
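The rule quoted above can be restated as a tiny helper that builds the leaf column path for a list column. This is a sketch of the documented behavior only, not PyArrow code; `list_leaf_path` and its parameters are hypothetical names chosen for illustration.

```python
def list_leaf_path(column, compliant, arrow_child_name="item"):
    """Return the dotted Parquet path of a list column's leaf field.

    compliant=True  -> the leaf is always named "element".
    compliant=False -> the leaf takes the Arrow child field name
                       (which defaults to "item").
    """
    leaf = "element" if compliant else arrow_child_name
    return ".".join([column, "list", leaf])


compliant_path = list_leaf_path("mycolumn", compliant=True)
legacy_path = list_leaf_path("mycolumn", compliant=False)
print(compliant_path)
print(legacy_path)
```

The Parquet file shown earlier in this thread has the `element` form, so an `item` on the other side of the comparison would be consistent with an Arrow schema that was built with the default child name.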