ion-elgreco closed this issue 1 month ago
I'm not sure if I'm a huge fan of just silently returning null columns for missing columns - what if you misspelled a column?
Perhaps this could be some sort of opt-in option? Do you know why DataFusion and PyArrow chose this default behavior?
I am already very happy if it's just an opt-in at the plan level for the Scan execution! You likely won't ever run into a misspelled column, because the reader_schema is constructed from the Delta table log, which always enforces the correct and latest version of the data schema.
Not sure why it's the default, but if this weren't possible, then in theory you would have to rewrite all old parquet files just to add a null column, so that you could read them alongside newer parquet files that contain the additional column.
I also have this use case, but it should be opt-in. Sometimes newer parquet files have more columns and you don't want to touch old files to add them. I have this in a non-Delta extract layer and currently have to read the metadata in an extra step to know which columns are in which file (roughly the sketch below).
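For context, that "extra step" workaround looks something like this (a minimal sketch using pyarrow; the file paths are hypothetical):

```python
import pyarrow.parquet as pq

# Workaround: read each file's footer first to learn which columns it contains.
paths = ["extract/part-0.parquet", "extract/part-1.parquet"]  # hypothetical paths
for path in paths:
    file_schema = pq.read_schema(path)  # reads only the footer, not the data
    print(path, file_schema.names)      # e.g. ['foo'] vs ['foo', 'bar']
```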
The reader schema should belong to the file. If you want to add null columns when the schemas don't match, that should happen in the compute layer on top of that.
@ritchie46 that wouldn't work though: the way Delta Lake works, there is a single schema for the table, but each individual parquet file may contain only a subset of the columns due to how schema evolution works (see the sketch below).
If the reader_schema had to match exactly how each parquet file is structured, you would have to query every file's metadata before you could read it.
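In other words, the authoritative schema already exists in the table log. A sketch with the deltalake Python package (the table path is hypothetical):

```python
from deltalake import DeltaTable

# The Delta log stores a single table-level schema; that is the natural
# reader_schema for a scan. Individual data files may have been written
# under an older schema and contain only a subset of these columns.
dt = DeltaTable("path/to/table")  # hypothetical table location
print(dt.schema())                # authoritative schema from the log
print(dt.files())                 # data files, possibly with older schemas
```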
DataFusion and PyArrow have no problem reading parquet tables with mixed schemas as long as you provide a top-level schema to read the dataset with.
See the DataFusion docs: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileScanConfig.html#structfield.file_schema

> Schema before projection is applied. It contains **all columns that may** appear in the files. It does not include table partition columns that may be added.
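To illustrate the PyArrow side of that claim, a minimal sketch (file names and data are made up): when a dataset is opened with an explicit schema, columns absent from a file come back as nulls.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# The table-level schema: files written before 'bar' existed simply lack it.
schema = pa.schema([("foo", pa.string()), ("bar", pa.int64())])

# With an explicit schema, PyArrow fills the missing 'bar' column with
# nulls for the older file instead of raising an error.
table = ds.dataset(["old.parquet", "new.parquet"], schema=schema).to_table()
print(table)
```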
If you fetch the metadata of the file you can get the file schema. That can be used.
My whole point is that you can avoid that if you know the correct schema a priori.
Closed as completed via https://github.com/pola-rs/polars/pull/18922
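For reference, the change appears to have landed as an opt-in flag on the parquet scan; a usage sketch (the flag name is taken from the linked PR and may differ in later versions):

```python
import polars as pl

# Opt in to null-filling missing columns instead of erroring
# (flag as introduced by the PR above; verify against your polars version).
lf = pl.scan_parquet(["old.parquet", "new.parquet"], allow_missing_columns=True)
print(lf.collect())
```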
Checks
Reproducible example
Create 2 parquet files: one file having schema {"foo": "Utf8"}, the other having {"foo": "Utf8", "bar": "Int64"}.
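A runnable sketch of this setup (file names are arbitrary; the behavior described is as of the version this issue was filed against):

```python
import polars as pl

# File 1: schema {"foo": Utf8}
pl.DataFrame({"foo": ["a"]}).write_parquet("old.parquet")
# File 2: schema {"foo": Utf8, "bar": Int64}
pl.DataFrame({"foo": ["b"], "bar": [1]}).write_parquet("new.parquet")

# Scanning both together fails because the schemas differ; the request is
# that 'bar' be filled with nulls for old.parquet instead.
print(pl.scan_parquet(["old.parquet", "new.parquet"]).collect())
```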
Log output
Issue description
I am trying to make schema-evolved Delta tables readable with polars-deltalake; however, Polars does not automatically create null arrays for columns that are missing from a parquet file when you read it with a reader schema.
Expected behavior
When you read a parquet file with a reader_schema and only a subset of the columns is available in the file, Polars should create null arrays for the missing columns with their respective types.
This is also the behavior of DataFusion and PyArrow when you scan multiple parquet files with a provided schema.
Installed versions