Checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl

# Make sure you extract the delta_table_with_nested.zip attachment
# (the table was generated with Spark 3.4.1).
delta_table_path = "./delta_table_with_nested"

# This scan fails:
print(pl.scan_delta(delta_table_path).collect())
df = pl.DataFrame(
    {
        "id": [1, 2],
        "field1": ["value1", "value2"],
    }
).with_columns(
    pl.lit(1).cast(pl.Int32).alias("field2"),
    pl.col("id").cast(pl.Int32),
)
df = df.with_columns(
    pl.struct(["field1", "field2", pl.lit("x").alias("newcol")]).alias("X")
).select(["id", "X"])

# These writes fail as well. They work fine without the Rust engine, but the
# pyarrow engine is being deprecated by deltalake and has other issues with
# nested writes ("fake" nulls in non-null columns...) that I'm not covering here.
df.write_delta(delta_table_path, delta_write_options={"engine": "rust"}, mode="append")
df.write_delta(delta_table_path, delta_write_options={"engine": "rust"}, mode="overwrite")
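The attached table was generated with Spark 3.4.1; the original generation code isn't reproduced here, so the following is only a minimal hypothetical sketch of how this kind of nested schema evolution can be created (the schemas mirror the example above; it assumes delta-spark is installed and configured):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake extensions are configured

# First write: struct X only has field1 and field2.
df1 = spark.createDataFrame(
    [(1, ("value1", 1))],
    "id INT, X STRUCT<field1: STRING, field2: INT>",
)
df1.write.format("delta").save("./delta_table_with_nested")

# Second write adds X.newcol -> schema evolution inside a nested struct.
df2 = spark.createDataFrame(
    [(2, ("value2", 2, "x"))],
    "id INT, X STRUCT<field1: STRING, field2: INT, newcol: STRING>",
)
(
    df2.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("./delta_table_with_nested")
)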
Issue description
I added a new column inside the X nested struct (a typical schema-evolution case: adding a new field).
Scans in Polars no longer work (they still work in Spark, and in Polars with the workaround below).
Writes in Polars no longer work either (they also work in Spark, and in Polars with the workaround below).
As a workaround I'm doing the following instead of using scan_delta (and also instead of scan_parquet; I tried to use Polars' scan_parquet as much as possible, but I didn't manage to drop the schema when reading the parquet files from Python code, as the Rust side fails if I do so):
import pyarrow.dataset as ds

dl_tbl = _get_delta_lake_table(
    table_path=table_path,
    version=kwargs.get("version", None),
    storage_options=storage_options,
    delta_table_options=kwargs.get("delta_table_options", None),
)
# .... some other code stolen from pl.scan_delta
arrow_dataset = ds.dataset(  # type: ignore
    urls,
    filesystem=fs,
    partitioning=part,
    partition_base_dir=partition_base_dir,
    format="parquet",
)
scan_df = pl.scan_pyarrow_dataset(arrow_dataset)

# This concat fixes the case in which someone added a new column in Delta
# but no parquet file contains data for it yet; it should also solve the
# issue of reading empty dataframes. Similar to allow_missing_columns,
# but it also works for nested structs.
schema_from_parquets = scan_df.collect_schema()
aligned = align_schemas(empty_delta_schema_lf.collect_schema(), schema_from_parquets)
# Cast to fix Timestamp(tz=None) vs. Timestamp(tz="UTC") mismatches.
casted = scan_df.cast(aligned, strict=False)
# diagonal_relaxed adds the missing nested columns to the data.
return pl.concat([empty_delta_schema_lf, casted], how="diagonal_relaxed")
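Here, align_schemas and empty_delta_schema_lf are my own helpers: the empty LazyFrame carries the full Delta schema, and align_schemas merges it with the parquet-derived schema. A minimal sketch of the idea (hypothetical implementation, assuming the Delta schema's dtype should win on any conflict):

def align_schemas(
    delta_schema: pl.Schema, parquet_schema: pl.Schema
) -> dict[str, pl.DataType]:
    # Start from what the parquet files actually contain...
    aligned: dict[str, pl.DataType] = dict(parquet_schema)
    # ...and overwrite any column whose dtype disagrees with the Delta schema
    # (missing nested struct fields, Timestamp tz differences, etc.).
    for name, dtype in delta_schema.items():
        if name in aligned and aligned[name] != dtype:
            aligned[name] = dtype
    return aligned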
I also have this other code as a workaround for the writing bug:
import logging

from deltalake.exceptions import SchemaMismatchError

try:
    batch.write_delta(
        target=table_path,
        mode=mode_batch,
        storage_options=storage_options,
        delta_write_options=delta_write_options,
        **kwargs,
    )
except SchemaMismatchError as e:
    if table.schema != batch.schema:
        raise
    logging.info(
        "Retrying with schema override. "
        f"Got SchemaMismatchError {e} and the schemas are the same. "
        "This usually happens when a new field is added to the table and "
        "no parquet file contains it yet."
    )
    batch.write_delta(
        target=table_path,
        mode=mode_batch,
        storage_options=storage_options,
        delta_write_options={**delta_write_options, "schema_mode": "merge"},
        **kwargs,
    )
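In the snippet above, batch is a polars DataFrame and table is my handle on the target Delta table. One way to make the table.schema == batch.schema comparison meaningful with the raw deltalake package (a sketch, not my exact code; note that DeltaTable.schema is a method there, so it needs converting to a polars schema first):

from deltalake import DeltaTable

# Hypothetical sketch: build a polars-comparable schema for the target table.
table = DeltaTable(table_path, storage_options=storage_options)
# Delta schema -> pyarrow schema -> empty polars DataFrame -> polars schema.
table_schema = pl.from_arrow(table.schema().to_pyarrow().empty_table()).schema

# The retry condition from the snippet above then reads:
schemas_match = table_schema == batch.schema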
Expected behavior
Polars should read the data, filling in null for the parquet files that are missing the new field.
Installed versions