rgelsi opened this issue 1 year ago
Can you attach the archived directory instead of text so that we can reproduce the issue easily? Also, is it possible to reproduce this with Spark SQL?
Code to reproduce:

```python
from deltalake import DeltaTable
from deltalake.writer import write_deltalake
import pandas as pd

path = "/path/to/test_table.delta"

data = {
    'col1': ['a', 'b'],
    'col2': [1, 2]
}
df = pd.DataFrame.from_dict(data)

# Write the table (commit 0) and then create a checkpoint for it.
write_deltalake(path, df)
dt = DeltaTable(path)
dt.create_checkpoint()
```
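To then hit the failing read path, the table is queried from Trino after the checkpoint has been created. Here is a minimal sketch using the `trino` Python client; the host, port, user, and catalog/schema names are assumptions for illustration, and the table would first need to be registered in a Delta Lake catalog (e.g. via the connector's `system.register_table` procedure):

```python
import trino

# Connect to a Trino coordinator; host/port/user and the "delta" catalog
# name are placeholders for your own deployment.
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="test",
    catalog="delta",
    schema="default",
)
cur = conn.cursor()

# Reading the table after deltalake's create_checkpoint() is what fails
# on Trino 424 in this report.
cur.execute("SELECT * FROM test_table")
print(cur.fetchall())
```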
I can read the table without any problems with Spark SQL.
Also, if I generate a checkpoint with Spark after creating the checkpoint through the Python package, the table can be read again with Trino.
Thanks, I could reproduce the issue on my laptop. Looking into the details.
Disabling the `parquet.use-column-index` config property and restarting the cluster will help as a workaround. There is a `delta.parquet_use_column_index` session property, but it's not used when reading checkpoint parquet files.
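For reference, a sketch of that workaround as a catalog configuration change; the file path `etc/catalog/delta.properties` and the catalog name are assumptions that depend on how the Delta Lake catalog is set up in your deployment:

```properties
# etc/catalog/delta.properties (catalog assumed to be named "delta")
connector.name=delta_lake

# Workaround: disable the Parquet column index until checkpoint reads
# honor the session property. Takes effect after a cluster restart.
parquet.use-column-index=false
```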
The checkpoint created by the Python Delta Lake package cannot be read by Trino (version 424).
I saw that the checkpoint parquet file created by the Python package has a different schema; specifically, the fields are arranged differently than in the checkpoint created by PySpark. The content of the row containing the metadata is identical.
Schema of checkpoint parquet file created by the Python Delta Lake package:
Schema of checkpoint parquet file created by PySpark:
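The two schemas can be compared by reading each checkpoint file's Parquet footer directly. A small sketch with pyarrow; the checkpoint file name (version 0, zero-padded to 20 digits) is an assumption based on the single commit in the reproduction above:

```python
import pyarrow.parquet as pq

# Assumed checkpoint path for the table written in the reproduction.
checkpoint = "/path/to/test_table.delta/_delta_log/00000000000000000000.checkpoint.parquet"

# read_schema() only reads the Parquet footer, so this is cheap and shows
# the field ordering that differs between the two writers.
print(pq.read_schema(checkpoint))
```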