trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Cannot read Delta Lake table with a checkpoint created by the python deltalake package #18760

Open rgelsi opened 1 year ago

rgelsi commented 1 year ago

The checkpoint created by the Python Delta Lake package cannot be read by Trino (Version 424).

I noticed that the checkpoint parquet file created by the Python package has a different schema: the fields are arranged differently than in the checkpoint created by PySpark. The content of the row containing the metadata is identical.

io.trino.spi.TrinoException: Error opening Hive split s3a://testbucket/test_table.delta/_delta_log/00000000000000000023.checkpoint.parquet (offset=0, length=35109): null value in entry: [metadata, name]=null
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:306)
    at io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointEntryIterator.<init>(CheckpointEntryIterator.java:184)
    at io.trino.plugin.deltalake.transactionlog.TableSnapshot.getCheckpointTransactionLogEntries(TableSnapshot.java:228)
    at io.trino.plugin.deltalake.transactionlog.TableSnapshot.getCheckpointTransactionLogEntries(TableSnapshot.java:192)
    at io.trino.plugin.deltalake.transactionlog.TransactionLogAccess.getEntries(TransactionLogAccess.java:377)
    at io.trino.plugin.deltalake.transactionlog.TransactionLogAccess.getEntries(TransactionLogAccess.java:400)
    at io.trino.plugin.deltalake.transactionlog.TransactionLogAccess.getMetadataEntry(TransactionLogAccess.java:202)
    at io.trino.plugin.deltalake.DeltaLakeMetadata.getTableHandle(DeltaLakeMetadata.java:476)
    at io.trino.plugin.deltalake.DeltaLakeMetadata.getTableHandle(DeltaLakeMetadata.java:295)
    at io.trino.spi.connector.ConnectorMetadata.getTableHandle(ConnectorMetadata.java:132)
    at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.getTableHandle(ClassLoaderSafeConnectorMetadata.java:1121)
    at io.trino.tracing.TracingConnectorMetadata.getTableHandle(TracingConnectorMetadata.java:146)
    at io.trino.metadata.MetadataManager.lambda$getTableHandle$5(MetadataManager.java:291)
    at java.base/java.util.Optional.flatMap(Optional.java:289)
    at io.trino.metadata.MetadataManager.getTableHandle(MetadataManager.java:285)
    at io.trino.metadata.MetadataManager.getRedirectionAwareTableHandle(MetadataManager.java:1691)
    at io.trino.metadata.MetadataManager.getRedirectionAwareTableHandle(MetadataManager.java:1683)
    at io.trino.tracing.TracingMetadata.getRedirectionAwareTableHandle(TracingMetadata.java:1331)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.getTableHandle(StatementAnalyzer.java:5453)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.visitTable(StatementAnalyzer.java:2226)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.visitTable(StatementAnalyzer.java:493)
    at io.trino.sql.tree.Table.accept(Table.java:60)
    at io.trino.sql.tree.AstVisitor.process(AstVisitor.java:27)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.process(StatementAnalyzer.java:512)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.analyzeFrom(StatementAnalyzer.java:4512)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.visitQuerySpecification(StatementAnalyzer.java:2989)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.visitQuerySpecification(StatementAnalyzer.java:493)
    at io.trino.sql.tree.QuerySpecification.accept(QuerySpecification.java:155)
    at io.trino.sql.tree.AstVisitor.process(AstVisitor.java:27)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.process(StatementAnalyzer.java:512)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.process(StatementAnalyzer.java:520)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.visitQuery(StatementAnalyzer.java:1508)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.visitQuery(StatementAnalyzer.java:493)
    at io.trino.sql.tree.Query.accept(Query.java:107)
    at io.trino.sql.tree.AstVisitor.process(AstVisitor.java:27)
    at io.trino.sql.analyzer.StatementAnalyzer$Visitor.process(StatementAnalyzer.java:512)
    at io.trino.sql.analyzer.StatementAnalyzer.analyze(StatementAnalyzer.java:472)
    at io.trino.sql.analyzer.StatementAnalyzer.analyze(StatementAnalyzer.java:461)
    at io.trino.sql.analyzer.Analyzer.analyze(Analyzer.java:96)
    at io.trino.sql.analyzer.Analyzer.analyze(Analyzer.java:85)
    at io.trino.execution.SqlQueryExecution.analyze(SqlQueryExecution.java:270)
    at io.trino.execution.SqlQueryExecution.<init>(SqlQueryExecution.java:205)
    at io.trino.execution.SqlQueryExecution$SqlQueryExecutionFactory.createQueryExecution(SqlQueryExecution.java:844)
    at io.trino.dispatcher.LocalDispatchQueryFactory.lambda$createDispatchQuery$0(LocalDispatchQueryFactory.java:153)
    at io.trino.$gen.Trino_424____20230821_115019_2.call(Unknown Source)
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.NullPointerException: null value in entry: [metadata, name]=null
    at com.google.common.collect.CollectPreconditions.checkEntryNotNull(CollectPreconditions.java:33)
    at com.google.common.collect.ImmutableMapEntry.<init>(ImmutableMapEntry.java:54)
    at com.google.common.collect.ImmutableMap.entryOf(ImmutableMap.java:341)
    at com.google.common.collect.ImmutableMap$Builder.put(ImmutableMap.java:450)
    at com.google.common.collect.CollectCollectors.lambda$toImmutableMap$7(CollectCollectors.java:196)
    at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
    at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:992)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
    at io.trino.parquet.reader.TrinoColumnIndexStore.loadIndexes(TrinoColumnIndexStore.java:146)
    at io.trino.parquet.reader.TrinoColumnIndexStore.getOffsetIndex(TrinoColumnIndexStore.java:119)
    at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.applyPredicate(ColumnIndexFilter.java:182)
    at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:164)
    at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:57)
    at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:445)
    at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:87)
    at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:82)
    at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149)
    at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:82)
    at io.trino.parquet.reader.ParquetReader.calculateFilteredRowRanges(ParquetReader.java:559)
    at io.trino.parquet.reader.ParquetReader.<init>(ParquetReader.java:196)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.lambda$createPageSource$2(ParquetPageSourceFactory.java:286)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:477)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:288)
    ... 50 more

Schema of checkpoint parquet file created by the Python Delta Lake package:

root
 |-- metaData: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- schemaString: string (nullable = true)
 |    |-- createdTime: long (nullable = true)
 |    |-- partitionColumns: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- configuration: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- format: struct (nullable = true)
 |    |    |-- provider: string (nullable = true)
 |    |    |-- options: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |-- protocol: struct (nullable = true)
 |    |-- minReaderVersion: integer (nullable = true)
 |    |-- minWriterVersion: integer (nullable = true)
 |-- txn: struct (nullable = true)
 |    |-- appId: string (nullable = true)
 |    |-- version: long (nullable = true)
 |-- add: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- size: long (nullable = true)
 |    |-- modificationTime: long (nullable = true)
 |    |-- dataChange: boolean (nullable = true)
 |    |-- stats: string (nullable = true)
 |    |-- partitionValues: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- tags: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- stats_parsed: struct (nullable = true)
 |    |    |-- numRecords: long (nullable = true)
 |    |    |-- minValues: struct (nullable = true)
 |    |    |    |-- task_fk: string (nullable = true)
 |    |    |    |-- erp: long (nullable = true)
 |    |    |    |-- gqs: long (nullable = true)
 |    |    |    |-- timestamp: timestamp_ntz (nullable = true)
 |    |    |    |-- site: string (nullable = true)
 |    |    |-- maxValues: struct (nullable = true)
 |    |    |    |-- task_fk: string (nullable = true)
 |    |    |    |-- erp: long (nullable = true)
 |    |    |    |-- gqs: long (nullable = true)
 |    |    |    |-- timestamp: timestamp_ntz (nullable = true)
 |    |    |    |-- site: string (nullable = true)
 |    |    |-- nullCount: struct (nullable = true)
 |    |    |    |-- task_fk: long (nullable = true)
 |    |    |    |-- erp: long (nullable = true)
 |    |    |    |-- gqs: long (nullable = true)
 |    |    |    |-- timestamp: long (nullable = true)
 |    |    |    |-- site: long (nullable = true)
 |-- remove: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- deletionTimestamp: long (nullable = true)
 |    |-- dataChange: boolean (nullable = true)
 |    |-- extendedFileMetadata: boolean (nullable = true)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+----+----+------+
|metaData                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |protocol|txn |add |remove|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+----+----+------+
|{841e8772-7925-41e6-8998-56edf8f36c35, null, null, {"type":"struct","fields":[{"name":"col1","type":"string","nullable":true,"metadata":{}},{"name":"col2","type":"long","nullable":true,"metadata":{}},{"name":"col3","type":"long","nullable":true,"metadata":{}},{"name":"col4","type":"timestamp","nullable":true,"metadata":{}},{"name":"col5","type":"string","nullable":true,"metadata":{}}]}, 1689033614464, [], {delta.logRetentionDuration -> interval 7 days}, {parquet, {}}}      |null    |null|null|null  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+----+----+------+

Schema of checkpoint parquet file created by PySpark:

root
 |-- txn: struct (nullable = true)
 |    |-- appId: string (nullable = true)
 |    |-- version: long (nullable = true)
 |    |-- lastUpdated: long (nullable = true)
 |-- add: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- partitionValues: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- size: long (nullable = true)
 |    |-- modificationTime: long (nullable = true)
 |    |-- dataChange: boolean (nullable = true)
 |    |-- tags: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- deletionVector: struct (nullable = true)
 |    |    |-- storageType: string (nullable = true)
 |    |    |-- pathOrInlineDv: string (nullable = true)
 |    |    |-- offset: integer (nullable = true)
 |    |    |-- sizeInBytes: integer (nullable = true)
 |    |    |-- cardinality: long (nullable = true)
 |    |    |-- maxRowIndex: long (nullable = true)
 |    |-- stats: string (nullable = true)
 |-- remove: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- deletionTimestamp: long (nullable = true)
 |    |-- dataChange: boolean (nullable = true)
 |    |-- extendedFileMetadata: boolean (nullable = true)
 |    |-- partitionValues: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- size: long (nullable = true)
 |    |-- deletionVector: struct (nullable = true)
 |    |    |-- storageType: string (nullable = true)
 |    |    |-- pathOrInlineDv: string (nullable = true)
 |    |    |-- offset: integer (nullable = true)
 |    |    |-- sizeInBytes: integer (nullable = true)
 |    |    |-- cardinality: long (nullable = true)
 |    |    |-- maxRowIndex: long (nullable = true)
 |-- metaData: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- format: struct (nullable = true)
 |    |    |-- provider: string (nullable = true)
 |    |    |-- options: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |-- schemaString: string (nullable = true)
 |    |-- partitionColumns: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- configuration: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- createdTime: long (nullable = true)
 |-- protocol: struct (nullable = true)
 |    |-- minReaderVersion: integer (nullable = true)
 |    |-- minWriterVersion: integer (nullable = true)
 |    |-- readerFeatures: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- writerFeatures: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
+----+----+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|txn |add |remove|metaData                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |protocol|
+----+----+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|null|null|null  |{841e8772-7925-41e6-8998-56edf8f36c35, null, null, {parquet, {}}, {"type":"struct","fields":[{"name":"col1","type":"string","nullable":true,"metadata":{}},{"name":"col2","type":"long","nullable":true,"metadata":{}},{"name":"col3","type":"long","nullable":true,"metadata":{}},{"name":"col4","type":"timestamp","nullable":true,"metadata":{}},{"name":"col5","type":"string","nullable":true,"metadata":{}}]}, [], {delta.logRetentionDuration -> interval 7 days}, 1689033614464}      |null    |
+----+----+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
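For reference, the schema trees and metadata rows above were dumped roughly like this (a sketch assuming an active Spark session with access to the checkpoint file; the s3a path is the one from the error message):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Checkpoint file path taken from the Trino error message above
checkpoint = "s3a://testbucket/test_table.delta/_delta_log/00000000000000000023.checkpoint.parquet"

df = spark.read.parquet(checkpoint)
df.printSchema()

# Only one row in this checkpoint carries the metaData action
df.filter("metaData is not null").show(truncate=False)
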
ebyhr commented 1 year ago

Can you attach the archived directory instead of text so that we can reproduce the issue easily? Also, is it possible to reproduce this with Spark SQL?

rgelsi commented 1 year ago

Code to reproduce:

from deltalake import DeltaTable
from deltalake.writer import write_deltalake
import pandas as pd

path = "/path/to/test_table.delta"

# Write a small two-column DataFrame as a Delta table
data = {
    'col1': ['a', 'b'],
    'col2': [1, 2]
}
df = pd.DataFrame.from_dict(data)
write_deltalake(path, df)

# Create a checkpoint with the Python deltalake package
dt = DeltaTable(path)
dt.create_checkpoint()

test_table.delta.zip

I can read the table without any problems with Spark SQL.

Also, if I generate a checkpoint with Spark after creating the checkpoint through the Python package, the table can be read again with Trino.
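For completeness, the Spark SQL check looks roughly like this (a sketch; the session config assumes the delta-spark package is on the classpath, and the path is the one from the repro above):

from pyspark.sql import SparkSession

# Configure a Spark session with Delta Lake support
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Spark reads the table fine even with the deltalake-written checkpoint in place
spark.sql("SELECT * FROM delta.`/path/to/test_table.delta`").show()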

ebyhr commented 1 year ago

Thanks, I could reproduce the issue on my laptop. Looking into the details.

ebyhr commented 1 year ago

Disabling the parquet.use-column-index config property and restarting the cluster works as a workaround. There is a delta.parquet_use_column_index session property, but it is not applied when reading checkpoint parquet files.
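
For anyone who needs the workaround spelled out, the catalog configuration would look roughly like this (a sketch; the file name and metastore URI are placeholders for your deployment):

# etc/catalog/delta.properties (example file name)
connector.name=delta_lake
hive.metastore.uri=thrift://example.net:9083
# Workaround: disable the Parquet column index so checkpoint files
# written by the Python deltalake package can be read
parquet.use-column-index=false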