Closed bluzy closed 1 year ago
For the record i tried to reproduce this with Trino alone
USE iceberg.tpch;
CREATE TABLE t(foo row(x int), bar row(x int));
INSERT INTO t VALUES (ROW(10), ROW(11));
SELECT * FROM t WHERE foo.x = 10 AND bar.x = 11;
but it succeeded
USE
CREATE TABLE
INSERT: 1 row
Query 20221214_154533_00045_jube3, FINISHED, 3 nodes
[...]
foo | bar
--------+--------
{x=10} | {x=11}
(1 row)
(Same when i add top-level field with same name too) (Same when I use names & types from the issue description).
I guess Spark writer may be writing more stats and we fail to process them. cc @raunaqmorarka
@bluzy can you share the full stacktrace and also the offending file?
@findepi
I tested on table created on trino, it works.
I checked differences of my table and trino generated table, and I found file formats are diffenrent.
Trino table uses ORC
, but spark table uses PARQUET
(by default).
I created table with parquet format, same error occured.
CREATE TABLE iceberg_trino_test_3(id varchar, nested1 row(id varchar), nested2 row(id varchar)) WITH (format = 'PARQUET');
INSERT INTO iceberg_trino_test_3 VALUES ('1', ROW('2'), ROW('3'));
SELECT * FROM iceberg_trino_test_3 WHERE nested1.id = '2' AND nested2.id = '3';
[2022-12-15 09:44:49] java.lang.IllegalArgumentException: Multiple entries with same key: [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[3]}] ] and [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[2]}] ]
The stack trace is:
io.trino.spi.TrinoException: Error opening Iceberg split hdfs://warehouse/test_db.db/iceberg_trino_test_3/data/20221215_004437_00117_ngidj-6c2075ed-9c42-4a6f-b85e-76e6cf77e988.parquet (offset=0, length=451): Multiple entries with same key: [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[3]}] ] and [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[2]}] ]
at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:915)
at io.trino.plugin.iceberg.IcebergPageSourceProvider.createDataPageSource(IcebergPageSourceProvider.java:402)
at io.trino.plugin.iceberg.IcebergPageSourceProvider.createPageSource(IcebergPageSourceProvider.java:279)
at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:68)
at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:268)
at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:196)
at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:338)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$3(WorkProcessorUtils.java:219)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$4(WorkProcessorUtils.java:234)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:150)
at io.trino.operator.Driver.processInternal(Driver.java:410)
at io.trino.operator.Driver.lambda$process$10(Driver.java:313)
at io.trino.operator.Driver.tryWithLock(Driver.java:698)
at io.trino.operator.Driver.process(Driver.java:305)
at io.trino.operator.Driver.processForDuration(Driver.java:276)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1092)
at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
at io.trino.$gen.Trino_387____20221205_072421_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Multiple entries with same key: [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[3]}] ] and [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[2]}] ]
at com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:377)
at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:371)
at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:241)
at com.google.common.collect.RegularImmutableMap.fromEntryArrayCheckingBucketOverflow(RegularImmutableMap.java:132)
at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:94)
at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:573)
at com.google.common.collect.ImmutableMap$Builder.buildOrThrow(ImmutableMap.java:601)
at io.trino.plugin.iceberg.IcebergPageSourceProvider.getParquetTupleDomain(IcebergPageSourceProvider.java:1038)
at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:809)
... 31 more
Below query is okay, so I changed title clearly. the problem seems to be occurred between top-level field and nested field
CREATE TABLE iceberg_trino_test_4(nested1 row(id varchar), nested2 row(id varchar)) WITH (format = 'PARQUET');
INSERT INTO iceberg_trino_test_4 VALUES (ROW('2'), ROW('3'));
SELECT * FROM iceberg_trino_test_4 WHERE nested1.id = '2' AND nested2.id = '3';
We are using Trino 387
https://github.com/trinodb/trino/pull/15408 pr may fix this issue.
@bluzy I confirmed #15408 fixes the failure. Do you want to contribute to add a test?
@ebyhr Thank you, I think current test case is enough. :)
I guess trino doesn't support predicate pushdown for nested field, right? Is there some plan to support it?
@bluzy Actually, it's not enough. @pajaks @mx123 Are you interested in adding the test?
Is there some plan to support it?
Actually, it's not enough.
@ebyhr can you please elaborate on what you'd want to have covered?
Duplicated names failure mentioned in https://github.com/trinodb/trino/issues/15393#issuecomment-1352413417 isn't covered by #15408
@ebyhr My CLA request is confirmed just now. I'll contribute some test cases for this issue.
@pajaks is working on #15437. Could you contribute to other issues?
@ebyhr Ah i missed it. I'll find other issue later :) thank you.
We are using Trino 387.
I tried to use Iceberg connector, but some queries are failed with
Multiple entries with same key
error.For verify the problem, I've created a table for test as below using spark sql:
Then tried to query on Trino, and got error. (The query works well on Hive)
Query failed (#20221214_094933_00231_ngidj): Error opening Iceberg split hdfs://warehouse/test_db.db/iceberg_trino_test/data/00000-0-9c521b08-195f-4246-bbec-064bc007fe65-00001.parquet (offset=4, length=1029): Multiple entries with same key: [id] required binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[2]}] ] and [id] required binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[1]}] ] java.lang.IllegalArgumentException: Multiple entries with same key: [id] required binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[2]}] ] and [id] required binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[1]}] ]
I think that iceberg connector aware just last field name for get
field-id
, not including whole nested structure.