trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.42k stars 3k forks source link

Unable to query for iceberg table if there are same field names in top level and nested struct #15393

Closed bluzy closed 1 year ago

bluzy commented 1 year ago

We are using Trino 387.

I tried to use Iceberg connector, but some queries are failed with Multiple entries with same key error.

For verify the problem, I've created a table for test as below using spark sql:

-- HiveCatalog: dev

create table dev.test_db.iceberg_trino_test (
 id string,
 nested1 struct<id: string>,
 nested2 struct<id: string>
) using iceberg;

insert into dev.test_db.iceberg_trino_test (id, nested1, nested2)
 values ('0', named_struct('id', '1'), named_struct('id', '2'));

Then tried to query on Trino, and got error. (The query works well on Hive)

select * from test_db.iceberg_trino_test
where nested1.id='1' and nested2.id='2';

Query failed (#20221214_094933_00231_ngidj): Error opening Iceberg split hdfs://warehouse/test_db.db/iceberg_trino_test/data/00000-0-9c521b08-195f-4246-bbec-064bc007fe65-00001.parquet (offset=4, length=1029): Multiple entries with same key: [id] required binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[2]}] ] and [id] required binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[1]}] ] java.lang.IllegalArgumentException: Multiple entries with same key: [id] required binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[2]}] ] and [id] required binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[1]}] ]

I think that iceberg connector aware just last field name for get field-id, not including whole nested structure.

findepi commented 1 year ago

For the record i tried to reproduce this with Trino alone

USE iceberg.tpch;
CREATE TABLE t(foo row(x int), bar row(x int));
INSERT INTO t VALUES (ROW(10), ROW(11));

SELECT * FROM t WHERE foo.x = 10 AND bar.x = 11;

but it succeeded

USE
CREATE TABLE
INSERT: 1 row

Query 20221214_154533_00045_jube3, FINISHED, 3 nodes
[...]

  foo   |  bar
--------+--------
 {x=10} | {x=11}
(1 row)

(Same when i add top-level field with same name too) (Same when I use names & types from the issue description).

I guess Spark writer may be writing more stats and we fail to process them. cc @raunaqmorarka

@bluzy can you share the full stacktrace and also the offending file?

bluzy commented 1 year ago

@findepi I tested on table created on trino, it works. I checked differences of my table and trino generated table, and I found file formats are diffenrent. Trino table uses ORC, but spark table uses PARQUET (by default).

I created table with parquet format, same error occured.

CREATE TABLE iceberg_trino_test_3(id varchar, nested1 row(id varchar), nested2 row(id varchar)) WITH (format = 'PARQUET');
INSERT INTO iceberg_trino_test_3 VALUES ('1', ROW('2'), ROW('3'));

SELECT * FROM iceberg_trino_test_3 WHERE nested1.id = '2' AND nested2.id = '3';

[2022-12-15 09:44:49] java.lang.IllegalArgumentException: Multiple entries with same key: [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[3]}] ] and [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[2]}] ]

The stack trace is:

io.trino.spi.TrinoException: Error opening Iceberg split hdfs://warehouse/test_db.db/iceberg_trino_test_3/data/20221215_004437_00117_ngidj-6c2075ed-9c42-4a6f-b85e-76e6cf77e988.parquet (offset=0, length=451): Multiple entries with same key: [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[3]}] ] and [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[2]}] ]
    at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:915)
    at io.trino.plugin.iceberg.IcebergPageSourceProvider.createDataPageSource(IcebergPageSourceProvider.java:402)
    at io.trino.plugin.iceberg.IcebergPageSourceProvider.createPageSource(IcebergPageSourceProvider.java:279)
    at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
    at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:68)
    at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:268)
    at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:196)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:338)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
    at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$3(WorkProcessorUtils.java:219)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
    at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$4(WorkProcessorUtils.java:234)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:150)
    at io.trino.operator.Driver.processInternal(Driver.java:410)
    at io.trino.operator.Driver.lambda$process$10(Driver.java:313)
    at io.trino.operator.Driver.tryWithLock(Driver.java:698)
    at io.trino.operator.Driver.process(Driver.java:305)
    at io.trino.operator.Driver.processForDuration(Driver.java:276)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1092)
    at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
    at io.trino.$gen.Trino_387____20221205_072421_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Multiple entries with same key: [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[3]}] ] and [id] optional binary id (STRING) = 1=[ SortedRangeSet[type=varchar, ranges=1, {[2]}] ]
    at com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:377)
    at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:371)
    at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:241)
    at com.google.common.collect.RegularImmutableMap.fromEntryArrayCheckingBucketOverflow(RegularImmutableMap.java:132)
    at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:94)
    at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:573)
    at com.google.common.collect.ImmutableMap$Builder.buildOrThrow(ImmutableMap.java:601)
    at io.trino.plugin.iceberg.IcebergPageSourceProvider.getParquetTupleDomain(IcebergPageSourceProvider.java:1038)
    at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:809)
    ... 31 more
bluzy commented 1 year ago

Below query is okay, so I changed title clearly. the problem seems to be occurred between top-level field and nested field

CREATE TABLE iceberg_trino_test_4(nested1 row(id varchar), nested2 row(id varchar)) WITH (format = 'PARQUET');
INSERT INTO iceberg_trino_test_4 VALUES (ROW('2'), ROW('3'));

SELECT * FROM iceberg_trino_test_4 WHERE nested1.id = '2' AND nested2.id = '3';

We are using Trino 387

bluzy commented 1 year ago

https://github.com/trinodb/trino/pull/15408 pr may fix this issue.

ebyhr commented 1 year ago

@bluzy I confirmed #15408 fixes the failure. Do you want to contribute to add a test?

bluzy commented 1 year ago

@ebyhr Thank you, I think current test case is enough. :)

I guess trino doesn't support predicate pushdown for nested field, right? Is there some plan to support it?

ebyhr commented 1 year ago

@bluzy Actually, it's not enough. @pajaks @mx123 Are you interested in adding the test?

Is there some plan to support it?

https://github.com/trinodb/trino/issues/8759

findepi commented 1 year ago

Actually, it's not enough.

@ebyhr can you please elaborate on what you'd want to have covered?

ebyhr commented 1 year ago

Duplicated names failure mentioned in https://github.com/trinodb/trino/issues/15393#issuecomment-1352413417 isn't covered by #15408

bluzy commented 1 year ago

@ebyhr My CLA request is confirmed just now. I'll contribute some test cases for this issue.

ebyhr commented 1 year ago

@pajaks is working on #15437. Could you contribute to other issues?

bluzy commented 1 year ago

@ebyhr Ah i missed it. I'll find other issue later :) thank you.