trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0

Trino is not ignoring absent partitions for S3 SymlinkTextInputFormat #10041

Open ruivoh opened 2 years ago

ruivoh commented 2 years ago

Tested on AWS EMR 6.4.0 + Trino 359

If Glue has a partition registered but the corresponding path on S3 does not exist, queries fail to run.

Scenario: Partitions for every hour (0-24) are created in Glue for dt = 2021-10-01, but S3 only has data/folders for hours 9 and 13 on that same day.

Query: SELECT * FROM table WHERE dt = '2021-10-01';

Error: Error parsing symlinks from: s3://path-to-bucket/_symlink_format_manifest/dt=2021-10-01/hour=12

Config used: { "Classification" : "trino-connector-hive", "Properties" : { "hive.metastore" : "glue", "hive.ignore-absent-partitions" : "true" } }

I can see this function is the one triggering the error: https://github.com/trinodb/trino/blob/deaff227dea6be94d851345848ec828b53c6b1aa/plugin/trino-hive/src/main/java/io/trino/plugin/hive/BackgroundHiveSplitLoader.java#L918

And the inputFormat is indeed SymlinkTextInputFormat. The function doesn't seem to take ignoreAbsentPartitions into consideration.
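
To illustrate the behaviour I would expect, here is a minimal standalone sketch. It is not Trino's actual code: the class name, the getTargetPaths helper and its ignoreAbsentPartitions parameter are hypothetical, and it uses java.nio on a local directory instead of the Hadoop FileSystem API so that it runs on its own. The idea is only that a missing manifest directory yields an empty list of target paths when the flag is set, and still raises the error otherwise.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class SymlinkManifestSketch {
    // Hypothetical helper (not Trino's implementation): reads every manifest
    // file in symlinkDir and returns the target paths listed in them, one per line.
    static List<String> getTargetPaths(Path symlinkDir, boolean ignoreAbsentPartitions) {
        if (!Files.isDirectory(symlinkDir)) {
            if (ignoreAbsentPartitions) {
                // Assumed behaviour: an absent partition contributes no splits.
                return List.of();
            }
            throw new UncheckedIOException(
                    "Error parsing symlinks from: " + symlinkDir,
                    new FileNotFoundException("File " + symlinkDir + " does not exist."));
        }

        List<String> targets = new ArrayList<>();
        try (DirectoryStream<Path> manifests = Files.newDirectoryStream(symlinkDir)) {
            for (Path manifest : manifests) {
                targets.addAll(Files.readAllLines(manifest));
            }
        }
        catch (IOException e) {
            throw new UncheckedIOException("Error parsing symlinks from: " + symlinkDir, e);
        }
        return targets;
    }

    public static void main(String[] args) {
        // A path that is assumed not to exist locally, mimicking the absent partition.
        Path absent = Paths.get("no-such-dir", "dt=2021-10-01", "hour=12");
        System.out.println(getTargetPaths(absent, true));
    }
}
```

With the flag set to false, the same call throws, which matches the failure reported above.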

Tested with both the hive.ignore-absent-partitions config property and the hive.ignore_absent_partitions session property.

Same query runs successfully on Presto and Athena using same glue catalog.

Praveen2112 commented 2 years ago

Can you please share the entire stacktrace ?

ruivoh commented 2 years ago

> Can you please share the entire stacktrace ?

Hey, sure. Is this what you mean?

trino> SELECT COUNT(*) as count, MIN(event_time), MAX(event_time)
    -> FROM table
    -> WHERE dt = '2021-11-04'
    -> GROUP BY aws_region
    -> ORDER BY count DESC;

Query 20211123_131330_00007_p9wmm, FAILED, 1 node
Splits: 150 total, 0 done (0.00%)
2.24 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20211123_131740_00008_p9wmm failed: Error parsing symlinks from: s3://s3-path/_symlink_format_manifest/dt=2021-11-04/hour=16
io.trino.spi.TrinoException: Error parsing symlinks from: s3://s3-path/_symlink_format_manifest/dt=2021-11-04/hour=16
        at io.trino.plugin.hive.BackgroundHiveSplitLoader.getTargetPathsFromSymlink(BackgroundHiveSplitLoader.java:928)
        at io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:394)
        at io.trino.plugin.hive.authentication.UserGroupInformationUtils.lambda$executeActionInDoAs$0(UserGroupInformationUtils.java:29)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1710)
        at io.trino.plugin.hive.authentication.UserGroupInformationUtils.executeActionInDoAs(UserGroupInformationUtils.java:27)
        at io.trino.plugin.hive.authentication.ImpersonatingHdfsAuthentication.doAs(ImpersonatingHdfsAuthentication.java:39)
        at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:95)
        at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:392)
        at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:345)
        at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:269)
        at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
        at io.trino.$gen.Trino_359____20211122_182204_2.run(Unknown Source)
        at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.FileNotFoundException: File s3://s3-path/_symlink_format_manifest/dt=2021-11-04/hour=16 does not exist.
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:698)
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:625)
        at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:473)
        at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:280)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1884)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1912)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1962)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1956)
        at io.trino.plugin.hive.BackgroundHiveSplitLoader.getTargetPathsFromSymlink(BackgroundHiveSplitLoader.java:915)
        ... 17 more

ruivoh commented 2 years ago

@Praveen2112 if you need more info let me know and I'll gladly provide.