Open ruivoh opened 2 years ago
Can you please share the entire stack trace?
Hey, sure. Is this what you mean?
trino> SELECT COUNT(*) as count, MIN(event_time), MAX(event_time)
-> FROM table
-> WHERE dt = '2021-11-04'
-> GROUP BY aws_region
-> ORDER BY count DESC;
Query 20211123_131330_00007_p9wmm, FAILED, 1 node
Splits: 150 total, 0 done (0.00%)
2.24 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20211123_131740_00008_p9wmm failed: Error parsing symlinks from: s3://s3-path/_symlink_format_manifest/dt=2021-11-04/hour=16
io.trino.spi.TrinoException: Error parsing symlinks from: s3://s3-path/_symlink_format_manifest/dt=2021-11-04/hour=16
at io.trino.plugin.hive.BackgroundHiveSplitLoader.getTargetPathsFromSymlink(BackgroundHiveSplitLoader.java:928)
at io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:394)
at io.trino.plugin.hive.authentication.UserGroupInformationUtils.lambda$executeActionInDoAs$0(UserGroupInformationUtils.java:29)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1710)
at io.trino.plugin.hive.authentication.UserGroupInformationUtils.executeActionInDoAs(UserGroupInformationUtils.java:27)
at io.trino.plugin.hive.authentication.ImpersonatingHdfsAuthentication.doAs(ImpersonatingHdfsAuthentication.java:39)
at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:95)
at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:392)
at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:345)
at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:269)
at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
at io.trino.$gen.Trino_359____20211122_182204_2.run(Unknown Source)
at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.FileNotFoundException: File s3://s3-path/_symlink_format_manifest/dt=2021-11-04/hour=16 does not exist.
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:698)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:625)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:473)
at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:280)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1884)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1912)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1962)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1956)
at io.trino.plugin.hive.BackgroundHiveSplitLoader.getTargetPathsFromSymlink(BackgroundHiveSplitLoader.java:915)
... 17 more
@Praveen2112 if you need more info let me know and I'll gladly provide.
Tested on AWS EMR 6.4.0 + Trino 359
If Glue has a partition created but its path on S3 does not exist, queries fail to run.
Scenario: Partitions for every hour (0-24) are created in Glue for dt = 2021-10-01, but S3 only has data/folders for hours 9 and 13 on that same day.
Query:
SELECT * FROM table WHERE dt = '2021-10-01';
Error:
Error parsing symlinks from: s3://path-to-bucket/_symlink_format_manifest/dt=2021-10-01/hour=12
Config used:
{
  "Classification": "trino-connector-hive",
  "Properties": {
    "hive.metastore": "glue",
    "hive.ignore-absent-partitions": "true"
  }
}
I can see this function is the one triggering the error: https://github.com/trinodb/trino/blob/deaff227dea6be94d851345848ec828b53c6b1aa/plugin/trino-hive/src/main/java/io/trino/plugin/hive/BackgroundHiveSplitLoader.java#L918
And the inputFormat is indeed SymlinkTextInputFormat. The function doesn't seem to take ignoreAbsentPartitions into consideration.
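To illustrate the behaviour I would expect, here is a minimal sketch of a symlink-manifest reader that honours an ignore-absent flag. This is not Trino's code: the class `SymlinkTargets`, the method `targetPaths`, and the use of `java.nio.file` on a local directory (instead of the Hadoop `FileSystem` on S3) are all assumptions made so the example is self-contained. Only the error message text is taken from the stack trace above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Trino: reads every manifest file in a
// symlink directory and returns the data file paths listed inside them.
final class SymlinkTargets {
    private SymlinkTargets() {}

    static List<String> targetPaths(Path symlinkDir, boolean ignoreAbsentPartitions) {
        // Assumed behaviour: a missing manifest directory means the partition
        // is absent, so it contributes no splits when the flag is set.
        if (ignoreAbsentPartitions && Files.notExists(symlinkDir)) {
            return List.of();
        }

        List<String> targets = new ArrayList<>();
        try (Stream<Path> manifests = Files.list(symlinkDir)) {
            for (Path manifest : (Iterable<Path>) manifests::iterator) {
                if (!Files.isRegularFile(manifest)) {
                    continue;
                }
                // Each non-blank line of a manifest is one target data file.
                for (String line : Files.readAllLines(manifest)) {
                    if (!line.isBlank()) {
                        targets.add(line.trim());
                    }
                }
            }
        }
        catch (IOException e) {
            // Without the flag, a missing directory still fails the query,
            // with the same message as in the reported error.
            throw new UncheckedIOException("Error parsing symlinks from: " + symlinkDir, e);
        }
        return targets;
    }
}
```

With the flag set, a partition registered in Glue but missing on S3 would simply produce an empty list; with the flag unset, the current failure is preserved.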
Tested with both the hive.ignore-absent-partitions config and the session property hive.ignore_absent_partitions.
The same query runs successfully on Presto and Athena using the same Glue catalog.