trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

`HIVE_FILESYSTEM_ERROR` and `HIVE_CURSOR_ERROR` in S3/Athena #19225

Closed roykoand closed 1 year ago

roykoand commented 1 year ago

Hello Team,

Since 9/16 I have been seeing very strange behaviour in AWS Athena when reading ORC/Parquet files from S3. Every day since then I have received errors like these:

HIVE_FILESYSTEM_ERROR: Error reading from s3://XX/YY/ZZZ at position 12345. If a data manifest file was generated at 's3://XXX-***-REGION/QUERY-ID-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account.

Source of the error in the code: https://github.com/trinodb/trino/blob/071c8365faf83aaedfcda889cd2e8a28aab165fe/plugin/trino-hive/src/main/java/io/trino/plugin/hive/orc/HdfsOrcDataSource.java#L81-L87

HIVE_CURSOR_ERROR: Failed to read Parquet file: s3://XX/YY/ZZZ. If a data manifest file was generated at 's3://XXX-***-REGION/QUERY-ID-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account.

Source of the error in the code: https://github.com/trinodb/trino/blob/f8e774a949773399df5fa823186e4a68d79b931d/plugin/trino-hive/src/main/java/io/trino/plugin/hive/parquet/ParquetPageSource.java#L192-L201

The error rate is very low (<0.01%), and the queries mostly succeed when rerun, but it's annoying to have. For several reasons I'm not able to reach out to AWS Support, so what can I do to properly debug, investigate, or fix the issue? In my honest opinion, it's an AWS problem and has nothing to do with the quality of the data. Two days after the issues started popping up (9/18), AWS had an 8-hour networking outage in the us-west-2 region, where my Athena/S3 resources are located. I've run a variety of tests and haven't been able to reproduce the issue.

The Athena engine version is v3. Additional note: the errors usually appear when several processes are running and reading from the same S3 bucket (not necessarily the same S3 prefix).

What are your thoughts on this? What can I do to at least properly diagnose the problem?

Thank you!

electrum commented 1 year ago

Apologies, but this error is specific to Athena, so AWS support would be best able to assist you. My only suggestion, based on the error message, is to consider deleting the manifest.csv file mentioned in the error message (after taking appropriate care to understand the consequences and making a backup copy).

Alternatively, you might try running Trino yourself, or using a different hosted version such as Starburst Galaxy (which has a free tier and its own support).
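
Since the original report notes that failed queries mostly succeed on a rerun, one client-side workaround is to retry only when the failure message starts with one of the two error codes quoted above. The following is a minimal sketch, not an Athena or Trino API: `run_query`, `QueryFailed`, and `run_with_retry` are hypothetical names, and treating these two codes as transient is an assumption based only on the behaviour described in this thread.

```python
import time

# Error codes quoted in this issue; treating them as transient is an assumption.
TRANSIENT_CODES = ("HIVE_FILESYSTEM_ERROR", "HIVE_CURSOR_ERROR")


class QueryFailed(Exception):
    """Hypothetical exception carrying the engine's failure message."""


def is_transient(message):
    # True when the failure message starts with one of the codes above.
    return message.startswith(TRANSIENT_CODES)


def run_with_retry(run_query, sql, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call run_query(sql), retrying only on the transient codes.

    run_query is a hypothetical callable supplied by the caller. It should
    return the query result or raise QueryFailed with the engine's message.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(sql)
        except QueryFailed as exc:
            # Give up on the last attempt or on any other kind of failure.
            if attempt == max_attempts or not is_transient(str(exc)):
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Logging the file path, read position, and timestamp from each retried failure would also give a record to check against the observation that errors appear when several processes read the same bucket.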