kmanamcheri opened 2 years ago
We are running into a similar issue with Trino 364.
The parquet files in our S3 bucket are created by Apache Hudi (0.7.0-amzn-1) using AWS EMR 5.33.1 with Spark 2.4.7, and we are using AWS Glue as our hive metastore. Our tables are also partitioned by date (s3://.../yyyy-mm-dd/*.parquet).
After submitting a simple SELECT SQL query, we encounter the following error:
```
io.trino.spi.TrinoException: s3://[REDACTED]/2021-06-30 is not a valid Parquet File
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:278)
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:164)
	at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:286)
	at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:175)
	at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
	at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:68)
	at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:308)
	at io.trino.operator.Driver.processInternal(Driver.java:388)
	at io.trino.operator.Driver.lambda$processFor$9(Driver.java:292)
	at io.trino.operator.Driver.tryWithLock(Driver.java:685)
	at io.trino.operator.Driver.processFor(Driver.java:285)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1078)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
	at io.trino.$gen.Trino_364____20220119_155955_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.trino.parquet.ParquetCorruptionException: s3://[REDACTED]/2021-06-30 is not a valid Parquet File
	at io.trino.parquet.ParquetValidationUtils.validateParquet(ParquetValidationUtils.java:26)
	at io.trino.parquet.reader.MetadataReader.readFooter(MetadataReader.java:89)
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:213)
	... 17 more
```
If I list all the files within the AWS S3 partition folder (`s3://[REDACTED]/2021-06-30`), I see that the folder `2021-06-30` is itself listed as a file (see below).
```
> aws s3 ls s3://[REDACTED]/2021-06-30 --recursive
2021-12-16 18:42:26          0 [REDACTED]/2021-06-30/
2021-06-30 00:04:16         93 [REDACTED]/2021-06-30/.hoodie_partition_metadata
2021-08-18 14:37:29   10461390 [REDACTED]/2021-06-30/6b148b7d-0bc5-4fac-9381-edb37c9a5689-0_0-629814-5584811_20210818143708.parquet
2021-06-30 00:04:15          0 [REDACTED]/2021-06-30_$folder$
```
Inspection of the `s3://[REDACTED]/2021-06-30/` object reveals that its content type is `application/octet-stream` and not `application/x-directory`.
```
> aws s3api head-object --bucket [REDACTED] --key [REDACTED]/2021-06-30/
{
    "AcceptRanges": "bytes",
    "ContentType": "application/octet-stream",
    "LastModified": "Thu, 16 Dec 2021 18:42:26 GMT",
    "ContentLength": 0,
    "VersionId": "[REDACTED]",
    "ETag": "\"[REDACTED]\"",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}
```
I am wondering if AWS introduced an S3-level change around the 2021-12-15 time frame that affected directory objects.
Querying this table partition works if I move all the files from the old directory to a new directory with content type `application/x-directory`.
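If it helps anyone else, a marker object with the expected content type can presumably be created (or overwritten) directly from the CLI; the bucket and key below are illustrative placeholders mirroring the redacted paths above:

```
> aws s3api put-object --bucket [REDACTED] --key [REDACTED]/2021-06-30/ --content-type application/x-directory
```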
cc: @findepi I think we can have a more lenient check in addition to the existing media-type check. Something like `contentLength == 0 && key.endsWith("/")`? I am unable to think of a case where we could have a valid zero-byte file with a name ending in `/`.
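A minimal sketch of what that check could look like, assuming Guava's `MediaType` (which the existing check already appears to use); the class and method names here are hypothetical, not actual Trino code:

```java
import com.google.common.net.MediaType;

// Hypothetical sketch of the proposed fallback; not actual Trino code.
final class DirectoryMarkers
{
    // mirrors the media type the existing check compares against
    private static final MediaType DIRECTORY_MEDIA_TYPE = MediaType.create("application", "x-directory");

    private DirectoryMarkers() {}

    static boolean isDirectoryMarker(String key, long contentLength, String contentType)
    {
        // existing behavior: the object is explicitly tagged as a directory
        if (MediaType.parse(contentType).is(DIRECTORY_MEDIA_TYPE)) {
            return true;
        }
        // proposed lenient fallback: a zero-byte object whose key ends with "/"
        return contentLength == 0 && key.endsWith("/");
    }
}
```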
+1 @hashhar
FWIW, I do think Trino is doing the right thing... but fortunately or unfortunately, the Hadoop/Hive style of conventions is widely adopted, and the right decision at this point is to follow those conventions so as not to confuse users.
cc @losipiuk
go with "/" means directory regardless of length. if client side encryption is used it may be > 0 bytes long. And if people were putting data in there, well, its likely to get deleted during some housekeeping operation
I am using Minio Gateway in front of GCS. I have a Hive external table which I created manually (with partitions). My partition path is of the form `s3a://<redacted>/2021/12/14`, and there are a bunch of Parquet files in that directory. I am now trying to use the Trino Hive connector to access this table. However, Trino keeps throwing the exception `io.trino.spi.TrinoException: s3a://<redacted>/2021/12/14 is not a valid Parquet file`.
Digging a little deeper into this, I think this is the problem: Trino uses `MediaType.parse(metadata.getContentType()).is(DIRECTORY_MEDIA_TYPE)` to detect whether a given object is a directory, but here the gateway returns `"ContentType": "application/octet-stream"` even for directories. However, this works in Hive. Digging into the Hive codebase, I found that directory detection in the Hadoop/Hive world does not look at the content type at all; the logic is roughly sketched below.
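A rough paraphrase of that convention, assuming the usual Hadoop S3 behavior rather than quoting the exact Hadoop source:

```java
// rough paraphrase of the Hadoop/Hive S3 directory convention, not the exact
// Hadoop source: a zero-byte object whose key ends with "/" is treated as a
// directory marker, and the Content-Type header plays no role
static boolean objectRepresentsDirectory(String key, long size)
{
    return !key.isEmpty() && key.endsWith("/") && size == 0;
}
```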
Should Trino also use the Hive method of determining whether a given object is a directory? Or does the community recommend fixing this in Minio Gateway?
[update] This issue is related to https://github.com/trinodb/trino/issues/569
[update] To clarify: I don't think the problem is with Minio. The issue is with GCS, which does not have a concept of directories.