trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0

TrinoS3FileSystem incorrectly determines directory when using MinIO with GCS #10317

Open kmanamcheri opened 2 years ago

kmanamcheri commented 2 years ago

I am using MinIO Gateway in front of GCS. I have a Hive external table which I created manually (with partitions).

My partition path is as follows: s3a://<redacted>/2021/12/14. There are a bunch of Parquet files in that directory. I am now trying to use the Trino Hive connector to access this table. However, Trino keeps throwing this exception:

io.trino.spi.TrinoException: s3a://<redacted>/2021/12/14 is not a valid Parquet file

Digging a little deeper into this, I think I have found the problem.

However, this works in Hive. Digging into the Hive codebase, I found that the directory-detection logic in the Hadoop/Hive world is:

  // A name represents a directory if and only if it is non-empty
  // and ends with '/'.
  public static boolean objectRepresentsDirectory(final String name) {
    return !name.isEmpty()
        && name.charAt(name.length() - 1) == '/';
  }
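For reference, the check above classifies keys purely by their trailing character, which is why the partition path as reported in the error (no trailing slash) is treated as a file. A minimal standalone illustration (class name hypothetical):

```java
public class DirectoryCheckDemo {
    // Same rule as the Hadoop method quoted above: a non-empty name
    // ending in '/' represents a directory.
    public static boolean objectRepresentsDirectory(final String name) {
        return !name.isEmpty()
            && name.charAt(name.length() - 1) == '/';
    }

    public static void main(String[] args) {
        // A directory placeholder key with the trailing slash:
        System.out.println(objectRepresentsDirectory("2021/12/14/")); // true
        // The same path as it appears in the Trino error, no slash:
        System.out.println(objectRepresentsDirectory("2021/12/14"));  // false
        System.out.println(objectRepresentsDirectory(""));            // false
    }
}
```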

Should Trino also use the Hive method of determining whether a given object is a directory? Or does the community recommend fixing this in MinIO Gateway?

[update] This issue is related to https://github.com/trinodb/trino/issues/569

[update] To clarify: I don't think the problem is with MinIO. The issue is with GCS, which has no concept of directories.

❯ gsutil stat gs://<redacted>/2021/12/14

gs://<redacted>/2021/12/14:
    Creation time:    <redacted>
    Update time:      <redacted>
    Storage class:    <redacted>
    Content-Length:   0
    Content-Type:     application/octet-stream
    Hash (crc32c):    <redacted>
    Hash (md5):       <redacted>
    ETag:             <redacted>
    Generation:       <redacted>
    Metageneration:   <redacted>

WTa-hash commented 2 years ago

We are running into a similar issue with Trino 364.

The Parquet files in our S3 bucket are created by Apache Hudi (0.7.0-amzn-1) using AWS EMR 5.33.1 with Spark 2.4.7, and we are using AWS Glue as our Hive metastore. Our tables are also partitioned by date (s3://.../yyyy-mm-dd/*.parquet).

After submitting a simple SELECT query, we encounter the following error:

io.trino.spi.TrinoException: s3://[REDACTED]/2021-06-30 is not a valid Parquet File
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:278)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:164)
    at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:286)
    at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:175)
    at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
    at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:68)
    at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:308)
    at io.trino.operator.Driver.processInternal(Driver.java:388)
    at io.trino.operator.Driver.lambda$processFor$9(Driver.java:292)
    at io.trino.operator.Driver.tryWithLock(Driver.java:685)
    at io.trino.operator.Driver.processFor(Driver.java:285)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1078)
    at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
    at io.trino.$gen.Trino_364____20220119_155955_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.trino.parquet.ParquetCorruptionException: s3://[REDACTED]/2021-06-30 is not a valid Parquet File
    at io.trino.parquet.ParquetValidationUtils.validateParquet(ParquetValidationUtils.java:26)
    at io.trino.parquet.reader.MetadataReader.readFooter(MetadataReader.java:89)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:213)
    ... 17 more

If I list all the files within the AWS S3 partition folder (s3://[REDACTED]/2021-06-30), I see that the folder 2021-06-30 is listed as a file (see below).

> aws s3 ls s3://[REDACTED]/2021-06-30 --recursive

2021-12-16 18:42:26          0 [REDACTED]/2021-06-30/
2021-06-30 00:04:16         93 [REDACTED]/2021-06-30/.hoodie_partition_metadata
2021-08-18 14:37:29   10461390 [REDACTED]/2021-06-30/6b148b7d-0bc5-4fac-9381-edb37c9a5689-0_0-629814-5584811_20210818143708.parquet
2021-06-30 00:04:15          0 [REDACTED]/2021-06-30_$folder$

Inspection of the s3://[REDACTED]/2021-06-30/ object reveals its content type is application/octet-stream and not application/x-directory.

> aws s3api head-object --bucket [REDACTED] --key [REDACTED]/2021-06-30/

{
    "AcceptRanges": "bytes",
    "ContentType": "application/octet-stream",
    "LastModified": "Thu, 16 Dec 2021 18:42:26 GMT",
    "ContentLength": 0,
    "VersionId": "[REDACTED]",
    "ETag": "\"[REDACTED]\"",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

I am wondering if AWS introduced an S3-level change affecting directories around the 2021-12-15 time frame. Querying this table partition works if I move all files from the old directory to a new directory with content type application/x-directory.

hashhar commented 2 years ago

cc: @findepi I think we can have a more lenient check in addition to the existing media-type check. Something like content-length == 0 && key.endsWith("/")?

I am unable to think of a case where we could have a valid zero-byte file with a name ending in /.
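A rough sketch of what that combined check could look like (names are illustrative, not Trino's actual internals; the media-type comparison stands in for the existing application/x-directory check):

```java
// Illustrative sketch of the proposed lenient check; class and method
// names are hypothetical and do not match TrinoS3FileSystem's code.
public class S3DirectoryHeuristic {
    static final String DIRECTORY_MEDIA_TYPE = "application/x-directory";

    public static boolean isDirectory(String key, String contentType, long contentLength) {
        // Existing behavior: trust an explicit directory media type.
        if (DIRECTORY_MEDIA_TYPE.equals(contentType)) {
            return true;
        }
        // Proposed lenient fallback: a zero-byte object whose key ends
        // with '/' is almost certainly a directory placeholder.
        return contentLength == 0 && key.endsWith("/");
    }

    public static void main(String[] args) {
        // The Hudi/EMR placeholder from the report above: zero bytes,
        // content type application/octet-stream.
        System.out.println(isDirectory("2021-06-30/", "application/octet-stream", 0)); // true
        // A zero-byte object without the trailing slash stays a file.
        System.out.println(isDirectory("2021-06-30/empty.parquet", "application/octet-stream", 0)); // false
    }
}
```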

kmanamcheri commented 2 years ago

+1 @hashhar

FWIW, I do think Trino is doing the right thing... but fortunately or unfortunately, the Hadoop/Hive-style conventions are widely adopted, and the right decision at this point is to follow them so as not to confuse users.

findepi commented 2 years ago

cc @losipiuk

steveloughran commented 3 months ago

Go with "/" means directory, regardless of length. If client-side encryption is used, it may be > 0 bytes long. And if people were putting data in there, well, it's likely to get deleted during some housekeeping operation.
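The suggestion, in other words, is to drop the length condition entirely. A quick contrast of the two heuristics on a non-empty encrypted marker object (names and the 16-byte size are hypothetical):

```java
public class SuffixOnlyCheck {
    // steveloughran's rule: a trailing '/' alone marks a directory.
    public static boolean isDirectorySuffixOnly(String key) {
        return key.endsWith("/");
    }

    // The stricter variant discussed earlier in the thread.
    public static boolean isDirectoryZeroByte(String key, long contentLength) {
        return contentLength == 0 && key.endsWith("/");
    }

    public static void main(String[] args) {
        // With client-side encryption a marker object can be non-empty,
        // so only the suffix-only rule still treats it as a directory.
        System.out.println(isDirectorySuffixOnly("2021-06-30/"));   // true
        System.out.println(isDirectoryZeroByte("2021-06-30/", 16)); // false
    }
}
```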