trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0

Trino Iceberg s3 path handling incompatibility when using `s3a://` or `s3n://` paths with `//` #23097

Open sopel39 opened 4 weeks ago

sopel39 commented 4 weeks ago

Introduction

Trino's breaking behaviour can be reproduced as follows:

  1. Create table in Trino with legacy FS:
    trino> create table iceberg.schema.table (x bigint) with (location='s3a://bucket/schema//table')
    CREATE TABLE
  2. trino> show create table iceberg.schema.table;
                                               Create Table
    ----------------------------------------------------------------------------------------------------------
    CREATE TABLE iceberg.schema.table (
       x bigint
    )
    WITH (
       format = 'PARQUET',
       format_version = 2,
       location = 's3a://bucket/schema//table'
    )
  3. Query with native FS
    trino> select * from iceberg.schema.table;
    ...
    Caused by: java.io.FileNotFoundException: s3a://bucket/schema//table/metadata/00002-279c8ac6-0628-4687-aeb5-ef58b53d3b4a.metadata.json
    at io.trino.filesystem.s3.S3InputStream.seekStream(S3InputStream.java:199)
  4. Query data with other engines, e.g: spark:
    spark-sql ()> select * from iceberg.schema.table;
    24/08/21 13:35:46 ERROR SparkSQLDriver: Failed in [select * from iceberg.schema.table]
    org.apache.iceberg.exceptions.ServiceFailureException: Server error: NotFoundException: Location does not exist: s3a://bucket/schema//table/metadata/00002-279c8ac6-0628-4687-aeb5-ef58b53d3b4a.metadata.json

Problem
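The failure in the steps above can be sketched in a few lines. This is hypothetical code, not Trino's: it only illustrates the mismatch this thread suggests, where the legacy FS effectively collapsed the `//` when writing files, while the table metadata kept the literal `s3a://bucket/schema//table` prefix, so the native FS (which reads the location verbatim) looks in a place where no file exists.

```java
// Hypothetical sketch of the root cause: metadata records the location with
// "//", but the files were written under the collapsed single-slash path.
public class DoubleSlashMismatch {
    // collapse runs of slashes in the path part, keeping "scheme://" intact
    static String collapse(String location) {
        int schemeEnd = location.indexOf("://") + 3;
        return location.substring(0, schemeEnd)
                + location.substring(schemeEnd).replaceAll("/{2,}", "/");
    }

    public static void main(String[] args) {
        String declared = "s3a://bucket/schema//table/metadata/00002.metadata.json";
        String actual = collapse(declared);
        // declared: s3a://bucket/schema//table/metadata/00002.metadata.json
        // actual:   s3a://bucket/schema/table/metadata/00002.metadata.json
        System.out.println(declared.equals(actual)); // false -> FileNotFoundException
    }
}
```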

Solution proposals

1. Make the native FS behave like the legacy FS with regards to s3a/s3n (under a feature flag)?
2. Try to open both locations (with and without //), but write only with // for both legacy and native FS?
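Proposal 2 could look roughly like the following hypothetical fallback. None of these names are Trino APIs; `opener` stands in for whatever the file system's open call would be:

```java
import java.util.function.Function;

// Hypothetical sketch of proposal 2: try the location as written, and on a
// miss retry the alternate spelling with the double slash collapsed.
public class DualOpen {
    static String collapseDoubleSlash(String location) {
        int schemeEnd = location.indexOf("://") + 3;
        return location.substring(0, schemeEnd)
                + location.substring(schemeEnd).replaceAll("/{2,}", "/");
    }

    static <T> T openWithFallback(String location, Function<String, T> opener) {
        try {
            return opener.apply(location);
        } catch (RuntimeException e) {
            String alternate = collapseDoubleSlash(location);
            if (alternate.equals(location)) {
                throw e; // nothing else to try
            }
            return opener.apply(alternate); // second chance at the collapsed path
        }
    }
}
```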

cc @electrum @osscm @rstyp

findepi commented 4 weeks ago

There was a problem with // in paths which resulted in both (a) // being replaced with / and (b) the url-encoded path being appended to the path. It looks like this is a different problem, but it would be great if you could confirm.

findepi commented 3 weeks ago

https://github.com/trinodb/trino/issues/17803 might be the issue, but I'm not sure; it does not have examples.

sopel39 commented 3 weeks ago

https://github.com/trinodb/trino/issues/17803 might be the issue, but I'm not sure; it does not have examples.

Seems like a similar issue. https://github.com/trinodb/trino/pull/17958 doesn't help if another engine creates such a Hive table with //. Anyway, my issue is more about Iceberg than Hive.

findinpath commented 3 weeks ago

@sopel39 is this issue reproducible with Trino 454 ?

sopel39 commented 3 weeks ago

@findinpath yes, the issue has been reproducible since io.trino.filesystem.hdfs.HadoopPaths#hadoopPath was introduced, so it goes many years back.

We are actually leaning towards creating an Iceberg table property, e.g. native_fs_legacy_path_mode, which would remove // from S3 paths for selected tables when the native FS is used. This would 1) allow "broken" tables to be accessed from the native FS and 2) make new Iceberg tables correct going forward.
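A sketch of what such a per-table mode could amount to. The property name comes from this thread, and nothing here is shipped Trino code:

```java
// Hypothetical sketch of the proposed native_fs_legacy_path_mode table
// property: when enabled, a table's S3 locations get the legacy double-slash
// removal before the native file system resolves them; otherwise they pass
// through untouched, so new tables stay correct.
public class LegacyPathMode {
    static String toNativeLocation(String location, boolean legacyPathMode) {
        if (!legacyPathMode) {
            return location; // new tables: use the location verbatim
        }
        int schemeEnd = location.indexOf("://") + 3;
        // collapse duplicate slashes in the key only, never inside "scheme://"
        return location.substring(0, schemeEnd)
                + location.substring(schemeEnd).replaceAll("/{2,}", "/");
    }
}
```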

osscm commented 3 weeks ago

Yes, we need this: the legacy fs will soon be removed, so we need a way to read the broken tables from the native one.

We are also leaning towards keeping this mode/functionality exactly the same as the legacy fs, where it handles // only for s3 and not for s3a or s3n.

zhaner08 commented 3 weeks ago

we definitely have people relying on the legacy behavior; at the least we should add a flag to support the behavior if we don't want to make it the default

findinpath commented 2 weeks ago

We went through with the investigation on this one (although it took a while to get to look deeper into it). Apologies for the delay in replying.

tldr;

Moving the files on the object storage from s3://bucket/double/slashes/table1 to s3://bucket/double//slashes/table1 (notice / -> // before slashes) fixes the problem, because the new location corresponds to the file locations linked from the metadata and manifest files of the Iceberg tables.

What's left to do is to adjust HadoopPaths in the legacy fs to support not only s3, but also s3a & s3n:

https://github.com/trinodb/trino/blob/6aba7482b8d88d4ae5f7e08d0fbca8e958d5e7ed/lib/trino-hdfs/src/main/java/io/trino/filesystem/hdfs/HadoopPaths.java#L32
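What widening that check would amount to can be sketched as follows. This is an assumption for illustration, not the linked code or the eventual patch:

```java
// Sketch: treat all three S3 scheme spellings alike instead of matching
// "s3://" only, so s3a:// and s3n:// locations get the same legacy-FS
// path handling as s3://.
public class S3Schemes {
    static boolean hasS3Scheme(String location) {
        return location.startsWith("s3://")
                || location.startsWith("s3a://")
                || location.startsWith("s3n://");
    }
}
```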

We'll follow up soon with a PR, and we'll work with @mosabua on documenting a "fix" plan for this unfortunate hiccup.

Why is a "move" (copy to the new location & delete the old files) being suggested instead of a backward-compatibility table property?

The table created on s3a or s3n without properly encoding the paths is effectively broken, because it has metadata files pointing to invalid locations. I see no reason to bend the trinodb/trino s3 native file system to deal with broken tables.

mosabua commented 2 weeks ago

Why is a "move" (copy to the new location & delete the old files) being suggested instead of a backward-compatibility table property?

The table created on s3a or s3n without properly encoding the paths is effectively broken, because it has metadata files pointing to invalid locations. I see no reason to bend the trinodb/trino s3 native file system to deal with broken tables.

I think we cannot force users to perform this move when they migrate from the legacy file system to the new file system support. After all, they used Trino in the first place to get the tables in place. The move potentially comes with large cost and risk, and we should offer a way for users to migrate to the new file system without one.

mosabua commented 2 weeks ago

Also, we are actively discussing this on Slack; it will take some time to make a decision together with @electrum and others.

https://trinodb.slack.com/archives/C07ABNN828M/p1724946906029559

mayankvadariya commented 1 week ago

Meeting minutes with @mosabua @electrum @findinpath @mayankvadariya

We've decided to add a Trino Iceberg table migration procedure that updates the table metadata/manifests to contain the correct S3 paths, so that the table becomes queryable with the native file system.
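The agreed direction can be illustrated as a metadata rewrite. Everything below is hypothetical; the real procedure was not specified in this thread, and the rewrite rule (collapsing "//" after the scheme) is an assumption based on where the legacy FS actually placed the files:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of what such a migration procedure has to do:
// rewrite every file location recorded in the table's metadata and manifest
// files so that the path spelling matches where the data files actually live.
public class MetadataPathMigration {
    static String fixLocation(String location) {
        int schemeEnd = location.indexOf("://") + 3;
        return location.substring(0, schemeEnd)
                + location.substring(schemeEnd).replaceAll("/{2,}", "/");
    }

    static List<String> fixAll(List<String> recordedLocations) {
        return recordedLocations.stream()
                .map(MetadataPathMigration::fixLocation)
                .collect(Collectors.toList());
    }
}
```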