trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.14k stars 2.92k forks source link

Feature request: default iceberg table location using REST catalog #22583

Open olivier-derom opened 2 months ago

olivier-derom commented 2 months ago

According to the Trino Documentation the location property should be optional.

When using the nessie and glue catalog with iceberg, this was the case. Without declaring a location, a relative path was used to the default-warehouse-dir following mywarehouse/myschema/mytable.

Recently I switched from nessie catalog to REST catalog (REST over nessie). Now when Trying to create a schema and table in it without specifying a specific location, I get 'location must be set for myschema'.

Should Trino using iceberg with REST catalog not automatically generate a location to the default warehouse dir (iceberg.rest-catalog.warehouse)?

olivier-derom commented 2 months ago

I would suggest updating defaultTableLocation (REST) of TrinoRestCatalog.java with something similar to the other catalog's defaultTableLocation (Nessie)

Feel free to open a PR on this and link this issue.

olivier-derom commented 2 months ago

As requested by @findinpath, some additional info.

We see similar behaviour in Trino and Spark, where this is defined at catalog level:

Nessie: both Trino and Spark create default location

JDBC catalog: both Trino and Spark create default location

Hadoop catalog: Spark creates default location

HMS: Trino and Spark create default location

We observe that Trino behaves similarly to how Spark would, what is to be expected as uniform behaviour between query engines.

In Iceberg's REST catalog, this default location is not defined. As a default warehouse location is not mandatory in the catalog contig and to keep flexibility within adopting the REST catalog. However, I dont think we should limit ourselves to this, as we already provide the possibility for users using the REST catalog to define a warehouse location as can be observed in the documentation. I dont see any drawbacks into adopting a default-location-approach in case no location has explicitly been given, in contrast, it would be more consistent with all other available catalogs and makes queries much more flexible.

You could argue that we want to strictly keep the same logic as defined in the iceberg spec, in which case I suggest we could make this an opt-in with an additional parameter in trino for rest-catalog: iceberg.rest-catalog.warehouse.location & iceberg.rest-catalog.warehouse.use-default

Let me know what you think.

olivier-derom commented 1 month ago

@ebyhr could you add the iceberg and enhancement tags? Thanks!

olivier-derom commented 1 month ago

Worth mentioning: dbt is the reason I request this. dbt models use SELECT queries which are transformed into CREATE TABLE queries. These models are run on different environments (prod and dev), and does not allow us to specify the location for each table. Therefore automatically determining the locations with REST catalog would be a nice addition. The reason we want to use REST catalog is because of its support for view, which is heavily used in dbt intermediate models and incremental models.

Any second opinion is welcome.