trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Metadata retrieval failure in one catalog results in inability to retrieve any metadata #16361

Open mdesmet opened 1 year ago

mdesmet commented 1 year ago

Actual behaviour:

Some BI tools query the system.jdbc or information_schema tables (tables, columns, schemata) for metadata retrieval. These queries fail when one of the catalogs returns an exception, unless the failing catalog is filtered out (which depends on whether the query's catalog predicate is pushed down).

Expected behaviour:

A SQL query can only fail or succeed, which makes this a bit of a tricky behaviour. Ignoring errors shouldn't be the default behaviour: a client should explicitly opt in to ignoring metadata retrieval errors by setting a session property. Typically this would only be done in BI tools and not in data pipelines (those may use metadata for decision logic).
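
A minimal sketch of what such an opt-in could look like, assuming a purely hypothetical session property name (nothing like it exists in Trino today):

    -- Hypothetical session property; the name is invented for illustration only.
    SET SESSION skip_unavailable_catalogs = true;

    -- With the property set, a catalog that fails during metadata retrieval would be
    -- skipped silently here instead of failing the whole query.
    SELECT table_cat, table_schem, table_name FROM system.jdbc.tables;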

Steps to reproduce:

  1. Deploy a Trino instance with multiple catalogs
  2. Add a postgres catalog with incorrect login/password
  3. Execute queries on system.jdbc tables without filtering out the postgres catalog
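
For reference, a minimal sketch of the reproduction; the catalog file name, host, and credentials below are placeholders:

    # etc/catalog/postgres.properties (deliberately wrong credentials)
    connector.name=postgresql
    connection-url=jdbc:postgresql://example-host:5432/postgres
    connection-user=wrong_user
    connection-password=wrong_password

    -- Fails: listing all tables forces Trino to contact every catalog, including the broken one.
    SELECT table_cat, table_schem, table_name FROM system.jdbc.tables;

    -- Succeeds: the catalog predicate is pushed down, so the broken catalog is never contacted
    -- ('tpch' stands in for any healthy catalog on the cluster).
    SELECT table_cat, table_schem, table_name FROM system.jdbc.tables WHERE table_cat = 'tpch';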

@leniartek : At this point we only know about Dataiku. Could you list any other BI tools you are aware of that are affected by this issue?

kokosing commented 1 year ago

See https://github.com/trinodb/trino/issues/6551#issuecomment-903798696 and please let me know if the proposal there addresses your use case.

mdesmet commented 1 year ago

> See #6551 (comment) and please let me know if the proposal there addresses your use case.

The risk I see in that proposal is that a data pipeline which uses metadata queries on one catalog to drive actions in another catalog would behave incorrectly when transient issues occur. IMHO this should never be the default behaviour.
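
To make the risk concrete, here is a hypothetical pipeline step (catalog and schema names are invented) that drives actions off metadata:

    -- Hypothetical maintenance step: find staging tables that no longer exist in the source catalog.
    -- If metadata errors were silently ignored, a transient failure in source_catalog would make
    -- every source table appear missing, and a downstream cleanup job would drop tables it should keep.
    SELECT table_name
    FROM staging_catalog.information_schema.tables
    WHERE table_schema = 'staging'
      AND table_name NOT IN (
          SELECT table_name
          FROM source_catalog.information_schema.tables
          WHERE table_schema = 'public');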

Jorricks commented 1 year ago

Looker has the exact same issue when running on bare-metal HDFS. It is unable to get metadata for some tables because it does not have read permission on PII tables, so the metadata query fails and thus the entire query fails.

electrum commented 1 year ago

Which tools query for tables or columns across all catalogs? Querying tables is expensive, and querying columns is extremely expensive. What do they do with this information?

Do users want to query across all catalogs? If so, do they want to silently ignore missing data?

My initial thought is that this is a bug in those tools. We could add an option in the driver to always filter to a specific catalog if none is specified. The MySQL JDBC driver has an option nullDatabaseMeansCurrent which seems to be for this purpose.
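
For reference, the MySQL option is a connection URL parameter; a comparable Trino driver option (purely hypothetical, it does not exist today) might look like this:

    # MySQL Connector/J: a null database in metadata calls means the current database
    jdbc:mysql://example-host:3306/sales?nullDatabaseMeansCurrent=true

    # Hypothetical Trino analogue: restrict metadata calls to the connection's catalog
    jdbc:trino://example-host:8080/hive?nullCatalogMeansCurrent=true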

tooptoop4 commented 1 year ago

similar to https://github.com/trinodb/trino/issues/1173

mdesmet commented 1 year ago

I think there are use cases where multiple catalogs need to be queried together. Also, in the case of joins, a filter predicate is not always pushed down into the system table scan (see the example below).

The query in this example could be fixed by explicitly adding the catalog predicate on the joined table (see the sketch after the plan output). However, BI tools are not built for federated databases; they assume a single database server. Fixing the query would be better, but is it actually feasible?

trino:default> explain analyze select case when t.table_type = 'BASE TABLE' then 'table'
            ->              when t.table_type = 'VIEW' then 'view'
            ->              when mv.name is not null then 'materializedview'
            ->              else t.table_type end as table_type from iceberg.INFORMATION_SCHEMA.tables t
            ->       left join system.metadata.materialized_views mv on mv.catalog_name = t.table_catalog and mv.schema_name = t.table_schema and mv.name = t.table_name;
                                                                                                               Query Plan                                                                               >
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 Trino version: 406                                                                                                                                                                                     ...
 Fragment 2 [SOURCE]
     CPU: 17.66ms, Scheduled: 112.09ms, Blocked 0.00ns (Input: 0.00ns, Output: 0.00ns), Input: 13 rows (922B); per task: avg.: 13.00 std.dev.: 0.00, Output: 13 rows (1.01kB)
     Output layout: [table_catalog, table_schema, table_name, table_type, $hashvalue_0]
     Output partitioning: HASH [table_catalog, table_schema, table_name][$hashvalue_0]
     ScanProject[table = iceberg:InformationSchemaTableHandle{catalogName=iceberg, table=TABLES, prefixes=[iceberg.*.*], limit=OptionalLong.empty}]
         Layout: [table_catalog:varchar, table_schema:varchar, table_name:varchar, table_type:varchar, $hashvalue_0:bigint]
         Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
         CPU: 17.00ms (21.25%), Scheduled: 112.00ms (0.70%), Blocked: 0.00ns (0.00%), Output: 13 rows (1.01kB)
         Input avg.: 13.00 rows, Input std.dev.: 0.00%
         $hashvalue_0 := combine_hash(combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("table_catalog"), 0)), COALESCE("$operator$hash_code"("table_schema"), 0)), COALESCE("$operator$hash_code"("table_name"), 0))
         table_schema := table_schema
         table_catalog := table_catalog
         table_name := table_name
         table_type := table_type
         Input: 13 rows (922B), Filtered: 0.00%, Physical Input: 922B

 Fragment 3 [SOURCE]
     CPU: 60.84ms, Scheduled: 15.89s, Blocked 0.00ns (Input: 0.00ns, Output: 0.00ns), Input: 1 row (85B); per task: avg.: 1.00 std.dev.: 0.00, Output: 1 row (43B)
     Output layout: [catalog_name, schema_name, name, $hashvalue_3]
     Output partitioning: HASH [catalog_name, schema_name, name][$hashvalue_3]
     ScanProject[table = system:metadata.materialized_views]
         Layout: [catalog_name:varchar, schema_name:varchar, name:varchar, $hashvalue_3:bigint]
         Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
         CPU: 60.00ms (75.00%), Scheduled: 15.89s (99.28%), Blocked: 0.00ns (0.00%), Output: 1 row (43B)
         Input avg.: 1.00 rows, Input std.dev.: 0.00%
         $hashvalue_3 := combine_hash(combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("catalog_name"), 0)), COALESCE("$operator$hash_code"("schema_name"), 0)), COALESCE("$operator$hash_code"("name"), 0))
         catalog_name := catalog_name
         name := name
         schema_name := schema_name
         Input: 1 row (85B), Filtered: 0.00%, Physical Input: 85B

(1 row)
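
A sketch of the rewrite described above: duplicating the catalog predicate inside the join condition lets it be pushed into the materialized_views scan without changing the left-join semantics (whether the optimizer actually pushes it down should be confirmed with EXPLAIN):

    select case when t.table_type = 'BASE TABLE' then 'table'
                when t.table_type = 'VIEW' then 'view'
                when mv.name is not null then 'materializedview'
                else t.table_type end as table_type
    from iceberg.information_schema.tables t
    left join system.metadata.materialized_views mv
      on mv.catalog_name = 'iceberg'
     and mv.catalog_name = t.table_catalog
     and mv.schema_name = t.table_schema
     and mv.name = t.table_name;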

kokosing commented 1 year ago

I am under the impression that the current behavior is OK. Ignoring the error could open a Pandora's box of other issues, so I guess it is fine to be loud about misconfiguration issues that should be fixed by Trino admins.

I suggest closing this issue as "won't do".

hashhar commented 8 months ago

@mdesmet @leniartek other than Dataiku and Looker, are there other clients we are aware of, e.g. DBeaver, Mode, etc.?

I'm asking because we want to test https://github.com/trinodb/trino/pull/20866 to see if it solves some of these issues.

EDIT: Here's a list of known clients which are affected.