Open mdesmet opened 1 year ago
See https://github.com/trinodb/trino/issues/6551#issuecomment-903798696 and please let me know if the proposal there addresses your use case.
The risk I see in that proposal is that, within a data pipeline, a query that reads metadata from one catalog to drive actions in another catalog would behave incorrectly when a transient issue occurs, because the failure would be silently swallowed. IMHO this should never be the default behaviour.
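To make that risk concrete, here is a sketch (catalog, schema, and table names are made up) of a pipeline-style check that would misbehave if failing catalogs were silently skipped:

-- Hypothetical pipeline step: find every catalog/schema that holds a staging table
-- and fan work out per match. If one catalog is temporarily failing and is silently
-- skipped, its staging data is simply missed and the pipeline quietly processes
-- less than it should, instead of failing loudly.
SELECT table_cat, table_schem
FROM system.jdbc.tables
WHERE table_name = 'staging_orders';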
Looker has the exact same issue when running on bare-metal HDFS. It is unable to get some tables' metadata because it does not have read permission on PII tables, so the metadata query fails and with it the entire query.
Which tools query for tables or columns across all catalogs? Querying tables is expensive, and querying columns is extremely expensive. What do they do with this information?
Do users want to query across all catalogs? If so, do they want to silently ignore missing data?
My initial thought is that this is a bug in those tools. We could add an option in the driver to always filter to a specific catalog if none is specified. The MySQL JDBC driver has an option nullDatabaseMeansCurrent which seems to be for this purpose.
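For context, Connector/J takes that as a connection-URL property, so metadata calls with a null database fall back to the database named in the URL; a Trino driver option could behave analogously for the catalog. Host and database names below are placeholders:

jdbc:mysql://db.example.com:3306/sales?nullDatabaseMeansCurrent=true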
similar to https://github.com/trinodb/trino/issues/1173
I think there are use cases where multiple catalogs need to be queried together. Also, in the case of joins, a filter predicate is not always pushed down onto the system table (see the example underneath).
In this example the query could be fixed by explicitly adding the catalog predicate on the joined table (see the sketch after the plan output below). However, BI tools are not built for federated databases; they rather assume a single database server. It would be better to fix the query, but is that actually feasible?
trino:default> explain analyze select case when t.table_type = 'BASE TABLE' then 'table'
-> when t.table_type = 'VIEW' then 'view'
-> when mv.name is not null then 'materializedview'
-> else t.table_type end as table_type from iceberg.INFORMATION_SCHEMA.tables t
-> left join system.metadata.materialized_views mv on mv.catalog_name = t.table_catalog and mv.schema_name = t.table_schema and mv.name = t.table_name;
Query Plan >
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
Trino version: 406 ...
Fragment 2 [SOURCE]
CPU: 17.66ms, Scheduled: 112.09ms, Blocked 0.00ns (Input: 0.00ns, Output: 0.00ns), Input: 13 rows (922B); per task: avg.: 13.00 std.dev.: 0.00, Output: 13 rows (1.01kB)
Output layout: [table_catalog, table_schema, table_name, table_type, $hashvalue_0]
Output partitioning: HASH [table_catalog, table_schema, table_name][$hashvalue_0]
ScanProject[table = iceberg:InformationSchemaTableHandle{catalogName=iceberg, table=TABLES, prefixes=[iceberg.*.*], limit=OptionalLong.empty}]
Layout: [table_catalog:varchar, table_schema:varchar, table_name:varchar, table_type:varchar, $hashvalue_0:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
CPU: 17.00ms (21.25%), Scheduled: 112.00ms (0.70%), Blocked: 0.00ns (0.00%), Output: 13 rows (1.01kB)
Input avg.: 13.00 rows, Input std.dev.: 0.00%
$hashvalue_0 := combine_hash(combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("table_catalog"), 0)), COALESCE("$operator$hash_code"("table_schema"), 0)), COALESCE("$operator$hash_code"("table_name"), 0))
table_schema := table_schema
table_catalog := table_catalog
table_name := table_name
table_type := table_type
Input: 13 rows (922B), Filtered: 0.00%, Physical Input: 922B
Fragment 3 [SOURCE]
CPU: 60.84ms, Scheduled: 15.89s, Blocked 0.00ns (Input: 0.00ns, Output: 0.00ns), Input: 1 row (85B); per task: avg.: 1.00 std.dev.: 0.00, Output: 1 row (43B)
Output layout: [catalog_name, schema_name, name, $hashvalue_3]
Output partitioning: HASH [catalog_name, schema_name, name][$hashvalue_3]
ScanProject[table = system:metadata.materialized_views]
Layout: [catalog_name:varchar, schema_name:varchar, name:varchar, $hashvalue_3:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
CPU: 60.00ms (75.00%), Scheduled: 15.89s (99.28%), Blocked: 0.00ns (0.00%), Output: 1 row (43B)
Input avg.: 1.00 rows, Input std.dev.: 0.00%
$hashvalue_3 := combine_hash(combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("catalog_name"), 0)), COALESCE("$operator$hash_code"("schema_name"), 0)), COALESCE("$operator$hash_code"("name"), 0))
catalog_name := catalog_name
name := name
schema_name := schema_name
Input: 1 row (85B), Filtered: 0.00%, Physical Input: 85B
(1 row)
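As mentioned above, the slow unfiltered scan of system.metadata.materialized_views can be avoided by repeating the catalog predicate on the joined table. A sketch of that rewrite (not taken from the original report; whether the predicate is actually pushed down may depend on the Trino version):

select case when t.table_type = 'BASE TABLE' then 'table'
            when t.table_type = 'VIEW' then 'view'
            when mv.name is not null then 'materializedview'
            else t.table_type end as table_type
from iceberg.INFORMATION_SCHEMA.tables t
left join system.metadata.materialized_views mv
  on mv.catalog_name = 'iceberg'   -- explicit constant predicate so it can be pushed into the scan
 and mv.catalog_name = t.table_catalog
 and mv.schema_name = t.table_schema
 and mv.name = t.table_name;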
I am under the impression that the current behavior is OK. Ignoring the error could open a Pandora's box of other issues. So I guess it is fine to be loud about misconfiguration issues that should be fixed by Trino admins.
I suggest closing this issue as "won't do".
@mdesmet @leniartek other than Dataiku and Looker, are there other affected clients we are aware of? e.g. DBeaver, Mode, etc.
I'm asking because we want to test https://github.com/trinodb/trino/pull/20866 to see if it solves some of the issues.
EDIT: Here's a list of known clients which are affected.
Actual behaviour:
Some BI tools retrieve metadata by querying the system.jdbc or information_schema tables (tables, columns, schemata). Such a query fails when one of the catalogs throws an exception and the failing catalog is not filtered out (whether it is filtered out depends on whether the catalog predicate is pushed down for the given query).
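For illustration (not part of the original report): a driver-level metadata call such as DatabaseMetaData.getTables(null, null, "%", null) is answered from a query roughly along these lines, which touches every catalog and therefore fails as soon as one of them does:

-- no catalog predicate, so every configured catalog is consulted
SELECT table_cat, table_schem, table_name, table_type
FROM system.jdbc.tables;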
Expected behaviour:
A SQL query can only fail or succeed, which makes this tricky behaviour to get right. Ignoring errors shouldn't be the default behaviour. A client should explicitly opt in to ignoring metadata-retrieval errors by setting a session property. Typically only BI tools would do this, not data pipelines (those may use metadata for decision logic).
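A sketch of what that opt-in could look like; the property name skip_failing_catalogs is made up here purely for illustration and does not exist in Trino:

-- hypothetical session property, illustration only
SET SESSION skip_failing_catalogs = true;

-- with the opt-in, catalogs that error out would be silently omitted from the result
SELECT table_cat, table_schem, table_name
FROM system.jdbc.tables;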
Steps to reproduce:
@leniartek: At this point we only know about Dataiku. Could you list any other BI tools you are aware of that are subject to this issue?