starburstdata / dbt-trino

The Trino (https://trino.io/) adapter plugin for dbt (https://getdbt.com)
Apache License 2.0
194 stars 52 forks source link

Executing 'dbt docs generate' when the 'where' clause has different combinations of schemas > produces a permission error #407

Open wusanny opened 1 month ago

wusanny commented 1 month ago

Expected behavior

When we execute dbt docs generate on sources that are located in different schemas - we should be able to access them from the different schemas with no permission issue.

Actual behavior

Filing this on behalf of the dbt team - for a customer who is reporting of the error. To add more context, customer has their datalake in AWS and they use a Trino server to connect to the S3 buckets with the hive connector(glue) for reading and the glacier connector (also glue) for writing.

Executing dbt docs generate on sources located in different schemas, produces an error that seems to suggest that we are not allowed to access the tables.

However, when we run dbt docs generate separately based on the sources from the same schema - everything works as expected.

Steps To Reproduce

in dbt Cloud IDE

  1. Run dbt docs generate on sources that are all in 1 schema -> there should be no error
  2. Run dbt docs generate on sources that are in more than 1 schema -> the error will show up
  3. Run dbt docs generate just on that individual source that was added in second step -> there should be no error

Looking into the logs, we can see the difference is in the where clause of the SQL statement that is being sent to Trino.

Example: First scenario - I run it leaving only references to dbt_foo:

select
        table_catalog as "table_database",
        table_schema as "table_schema",
        table_name as "table_name",
        table_type as "table_type",
        null as "table_owner"
    from "awsdatacatalog".INFORMATION_SCHEMA.tables
            where
        table_schema != 'information_schema'
        and
        (
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_1'
                    )
                 or
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_2'
                    )
                 or
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_3'
                    )
                 or
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_4'
                    )
                )
limit 100;

The query runs without a problem.

Second scenario - if we include just one mention to a different schema (dbt_bar, table_5) the error appears:

select
        table_catalog as "table_database",
        table_schema as "table_schema",
        table_name as "table_name",
        table_type as "table_type",
        null as "table_owner"
    from "awsdatacatalog".INFORMATION_SCHEMA.tables
            where
        table_schema != 'information_schema'
        and
        (
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_1'
                    )
                 or
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_2'
                    )
                 or
                    (
                        table_schema = 'dbt_bar'
                        and table_name = 'table_5'
                    )
                 or
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_3'
                    )
                 or
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_4'
                    )
                )
limit 100;

Error message

[16777216] Query failed (#20240409_040702_00007_egzve): Error listing tables for catalog awsdatacatalog: User: arn:aws:sts::686806083451:assumed-role/oftcloudcheck_quality_role/i-0d805c3bcb8fd3aa3 is not authorized to perform: glue:GetTable on resource: arn:aws:glue:us-east-1:419887539293:catalog because no resource-based policy allows the glue:GetTable action (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 240601db-f2a4-489b-951f-a9ae4ecc8321; Proxy: null) com.amazonaws.services.glue.model.AccessDeniedException: User: arn:aws:sts::686806083451:assumed-role/oftcloudcheck_quality_role/i-0d805c3bcb8fd3aa3 is not authorized to perform: glue:GetTable on resource: arn:aws:glue:us-east-1:419887539293:catalog because no resource-based policy allows the glue:GetTable action (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 240601db-f2a4-489b-951f-a9ae4ecc8321; Proxy: null)

Last scenario - I query the added mention specifically to dbt_bar the query runs OK:

select
        table_catalog as "table_database",
        table_schema as "table_schema",
        table_name as "table_name",
        table_type as "table_type",
        null as "table_owner"
    from "awsdatacatalog".INFORMATION_SCHEMA.tables
            where
        table_schema != 'information_schema'
        and
        (
                    (
                        table_schema = 'dbt_bar'
                        and table_name = 'table_5'

                )
          )
limit 100;

If it was indeed a permission error, querying to dbt_bar would produce the same error.

We tested querying both schemas with a union:

select
        table_catalog as "table_database",
        table_schema as "table_schema",
        table_name as "table_name",
        table_type as "table_type",
        null as "table_owner"
    from "awsdatacatalog".INFORMATION_SCHEMA.tables
            where
        table_schema != 'information_schema'
        and
        (
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_1'
                    )
                 or
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_2'
                    )
                 or

                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_3'
                    )
                 or
                    (
                        table_schema = 'dbt_foo'
                        and table_name = 'table_4'
                    )
                )
UNION

select
        table_catalog as "table_database",
        table_schema as "table_schema",
        table_name as "table_name",
        table_type as "table_type",
        null as "table_owner"
    from "awsdatacatalog".INFORMATION_SCHEMA.tables
            where
        table_schema != 'information_schema'
        and
        (
                    (
                        table_schema = 'dbt_bar'
                        and table_name = 'table_5'

                )
         )
limit 100;

This worked as expected.

Log output/Screenshots

No response

Operating System

Ubuntu 22.04.3 LTS

dbt version

1.7

Trino Server version

441

Python version

Python 3.10.12

Are you willing to submit PR?

damian3031 commented 1 month ago

Could you provide the exact dbt-trino version that you use? If possible, please provide the output of the dbt debug command; it will give us all the basic information.

jesusarroyoo commented 1 month ago

Hello @damian3031 I send you the output of the dbt debug command:

16:17:13 Running with dbt=1.7.13 16:17:13 Registered adapter: trino=1.7.1 16:17:15 Found 26 models, 9 analyses, 8 seeds, 16 sources, 0 exposures, 0 metrics, 421 macros, 0 groups, 0 semantic models 16:17:15 Concurrency: 6 threads (target='prod') 16:17:16 Building catalog 16:17:18 Encountered an error while generating catalog: Database Error TrinoExternalError(type=EXTERNAL, name=HIVE_METASTORE_ERROR, message="Error listing table columns for catalog awsdatacatalog: User: arn:aws:sts::user1 is not authorized to perform: glue:GetTable on resource: arn:aws:glue:us-east-1:1:catalog because no resource-based policy allows the glue:GetTable action (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 1; Proxy: null)", query_id=1_mpbuw) 16:17:18 dbt encountered 1 failure while writing the catalog

Let me know if you need anything else.