mitodl / ol-infrastructure

Infrastructure automation code for use by MIT Open Learning
BSD 3-Clause "New" or "Revised" License
47 stars 4 forks source link

Allow Redash to load table list for Trino data source #1149

Closed blarghmatey closed 2 years ago

blarghmatey commented 2 years ago

User Story

As a data analyst I would like to be able to construct queries against the Trino engine in our Redash BI system.

Description/Context

Our Redash installation is unable to load the schema information for the data lake from Trino. This could be due to the sheer number of tables, a permission issue in the account used in the connection, or some other bug in Redash.

Acceptance Criteria

shaidar commented 2 years ago

Testing the connection on redash (QA) works, however it continues to fail loading any schemas and it’s not a permissions thing. Was looking at the trino query code in redash to see what it’s running to load the schemas and trying running that same query in starburst and it just hangs.

https://github.com/getredash/redash/blob/master/redash/query_runner/trino.py#L79-L85

shaidar commented 2 years ago

Tried running this in Starburst:

SELECT "table_schema" FROM "ol_data_lake_qa"."information_schema"."columns" LIMIT 10;

which throws the following error:


User: arn:aws:sts::xxx:assumed-role/data-lake-query-engine-role-qa/trino-session is not authorized to perform: glue:GetTables on resource: arn:aws:glue:us-east-1:xxx:database/information_schema because no identity-based policy allows the glue:GetTables action (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 35ae286c-f253-4ee0-bd75-c021d6f18af3; Proxy: null)
shaidar commented 2 years ago

Fixed permission on the AWS side, and tried running the query again on the Starburst side without a LIMIT. Ran the following which is a subset of what the Redash Trino connector runs to refresh the schema:

SELECT "table_name" FROM "ol_data_lake_qa"."information_schema"."columns";

That took 23m 39s to complete which kinda explains why testing the Trino data source connection works however it fails to refresh the schema.

shaidar commented 2 years ago

The exact SQL statement that Redash Trino runs took 26m 20s to complete when I ran it in Starburst. Clearly that's not gonna work and we're gonna have to maybe reach out to Starburst folks to see what the issue is and what can be done on their end.

shaidar commented 2 years ago

Starburst folks asked me to run:

select count(*)  FROM information_schema.columns
WHERE table_schema NOT IN ('pg_catalog', 'information_schema');

It took 23m 19s and returned 14,603 rows. Based on what I was told, Starburst/Trino has to make calls to AWS Glue for that info which ends up taking quite a while. They had a similar issue with Powerbi and the way they fixed it was by changing the code to only query objects that the user expanded. There is no real caching layers (10 seconds) since new objects wouldn't show up. Gist of it is that it's bad code on the Redash side and we should open an issue or try and reach out to someone on that team.

shaidar commented 2 years ago

It appears that even when schema is specified in the Redash Trino data source, that schema is not used when Redash runs its get_schema function. I ran the following query in Starburst to see how long it'll take if a schema is specified:

select count(*)  FROM information_schema.columns WHERE table_schema IN ('ol_warehouse_qa_intermediate');

That query took 7s to run and returned with 79 rows. So we probably need to patch Redash so that it actually uses the configured schema when running the get_schema function.

blarghmatey commented 2 years ago

I added a patch in our fork of Redash that adds filtering of the schema when it is configured in the data source to help resolve this issue. https://github.com/mitodl/redash/pull/6/files

blarghmatey commented 2 years ago

That fix has been merged and validated on QA and production.