joshtemple / harlequin-bigquery

A BigQuery adapter for Harlequin, a SQL IDE for the terminal.
MIT License

Data Catalog Error: List index out of range #13

Open theodotdot opened 6 months ago

theodotdot commented 6 months ago

I have installed Harlequin-BigQuery through pipx.

Running on Ubuntu 22.04

I am successfully authenticated using gcloud.

I have the necessary permissions to run the query in the get_catalog function (tested on the BQ console).

Queries in Harlequin work; only the data catalog produces an error on launch: [screenshot: "Data Catalog Error: list index out of range"]

I had a look at adapter.py but couldn't figure it out myself. It seems to have something to do with the list iteration over the results of the query.

I work with a big project, so the query in the get_catalog function yields 109,106 rows, which might be the issue?

I cannot disclose information on the actual project I am working with but let me know if you need anything else!

tconbeer commented 3 months ago

Seems related to https://github.com/tconbeer/harlequin/issues/576

thmswt commented 2 months ago

Hi @theodotdot, I tried to reproduce the error in a Google Cloud project with more than 100k rows of tables/columns as well, but the data catalog displays the tree as expected. So the number of rows returned by the query on INFORMATION_SCHEMA does not seem to be the problem; more likely it is an edge case somewhere in the data the query returns.

The only way I can reproduce the exact same error is when a column exists without a reference to a parent table. Unfortunately, I have not found anything in the BigQuery documentation describing when this case can actually occur, nor do my projects return such results in BQ...
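For illustration, here is a minimal sketch of how such an orphan row could trigger the error. This is not the actual adapter.py code; the row layout and the grouping logic are assumptions, but any loop that attaches each column to the most recently seen table fails the same way when a column row arrives before any table exists in its dataset:

# Hypothetical sketch, not the actual adapter.py. Rows arrive ordered by
# dataset_id, table_id, column_name; BigQuery sorts NULLs first, so an
# orphan column lands before any real table in its dataset.
rows = [
    ("analytics", "events", "BASE TABLE", "event_id", "STRING"),
    ("analytics", "events", "BASE TABLE", "ts", "TIMESTAMP"),
    ("scratch", None, None, "orphan_col", "STRING"),  # column with no table
]

catalog: dict[str, list[dict]] = {}
for dataset_id, table_id, _table_type, column_name, column_type in rows:
    tables = catalog.setdefault(dataset_id, [])
    if table_id is not None and (not tables or tables[-1]["name"] != table_id):
        tables.append({"name": table_id, "columns": []})
    if column_name is not None:
        # For the orphan row, tables is still empty, so tables[-1] raises
        # IndexError: list index out of range
        tables[-1]["columns"].append((column_name, column_type))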

To fix it, we can handle that case in adapter.py, but I am interested to see whether your project actually contains such an edge case. Could you run the following SQL against it?

select
    datasets.schema_name as dataset_id,
    tables.table_name as table_id,
    tables.table_type as table_type,
    columns.column_name as column_name,
    columns.data_type as column_type
from `{self.project}.region-{self.location}.INFORMATION_SCHEMA.SCHEMATA` datasets
left join `{self.project}.region-{self.location}.INFORMATION_SCHEMA.TABLES` tables
on datasets.catalog_name = tables.table_catalog
and datasets.schema_name = tables.table_schema
left join `{self.project}.region-{self.location}.INFORMATION_SCHEMA.COLUMNS` columns
using (table_catalog, table_schema, table_name)
where tables.table_name is null and column_name is not null
order by dataset_id, table_id, column_name

I just added the where clause to the original statement found in adapter.py.
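If the query does surface such rows, a guard along these lines in adapter.py would be one way to handle them (a sketch only, reusing the assumed row shape from the snippet above; the real field access in adapter.py may differ):

def drop_orphan_columns(rows):
    # Skip rows whose table reference is missing, so the catalog loop never
    # tries to attach a column to a table that was never created.
    return [row for row in rows if row[1] is not None]  # row[1] == table_id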

Let me know if this query returns any data.

Thanks.

theodotdot commented 2 months ago

Hey @thmswt, thanks for having a look!

I tried running the query you sent, but it returns an empty result, so I don't think there are any columns without table references.

Recently I have been working in another GCP project within the same org, and the data catalog works fine there. I have also tested a bunch of other projects, but they are all much smaller in terms of tables, datasets, etc., and they all work fine.

My guess is that something funky in this particular project is creating an edge case, but I have looked and found nothing unusual apart from the fact that it's massive. There are about 3k tables and 150k rows worth of tables/columns (from the query you sent).

Happy to keep troubleshooting if you have other ideas, let me know!