Open theodotdot opened 6 months ago
Seems related to https://github.com/tconbeer/harlequin/issues/576
Hi @theodotdot, I tried to reproduce the error in a Google Cloud project with more than 100k rows of tables/columns as well, but the data catalog displays the tree as expected. It seems the number of rows returned by the query on INFORMATION_SCHEMA is not the cause of the problem; more likely it is an edge case somewhere in the data returned by the query.
The only way I can reproduce the exact same error is when columns exist without reference to a table. Unfortunately, I have not found any information in the BigQuery documentation where this case can actually occur, nor do my projects return such results in BQ...
To fix it, we can handle the case in `adapter.py`, but I am interested in whether you can query your project with the following SQL, just to see if your project contains such edge cases:
```sql
select
    datasets.schema_name as dataset_id,
    tables.table_name as table_id,
    tables.table_type as table_type,
    columns.column_name as column_name,
    columns.data_type as column_type
from `{self.project}.region-{self.location}.INFORMATION_SCHEMA.SCHEMATA` datasets
left join `{self.project}.region-{self.location}.INFORMATION_SCHEMA.TABLES` tables
    on datasets.catalog_name = tables.table_catalog
    and datasets.schema_name = tables.table_schema
left join `{self.project}.region-{self.location}.INFORMATION_SCHEMA.COLUMNS` columns
    using (table_catalog, table_schema, table_name)
where tables.table_name is null and column_name is not null
order by dataset_id, table_id, column_name
```
I just added the where clause to the original statement found in `adapter.py`.
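For reference, a defensive fix could skip column rows that have no parent table while building the catalog. This is a hypothetical sketch, not the actual structure of `adapter.py`; the row shape and the `build_catalog` name are assumptions for illustration:

```python
# Sketch of defensive handling for the catalog query results.
# Assumes each row is a (dataset_id, table_id, table_type,
# column_name, column_type) tuple; the real adapter may differ.
from collections import defaultdict


def build_catalog(rows):
    """Group rows into {dataset: {table: {...}}}, skipping the
    edge case of a column row without a parent table."""
    catalog = defaultdict(dict)
    for dataset_id, table_id, table_type, column_name, column_type in rows:
        if table_id is None:
            # Column without a table reference: skip instead of
            # crashing while iterating the query results.
            continue
        table = catalog[dataset_id].setdefault(
            table_id, {"type": table_type, "columns": []}
        )
        if column_name is not None:
            table["columns"].append((column_name, column_type))
    return catalog
```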
Let me know if this query returns any data.
Thanks.
Hey @thmswt, thanks for having a look!
I tried running the query you sent, but it returns an empty result, so I don't think there are any columns without table references.
Recently I have been working in another GCP project within the same org and the data catalog has been working fine. I also have been testing it on a bunch of other projects but they are all much smaller in terms of number of tables, datasets etc and they all seem to work fine.
My guess would be that something funky is happening in this project in particular that creates an edge case, but I have looked and, apart from the fact that it's massive, nothing stands out. There are about 3k tables and 150k rows' worth of tables/columns (from the query you sent).
Happy to keep troubleshooting if you have other ideas, let me know!
I have installed Harlequin-BigQuery through pipx, running on Ubuntu 22.04.
I am successfully authenticated using `gcloud`.
I have the necessary permissions to run the query in the `get_catalog` function (tested in the BQ console).
Queries in Harlequin work; only the data catalog produces an error on launch:
I had a look at `adapter.py` but couldn't figure it out myself. It seems to have something to do with the list iteration over the results of the query.
I work with a big project, so the `get_catalog` function yields 109,106 rows, which might be the issue?
I cannot disclose information on the actual project I am working with, but let me know if you need anything else!