hafenkran / duckdb-bigquery

Integrates DuckDB with Google BigQuery, allowing direct querying and management of BigQuery datasets
MIT License

Provide lazy catalog loading #19

Closed · Kayrnt closed this 2 months ago

Kayrnt commented 2 months ago

Right now, if we want to query my_project.my_dataset1.my_table2, the current behavior is to load the whole catalog up front rather than only the entries needed for that query.

All of those metadata calls are sequential, so it can be pretty slow (a few minutes for datasets with hundreds of tables).

A lazy approach could be to read only the entries required, by doing a single GetTable call for each table involved.
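
To make the cost concrete, the two call patterns look roughly like this with the plain google-cloud-bigquery Python client (a sketch of the API traffic only, not the extension's actual code; the project, dataset, and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my_project")

def load_full_catalog():
    """Eager loading: enumerate every dataset and table up front.

    One datasets.list call, one tables.list call per dataset, and one
    tables.get call per table -- all issued one after another.
    """
    catalog = {}
    for ds in client.list_datasets():                    # datasets.list
        for tbl in client.list_tables(ds.reference):     # tables.list per dataset
            table = client.get_table(tbl.reference)      # tables.get per table
            catalog[f"{ds.dataset_id}.{tbl.table_id}"] = table.schema
    return catalog

def load_single_table():
    """Lazy loading: a single tables.get for the table the query touches."""
    return client.get_table("my_project.my_dataset1.my_table2").schema
```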

hafenkran commented 2 months ago

Yes, I addressed this problem in the documentation. The current approach is to either scope the attached catalog to a specific dataset, or to use a direct scan on tables without ATTACH-ing at all. I'm not sure whether a fully lazy approach is practical or even doable, as many operations require the table schema to be set before query execution. Finally, since there are no complaints yet, I would rather keep it simple here ;)
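
For reference, those two workarounds look roughly like this through DuckDB's Python API (a sketch only: I'm assuming the extension installs from the community repository and that the ATTACH connection string accepts a dataset filter, so check the README for the exact option names; all names are placeholders):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL bigquery FROM community")  # assumes community distribution
con.sql("LOAD bigquery")

# Workaround 1: scope the attached catalog to a single dataset so only that
# dataset's tables are loaded (dataset filter option assumed from the README).
con.sql("ATTACH 'project=my_project dataset=my_dataset1' AS bq (TYPE bigquery)")
con.sql("SELECT * FROM bq.my_dataset1.my_table2 LIMIT 10").show()

# Workaround 2: scan a single table directly, without ATTACH-ing at all.
con.sql("""
    SELECT *
    FROM bigquery_scan('my_project.my_dataset1.my_table2')
    LIMIT 10
""").show()
```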

github-christophe-oudar commented 2 months ago

> I'm not sure whether a fully lazy approach is practical or even doable, as many operations require the table schema to be set before query execution.

I'd cache whatever I could, read as little as possible from the APIs, and parallelize the calls 😅
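
Roughly along these lines, sketched with the plain Python client and a hypothetical schema cache (illustrative only, not the extension's code):

```python
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client = bigquery.Client(project="my_project")
_schema_cache = {}  # table id -> schema, reused across queries

def get_schema(table_id):
    # Cache hit: no API call at all.
    if table_id not in _schema_cache:
        _schema_cache[table_id] = client.get_table(table_id).schema
    return _schema_cache[table_id]

def get_schemas(table_ids):
    # Cache misses are fetched in parallel instead of one by one.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(zip(table_ids, pool.map(get_schema, table_ids)))
```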

> Finally, since there are no complaints yet, I would rather keep it simple here ;)

I guess hardly anyone is aware of the extension yet 😉 But I'm already complaining if you need any excuse 😄