bdholder opened 3 months ago
Harlequin now supports catalog lazy-loading, but it hasn't been implemented for this adapter yet. For an example, see the duckdb implementation: https://github.com/tconbeer/harlequin/blob/09aa2a5858f8efa021ae317e762a4bdcfe2296a9/src/harlequin_duckdb/adapter.py#L138
and: https://github.com/tconbeer/harlequin/blob/main/src/harlequin_duckdb/catalog.py
I have to deal with a database that contains roughly 13,000 tables and 973,000 columns. Perhaps unsurprisingly, with the current design, loading the catalog takes about 30 minutes. After inspecting the code, I traced the cause to the tight loop that repeatedly queries for column information, and, after doing some additional research into how ODBC works under the hood, discovered a better method. Some sort of lazy loading technique would be preferred, but since it appears that an adapter redesign is in progress, there is still an easy win to be had in the meantime.
One can call the `columns` method on a pyodbc cursor without arguments, in the same way that `tables` is called, and it will return "all" columns in a single request. The only data missing is the table type, and that can be pulled from the relatively inexpensive call to retrieve the tables, with the disadvantage (?) that not every "column-containing item" is returned by `tables`. I've encountered such things as temporary tables missing, but I don't think it's a big deal.

Iterating through columns also has the advantage that the data is already sorted: per the ODBC spec, the `SQLColumns` result set is ordered by `TABLE_CAT`, `TABLE_SCHEM`, `TABLE_NAME`, and `ORDINAL_POSITION`.
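To make the idea concrete, here is a rough sketch of the approach (the function name and the returned shape are mine, not the adapter's; I'm only assuming pyodbc's documented `cursor.tables()` / `cursor.columns()` catalog calls, and the lowercase row attribute names from the pyodbc docs, which can vary by driver):

```python
from itertools import groupby

import pyodbc


def load_catalog(conn: pyodbc.Connection) -> dict:
    """Build {(catalog, schema, table): (table_type, [(column, type_name), ...])}
    from two bulk catalog calls instead of one columns() query per table."""
    cursor = conn.cursor()

    # One relatively inexpensive call to tables() supplies the table types
    # (TABLE, VIEW, ...). Some column-containing items (e.g. temp tables)
    # may be missing here, hence the fallback below.
    table_types = {
        (row.table_cat, row.table_schem, row.table_name): row.table_type
        for row in cursor.tables()
    }

    catalog = {}
    # One call to columns() with no arguments returns every column the driver
    # will expose; the ODBC spec orders the result set by TABLE_CAT,
    # TABLE_SCHEM, TABLE_NAME, ORDINAL_POSITION, so groupby works directly.
    for key, group in groupby(
        cursor.columns(),
        key=lambda row: (row.table_cat, row.table_schem, row.table_name),
    ):
        columns = [(col.column_name, col.type_name) for col in group]
        catalog[key] = (table_types.get(key, "TABLE"), columns)
    return catalog
```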
The only potential issue that I can see is that "if an ordinary argument is set to a null pointer and the argument is not a required argument, the argument's behavior is driver-dependent" (Ordinary Arguments). The only argument where this is an issue is the `CatalogName` argument to `columns`, but the drivers I've tested this on default to returning all data, so I don't think it's immediately pressing.

I've implemented a proof of concept using a single call to `columns`, and based on some informal experiments, the result is about a 10x speedup, making the catalog retrievable in a passable 3 minutes rather than 30. Processing the data in Python for that large dataset takes about 4.5 seconds; that could probably be shaved down a bit, but since the data transfer swamps it, it doesn't seem worth spending much effort on.

I'd be pleased to contribute the code if you could give me an idea of any tests you'd like to see. Incidentally, it also fixes a bug that caused a crash once the catalog did finally load. I traced that problem to an autocompletion built from the catalog here.
The `HarlequinCompletion` `label` needs to be a string, but the label, constructed here, was `None`, which eventually triggered an exception when the label was lowercased. This particular ODBC driver returns `None` as the catalog name, which turns out to be legal behavior for `TABLE_CAT` and `TABLE_SCHEM` values according to the standard.
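One way to guard against that, as a minimal sketch (the `safe_label` helper is hypothetical, not something already in the adapter):

```python
from typing import Optional


def safe_label(name: Optional[str]) -> str:
    """Coerce possibly-NULL ODBC identifiers (TABLE_CAT, TABLE_SCHEM) to str.

    Drivers may legally return NULL here, which pyodbc surfaces as None;
    the completion label needs to be a string, and lowercasing None is
    what triggers the crash.
    """
    return "" if name is None else name


assert safe_label(None) == ""
assert safe_label("main") == "main"
```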
Thanks for the cool project!