tconbeer / harlequin-odbc

An ODBC adapter for Harlequin
MIT License

Catalog performance issues #12

Open bdholder opened 3 months ago

bdholder commented 3 months ago

I have to deal with a database that contains roughly 13,000 tables and 973,000 columns. Perhaps unsurprisingly, with the current design, loading the catalog takes about 30 minutes. After inspecting the code, I traced the cause to the tight loop that repeatedly queries for column information and, after doing some additional research into how ODBC works under the hood, found a better approach. Some sort of lazy loading would be preferable, but since an adapter redesign appears to be in progress, there is still an easy win to be had in the meantime.

One can call the columns method on a pyodbc cursor without arguments, in the same way that tables is called, and it will return "all" columns in a single request. The only data missing is the table type, and that can be pulled from the relatively inexpensive call to retrieve the tables, with the disadvantage (?) that not every "column-containing item" is returned by tables. I've noticed, for example, that temporary tables go missing, but I don't think that's a big deal.
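To sketch the idea (this is not the adapter's code; the DSN is made up and I'm indexing result columns by their positions in the ODBC spec):

```python
import pyodbc

conn = pyodbc.connect("DSN=my_dsn")  # hypothetical connection string
cursor = conn.cursor()

# One relatively cheap SQLTables call; per the ODBC spec, result columns
# 0-3 are TABLE_CAT, TABLE_SCHEM, TABLE_NAME, TABLE_TYPE.
table_types = {(t[0], t[1], t[2]): t[3] for t in cursor.tables()}

# One SQLColumns call with no filters instead of one call per table.
# Result columns 0-3 are TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME;
# column 5 is TYPE_NAME.
columns_by_table: dict = {}
for c in cursor.columns():
    columns_by_table.setdefault((c[0], c[1], c[2]), []).append((c[3], c[5]))

# Anything columns() returned that tables() didn't (e.g. temporary tables)
# just gets a default type.
for key in columns_by_table:
    table_types.setdefault(key, "TABLE")
```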

Iterating through columns also has the advantage that the data is already sorted:

> SQLColumns returns the results as a standard result set, ordered by TABLE_CAT, TABLE_SCHEM, TABLE_NAME, and ORDINAL_POSITION.

(SQLColumns Function)
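Relying on that ordering, a single pass with itertools.groupby is enough to regroup the flat result set by table; roughly (same positional-index assumption as above):

```python
from itertools import groupby

# SQLColumns results arrive ordered by TABLE_CAT, TABLE_SCHEM, TABLE_NAME
# and ORDINAL_POSITION, so one pass with groupby needs no extra sorting.
for (cat, schema, table), rows in groupby(
    cursor.columns(), key=lambda r: (r[0], r[1], r[2])
):
    cols = [(r[3], r[5]) for r in rows]  # (COLUMN_NAME, TYPE_NAME)
    # ...build one catalog entry for (cat, schema, table) here...
```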

The only potential issue that I can see is that "if an ordinary argument is set to a null pointer and the argument is not a required argument, the argument's behavior is driver-dependent." (Ordinary Arguments) The only argument where this is an issue is the CatalogName argument to columns, but the drivers that I've tested this on default to returning all data. I don't think this is immediately pressing myself.

I've implemented a proof of concept using a single call to columns, and based on some informal experiments, the result is about a 10x speedup, making the catalog retrievable in a passable 3 minutes rather than 30. Processing the data in Python for that large dataset takes about 4.5 seconds; that could probably be shaved down a bit, but since the data transfer swamps it, it doesn't seem worth spending much effort on.

I'd be pleased to contribute the code if you could give me an idea of any tests you'd like to see. Incidentally, it also fixes a bug that caused a crash once the catalog did finally load. I traced that problem to an autocompletion built from the catalog here. The HarlequinCompletion label needs to be a string, but the label, constructed here, was None, which eventually triggered an exception when the label was lowercased. This particular ODBC driver returns None as the catalog name, which turns out to be legal behavior for TABLE_CAT and TABLE_SCHEM values according to the standard.
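For what it's worth, the kind of guard I have in mind is along these lines (a hypothetical helper, not the actual code):

```python
def make_label(table_cat, table_schem, table_name):
    # TABLE_CAT and TABLE_SCHEM may legitimately come back as None, so drop
    # them rather than let None reach code that lowercases the label.
    parts = [p for p in (table_cat, table_schem, table_name) if p is not None]
    return ".".join(parts)
```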

Thanks for the cool project!

tconbeer commented 1 month ago

Harlequin now supports catalog lazy-loading, but it hasn't been implemented for this adapter yet. For an example, see the duckdb implementation: https://github.com/tconbeer/harlequin/blob/09aa2a5858f8efa021ae317e762a4bdcfe2296a9/src/harlequin_duckdb/adapter.py#L138

and: https://github.com/tconbeer/harlequin/blob/main/src/harlequin_duckdb/catalog.py
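Not tied to Harlequin's actual catalog interfaces, but the general shape of a lazy implementation with pyodbc would be something like this (illustrative names only):

```python
class LazyTableNode:
    """Illustrative only: defers the per-table SQLColumns call until the
    node is expanded in the catalog tree."""

    def __init__(self, cursor, catalog, schema, table):
        self.cursor = cursor
        self.catalog = catalog
        self.schema = schema
        self.table = table
        self._children = None

    def fetch_children(self):
        if self._children is None:
            # One scoped SQLColumns call for just this table.
            self._children = [
                (row[3], row[5])  # (COLUMN_NAME, TYPE_NAME)
                for row in self.cursor.columns(
                    table=self.table, catalog=self.catalog, schema=self.schema
                )
            ]
        return self._children
```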