tconbeer / harlequin

The SQL IDE for Your Terminal.
https://harlequin.sh
MIT License
3.65k stars · 83 forks

CatalogItem and performance #491

Open rymurr opened 6 months ago

rymurr commented 6 months ago

While working on https://github.com/rymurr/harlequin-snowflake I was noticing that the catalog generation was taking upwards of 1min for my (rather large) snowflake.

I was experimenting with ways to make the children parameter of CatalogItem lazy without much success. I thought I would raise the issue here so we could talk about the best design to allow loading of large catalogs.

alexmalins commented 6 months ago

Hey Ryan, nice work on the snowflake adapter 🙌 To confirm, is the slow part the metadata queries to Snowflake, or the Python side creating the CatalogItem objects?

alexmalins commented 6 months ago

For comparison, indexing the catalogs in the databricks adapter is slow when not using Unity Catalog, as it requires an individual SQL call for every table to fetch its metadata. That query overhead takes much longer than the Python work to create the CatalogItems.

When using Unity Catalog indexing, only 2 SQL queries to Databricks are required in total, so indexing is fast even for big lakes.

So I was wondering about your Snowflake example taking more than a minute: is it because you have lots of databases, meaning multiple Snowflake SQL queries to fetch the metadata (from your commented-out code it looks like one query per database)? Or do you have only a few databases, but Snowflake is slow to return results for those queries? Or is the overhead all in Python, creating thousands of CatalogItem instances from the metadata?

rymurr commented 6 months ago

Hey @alexmalins I haven't tested yet, but I think it's a bit of both. I have rather a lot of databases in this account, so I am making ~50 requests to Snowflake and creating upwards of 50k CatalogItems.

Thinking more about it, for me the real advantage of a lazy catalog for Snowflake is the current need to start a warehouse to get metadata from information_schema. That collection takes >1 min, so the user pays for several minutes of compute on every catalog refresh (I think this is a peculiarity of Snowflake's cost model and doesn't really apply to other backends). The idiomatic Snowflake way to get catalog metadata is `show terse {databases,schemas,tables,columns}` invocations, which don't start a warehouse and are ~free.
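For illustration, a minimal sketch of those warehouse-free calls using snowflake-connector-python (connection parameters are placeholders):

```python
import snowflake.connector

# Placeholder credentials; a real adapter would take these from config.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="..."
)
cur = conn.cursor()

# SHOW TERSE is served from the metadata layer, so no warehouse is
# started and the call is effectively free.
cur.execute("SHOW TERSE DATABASES")
databases = [row[1] for row in cur.fetchall()]  # name is the second column
```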

tconbeer commented 6 months ago

I agree that lazy loading is a good idea for some resources, especially on some adapters, but honestly even a local DuckDB or SQLite catalog is slow when you have tens of thousands of CatalogItems (which isn't that hard to reach).

The Catalog widget is based on Textual's built-in Tree viewer, but Textual has another widget, the DirectoryTree, which lazy-loads node children. We use this for Harlequin's file and S3 viewers. The Tree viewer we use also has an API for adding children to a node, so the tree viewer itself isn't the issue here.
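For reference, a minimal sketch of adding children to a node with Textual's built-in Tree (the labels are made up):

```python
from textual.app import App, ComposeResult
from textual.widgets import Tree


class LazyTreeApp(App):
    def compose(self) -> ComposeResult:
        tree: Tree[str] = Tree("catalog")
        tree.root.expand()
        # Children can be added to any node after the tree is built,
        # which is what makes lazy loading possible.
        db = tree.root.add("my_database")
        db.add_leaf("my_table")
        yield tree


if __name__ == "__main__":
    LazyTreeApp().run()
```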

In the v1 adapter interface, we force the adapter to provide all the nodes up-front, but this was a bad choice by me. It was driven by my hatred of slow catalog panes in DataGrip and SSMS, but it clearly doesn't scale well. Caching the catalog helps, but refreshes and cold starts are painful for a db of any real size, for most adapters.

For a v2 adapter interface, CatalogItems are going to get more complicated, and won't just be data. There are a few other issues tagged "adapter v2" to give you an idea of what I'd like to support, e.g., arbitrary actions on CatalogItems (#213) and arbitrary/pluggable catalogs (#421).

I think a general design for #213 points the way forward here -- a CatalogItem needs to provide a method for fetching its children, so that Harlequin can decide the best time to do that. I would personally like to prefetch the first level of hidden children, so that exploration is faster and you don't have to wait for a SQL query to complete every time you expand a tree node.

alexmalins commented 6 months ago

@rymurr my experience from the Databricks side is that the speed of building a Catalog is determined mainly by the communication with the server, not by Python instantiating the CatalogItem objects.

E.g. creating a million CatalogItems in pure Python is quick:

```python
from harlequin.catalog import CatalogItem

items = []
for i in range(1, 1_000_000):
    items.append(
        CatalogItem(
            qualified_identifier=f"{i}.{i}.{i}.{i}",
            query_name=str(i),
            label=str(i),
            type_label=str(i),
        )
    )
```
So if the lazy-indexing Snowflake solution means switching from ~50 total server communications to one communication per item expanded in the Harlequin Data Catalog pane, indexing everything will likely become slower in total. But that's no practical problem, as users probably won't be opening every leaf in the tree, as long as the communications don't make the Data Catalog pane feel too laggy when items are expanded for the first time.

For the Databricks adapter, Ted, it should be fine for me to implement the new lazy API in adapter v2. But in practice, for Unity Catalog indexing it will make sense for me to store a pyarrow table with all the table and column metadata on each refresh, and have the lazy items query that table locally, not the server, each time a user expands an item in the Data Catalog pane. I.e. "fake" lazy loading, as it'll just be much faster that way.
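A rough sketch of that "fake" lazy loading idea (the column names and helper are made up for illustration):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Built once per catalog refresh from the Unity Catalog queries.
metadata = pa.table({
    "schema_name": ["main", "main", "audit"],
    "table_name": ["users", "orders", "events"],
})


def tables_in_schema(schema: str) -> list[str]:
    # Filter the cached table locally; no round-trip to the server.
    mask = pc.equal(metadata.column("schema_name"), schema)
    return metadata.filter(mask).column("table_name").to_pylist()


print(tables_in_schema("main"))  # ['users', 'orders']
```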

tconbeer commented 1 month ago

Getting very close on this. The basic design is a new subclass of CatalogItem called InteractiveCatalogItem that defines a fetch_children() method to lazy-load its children. Harlequin will pre-fetch the children for any visible nodes. This seems to work really well with DuckDB; I need to do more testing with higher-latency databases.
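A rough sketch of that design (InteractiveCatalogItem and fetch_children() are the names above, but the signature is an assumption, not the final v2 API):

```python
from harlequin.catalog import CatalogItem


class InteractiveCatalogItem(CatalogItem):
    def fetch_children(self) -> list[CatalogItem]:
        # Run one narrowly scoped metadata query (e.g., list the tables
        # in a single schema) only when Harlequin asks for this node's
        # children, instead of walking the whole catalog up-front.
        raise NotImplementedError
```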

More exciting, InteractiveCatalogItems will also get to define a list of Interactions, which populate a context menu per #213. This is really quite cool. Here's a rough-and-ready demo:

https://github.com/user-attachments/assets/0675586d-ede0-43f8-ae6f-641e53c349f7

alexmalins commented 1 month ago

looks sweet 💯

fetch_children() will definitely improve the UX for Databricks users with Hive metastores

I'm just about to ship an update to harlequin-databricks to implement query cancelling