Streaming Catalog

dominiklohmann commented 7 months ago

The catalog is the central component of a Tenzir Node that owns all partitions. Given an expression, it performs a sparse index lookup to return candidate partitions for evaluation. Currently, the catalog lookup scales linearly with the number of partitions.

For nodes that manage many partitions, this can easily become a problem, because the catalog lookup time is a fixed cost that every export pays before any results arrive. We want to change the catalog to stream its candidate partitions so that initial results arrive sooner.
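To illustrate the difference, here is a minimal Python sketch, not the actual Tenzir implementation. It assumes a hypothetical `PartitionSynopsis` that stores only a timestamp range per partition, and a query that is just a time interval. `lookup_batch` scans the whole catalog before returning anything, while `lookup_streaming` yields each candidate as soon as it is found. All names here are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class PartitionSynopsis:
    """Hypothetical sparse-index entry: the time range one partition covers."""

    partition_id: int
    min_ts: int
    max_ts: int

    def may_match(self, lo: int, hi: int) -> bool:
        # A partition is a candidate if its range overlaps the query range.
        return self.min_ts <= hi and self.max_ts >= lo


def lookup_batch(catalog: Iterable[PartitionSynopsis], lo: int, hi: int) -> List[int]:
    """Current behavior (simplified): scan everything, then return all candidates."""
    return [p.partition_id for p in catalog if p.may_match(lo, hi)]


def lookup_streaming(catalog: Iterable[PartitionSynopsis], lo: int, hi: int) -> Iterator[int]:
    """Proposed behavior (simplified): hand out each candidate as soon as it is found."""
    for p in catalog:
        if p.may_match(lo, hi):
            yield p.partition_id


# Toy catalog: partition i covers timestamps [10 * i, 10 * i + 9].
catalog = [PartitionSynopsis(i, 10 * i, 10 * i + 9) for i in range(1000)]

# The consumer can start evaluating the first candidate after inspecting
# only three synopses instead of waiting for the full scan.
first = next(lookup_streaming(catalog, 25, 1000))
```

Both functions still do O(n) work in total. Streaming only reduces the time until the first candidate partition can be evaluated.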

### Stories
- [ ] https://github.com/tenzir/issues/issues/1070

dominiklohmann commented 7 months ago

This was (indirectly) requested by a customer who has a node with over a million partitions. For some queries I tested, their catalog lookup takes slightly under 4 seconds.

mavam commented 7 months ago

We should note that this is a band-aid fix. There is a related architectural discussion to have about turning the O(n) lookup into, ideally, an O(log(n)) problem.

dominiklohmann commented 5 months ago

We're merging this with https://github.com/tenzir/public-roadmap/issues/123; closing this for now.
