Add GIN index on `splits.tags` (PostgreSQL metastore)

trinity-1686a commented 6 months ago

currently we create no index on any postgres columns, but some indexes could be useful. I think split is the table with the most potential for indexes: it's often the biggest.

it's index_uid column is used in every query, there is probably some substantial gain from having a simple hash index on that:

CREATE INDEX index_splits_index_uid ON splits USING HASH (index_uid);

other column should probably be looked at too, but would likely have a less sever impact

trinity-1686a commented 6 months ago

we addad an index on splits(index_uid), but should still investigate other possible indexes

trinity-1686a commented 6 months ago

I've experimented with creating different indexes, with different level of usefulness:

an index on split_state doesn't appear to get used (probably too low selectivity)
BRIN on time_range_start and time_range_end wasn't as successful as I expected. I assume this is because in the young days of a split, it can get deleted for a merge, leaving holes behind, which get filed by some new split with different timestamps... however a BTree seems to work well
on tags, I've experimented with multiple kind of indexes. Most aren't useful as we want to match one of the values, not the whole set of tags. A GIN can do that, with caveats below
columns in *_timestamp could use a BTree to make some bookkeeping tasks faster, but that doesn't seem to important, gaining a few 100ms there is not noticeable in practice as these are background tasks

GIN on tags: we currently match tags with statements looking like NOT ('field-name!' = ANY(tags)) OR 'field-name:the-value-i-want' = ANY(tags). For some reason, Postgres won't use the index for a = ANY(_) comparison. It does when doing '{field-name:the-value-i-want}' && tags (&& being the 'overlap' operator, returning true if there is anything in common between the list of 1 element we provide, and the list of tags in a split). Postgres also refuses to use an index when a negation is involved (NOT ('field-name!' = ANY(tags)) or NOT ('{field-name!}' && tags). We use that part of the query to match splits for which tags were not generated for this field. field-name! means field-name was recorded, it's absence means we must search the split as we have no idea what values the split may contain. However if I create another tag, let's call it !field-name!, complementary with field-name! (i.e. that fields means "that field wasn't tagged in this split", instead of "that field was tagged in this split", and query using that instead, (so the query becomes '{!field-name!}' && tags OR '{field-name:the-value-i-want}' && tags, postgresql uses the index, going from a ~330ms to a 0.7ms query time on my tests (with ~430k splits). Changing how we represent "this split (doesn't) have information for this tag" can thus have a huge impact. However it comes with a few difficulties. We'll have to write a migration to create the require tag on old splits. We will also have to do something of the kind for the file/s3 based metastore. We also loose a nice feature of the current design: adding a new tagged field to an index is currently just adding it to the index config. After that change, it will be necessary to also add something to all old splits when doing such an update. It'll be manageable once we have proper APIs to update an index, but right now it would be rather annoying doing it manually

tldr: adding BTree on time_range_* would help for queries with timebounds. adding a GIN on tags would be useful on queries with tags, but will need a fair bit more work to get working. These are on the critical path for search, so the speedup they can provide seems rather nice to have. Other indexes on splits either don't seem to get used, or improve performances in places where it has a limited impact.

guilload commented 6 months ago

Let's add BTree indexes on time_range_* fields for now.

trinity-1686a commented 5 months ago

we've added some indexes already. We'll want to do something about tags at some point, but that requires making further changes, and the situation already improved a lot. Going to unassign myself for now

quickwit-oss / quickwit

Add GIN index on `splits.tags` (PostgreSQL metastore) #4658