Open trinity-1686a opened 6 months ago
we addad an index on splits(index_uid), but should still investigate other possible indexes
I've experimented with creating different indexes, with different level of usefulness:
split_state
doesn't appear to get used (probably too low selectivity)time_range_start
and time_range_end
wasn't as successful as I expected. I assume this is because in the young days of a split, it can get deleted for a merge, leaving holes behind, which get filed by some new split with different timestamps... however a BTree seems to work well*_timestamp
could use a BTree to make some bookkeeping tasks faster, but that doesn't seem to important, gaining a few 100ms there is not noticeable in practice as these are background tasksGIN on tags: we currently match tags with statements looking like NOT ('field-name!' = ANY(tags)) OR 'field-name:the-value-i-want' = ANY(tags)
. For some reason, Postgres won't use the index for a = ANY(_)
comparison. It does when doing '{field-name:the-value-i-want}' && tags
(&& being the 'overlap' operator, returning true if there is anything in common between the list of 1 element we provide, and the list of tags in a split).
Postgres also refuses to use an index when a negation is involved (NOT ('field-name!' = ANY(tags))
or NOT ('{field-name!}' && tags)
. We use that part of the query to match splits for which tags were not generated for this field. field-name!
means field-name
was recorded, it's absence means we must search the split as we have no idea what values the split may contain. However if I create another tag, let's call it !field-name!
, complementary with field-name!
(i.e. that fields means "that field wasn't tagged in this split", instead of "that field was tagged in this split", and query using that instead, (so the query becomes '{!field-name!}' && tags OR '{field-name:the-value-i-want}' && tags
, postgresql uses the index, going from a ~330ms to a 0.7ms query time on my tests (with ~430k splits).
Changing how we represent "this split (doesn't) have information for this tag" can thus have a huge impact. However it comes with a few difficulties. We'll have to write a migration to create the require tag on old splits. We will also have to do something of the kind for the file/s3 based metastore. We also loose a nice feature of the current design: adding a new tagged field to an index is currently just adding it to the index config. After that change, it will be necessary to also add something to all old splits when doing such an update. It'll be manageable once we have proper APIs to update an index, but right now it would be rather annoying doing it manually
tldr: adding BTree on time_range_*
would help for queries with timebounds. adding a GIN on tags would be useful on queries with tags, but will need a fair bit more work to get working. These are on the critical path for search, so the speedup they can provide seems rather nice to have. Other indexes on splits
either don't seem to get used, or improve performances in places where it has a limited impact.
Let's add BTree indexes on time_range_*
fields for now.
we've added some indexes already. We'll want to do something about tags at some point, but that requires making further changes, and the situation already improved a lot. Going to unassign myself for now
currently we create no index on any postgres columns, but some indexes could be useful. I think
split
is the table with the most potential for indexes: it's often the biggest.it's
index_uid
column is used in every query, there is probably some substantial gain from having a simple hash index on that:other column should probably be looked at too, but would likely have a less sever impact