quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Add optimization for pure count and count aggregation #5032

Open fulmicoton opened 3 months ago

fulmicoton commented 3 months ago

as observed on airmail

We are slower than Opensearch on

```json
{
  "size": 0,
  "aggs": {
    "hosts":     { "terms": { "field": "_host",     "size": 50 } },
    "apps":      { "terms": { "field": "_app",      "size": 50 } },
    "ingesters": { "terms": { "field": "_ingester", "size": 50 } },
    "tags":      { "terms": { "field": "_tag",      "size": 50 } }
  }
}
```

All of the above can be answered by looking only at the split metadata and/or the term dictionary.

fulmicoton commented 3 months ago

@PSeitz can you take over this issue?

You can ditch or reuse #5034 if you want. For the last part, term aggregations on an all-documents query, it would be nice to detect that this can be answered via the term dictionary alone.

Problem 1) It will require creative software engineering between tantivy and quickwit, as the warm-up is totally different.

Problem 2) Multiplicity of the same value will actually cause different results if we have multi-valued columns (where the same value can appear more than once).

Problem 3) Technically, doing it using the column is faster if the term dictionary has a lot of distinct values and the size of the aggregation is limited. In practice, we can probably just apply a simple rule on the size of the term dictionary (e.g. use it if the term dictionary is < 500 KB, or smaller than the column).
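The rule above could be sketched as a small decision function. The type and field names here are hypothetical, not actual Quickwit types; the sizes would come from the split/column metadata:

```rust
/// Hypothetical per-field size statistics; not an actual Quickwit type.
struct FieldStats {
    term_dict_num_bytes: u64,
    column_num_bytes: u64,
}

/// Sketch of the simple rule: use the term dictionary when it is
/// smaller than 500 KB, or smaller than the column itself.
fn use_term_dictionary(stats: &FieldStats) -> bool {
    const TERM_DICT_LIMIT: u64 = 500 * 1024;
    stats.term_dict_num_bytes < TERM_DICT_LIMIT
        || stats.term_dict_num_bytes < stats.column_num_bytes
}
```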

========

Another optimization opportunity: it would be nice to optimize aggregations over columns for all-documents queries.

We could operate directly over the values, instead of filling a doc-id buffer and then fetching them.
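As a minimal sketch of that idea (illustrative names, not the tantivy API): for an all-documents query, a terms aggregation could iterate the column values directly and count them, rather than materializing doc ids first and looking each one up.

```rust
use std::collections::HashMap;

/// Count term ordinals by iterating the column values directly,
/// skipping the doc-id buffer entirely. Illustrative only.
fn count_terms_directly(column_values: &[u64]) -> HashMap<u64, u64> {
    let mut counts: HashMap<u64, u64> = HashMap::new();
    for &term_ord in column_values {
        *counts.entry(term_ord).or_insert(0) += 1;
    }
    counts
}
```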

PSeitz commented 3 months ago

Count All

_count with AllQuery already works. It uses only split metadata.

Implemented here: https://github.com/quickwit-oss/quickwit/pull/4410

Count All + Top N

_count with AllQuery + top_n currently does a full search.

To support _count + top n, we could go for a hybrid approach: get the counts from the split metadata and the top n from a selected split. The question is which top n we should return for the AllQuery (it's sorted by the highest split id).

Implemented here: https://github.com/quickwit-oss/quickwit/pull/5075
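The hybrid approach could look roughly like this. The `SplitMeta` type and function are illustrative assumptions, not the actual Quickwit implementation; the split with the highest split id is selected, matching the AllQuery sort order mentioned above:

```rust
/// Simplified split metadata; illustrative, not the actual Quickwit type.
struct SplitMeta {
    split_id: String,
    num_docs: u64,
}

/// Sum the total count from per-split metadata and pick a single split
/// (the one with the highest split id) to serve the top-n documents.
fn plan_count_plus_top_n(splits: &[SplitMeta]) -> (u64, Option<&SplitMeta>) {
    let total_count: u64 = splits.iter().map(|s| s.num_docs).sum();
    let selected = splits.iter().max_by(|a, b| a.split_id.cmp(&b.split_id));
    (total_count, selected)
}
```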

Count All + Top N + Sort By Date

This optimization could also be extended to _count with AllQuery + top_n, sorted by timestamp.

Identify splits which can't contain data from the top n, with the time_range metadata:

```rust
/// If a timestamp field is available, the min / max timestamp in
/// the split, expressed in seconds.
pub time_range: Option<RangeInclusive<i64>>,
```

Implemented here:
- https://github.com/quickwit-oss/quickwit/pull/5075 (detect which splits will deliver the results)
- https://github.com/quickwit-oss/quickwit/pull/5048 (passing the threshold between splits during search)
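A sketch of the split pruning, assuming results are sorted by descending timestamp. Only `time_range` mirrors the real metadata field; the rest of the names are illustrative. Splits without a time range sort last here; a real implementation would need to treat them conservatively:

```rust
use std::cmp::Reverse;
use std::ops::RangeInclusive;

/// Simplified split metadata for illustration.
#[derive(Debug, PartialEq)]
struct Split {
    time_range: Option<RangeInclusive<i64>>,
    num_docs: u64,
}

/// Visit the most recent splits first and stop once the accumulated doc
/// count covers the requested top n; older splits can't contribute to
/// the result and are skipped.
fn select_splits_for_top_n(mut splits: Vec<Split>, top_n: u64) -> Vec<Split> {
    splits.sort_by_key(|s| Reverse(s.time_range.as_ref().map(|r| *r.end())));
    let mut selected = Vec::new();
    let mut acc = 0u64;
    for split in splits {
        if acc >= top_n {
            break;
        }
        acc += split.num_docs;
        selected.push(split);
    }
    selected
}
```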

Count All + Top N + (Sort By Date) + Time Range Query

https://github.com/quickwit-oss/quickwit/pull/5048 implements this, but we could also do some early detection of which splits will deliver enough results by extending https://github.com/quickwit-oss/quickwit/pull/5075

Implemented here: https://github.com/quickwit-oss/quickwit/pull/5048 (passing threshold between splits during search)

Count All + Top N + Sort By Other Field

We don't have metadata for other fields in the split metadata, so there are two options:

Not implemented

fulmicoton commented 3 months ago

I could not tell whether your text is describing the current state or not. For count-all, we already have an optimization in main, but one thing to consider is COUNT + TIME RANGE QUERY.

In that case, the optimization in main won't kick in. The one in this PR will.

The idea is that if a split is entirely within the range query, we transform the range query into an all query.
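The containment check described above is a one-liner; this sketch assumes inclusive second timestamps on both sides, as in the split metadata, and an illustrative function name:

```rust
use std::ops::RangeInclusive;

/// If the split's min/max timestamps lie entirely inside the query's
/// time range, the range query is equivalent to a match-all query for
/// that split, so the count-all optimization can kick in.
fn can_rewrite_to_all_query(
    query_range: &RangeInclusive<i64>,
    split_range: &RangeInclusive<i64>,
) -> bool {
    query_range.start() <= split_range.start() && split_range.end() <= query_range.end()
}
```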

PSeitz commented 3 months ago

I updated the comment to make the currently implemented state clearer.

On using the term dictionary for aggregations: this would only work if the same tokenizer is used (which isn't the case by default). In the fast field dictionary we currently don't store counts. It's also quite special; e.g. it would not support any nested aggregations.

Given how special it is and how easily cacheable the query is, I'm not sure the payoff is high enough to add a new code path in the aggregations for this.

> We could directly operate over values, instead of filling a doc id buffer, and fetching those.

I think we could detect whether an incoming block of doc ids is contiguous in collect_block and do some optimized fetching of the values. In that case we would only need the start and end index.
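The contiguity detection could be sketched like this (illustrative names, not the actual tantivy collect_block signature):

```rust
use std::ops::Range;

/// If the doc ids in a block form a contiguous run, return the
/// start..end range so the values can be fetched with a single range
/// read instead of per-doc lookups; otherwise return None.
fn contiguous_range(docs: &[u32]) -> Option<Range<u32>> {
    let first = *docs.first()?;
    let last = *docs.last()?;
    if docs.windows(2).all(|w| w[1] == w[0] + 1) {
        Some(first..last + 1)
    } else {
        None
    }
}
```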