fulmicoton opened 3 months ago
@PSeitz, can you take over this issue?
You can ditch or use #5034 if you want. For the last part, 'term aggregations on all query', it would be nice to detect that this can be addressed via the term dictionary alone.
Problem 1) It will require creative software engineering between tantivy and quickwit, as the warm-up is totally different.
Problem 2) Multiplicity of the same value will actually cause different results if we have multivalued columns (and the same value can appear more than once).
Problem 3) Technically, doing it using the column is faster if the term dictionary has a lot of distinct values and the size of the aggregation is limited. In practice, we can probably just apply a simple rule on the size of the term dictionary (e.g.: term dictionary < 500 KB, or smaller than the column).
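A minimal sketch of the rule from Problem 3, purely illustrative (the names `AggPath`, `pick_agg_path`, and the threshold constant are hypothetical, not tantivy/quickwit API):

```rust
/// Hypothetical threshold from the discussion: 500 KB.
const TERM_DICT_LIMIT_BYTES: u64 = 500 * 1024;

#[derive(Debug, PartialEq)]
enum AggPath {
    TermDictionary,
    Column,
}

/// Simple rule: use the term dictionary if it is smaller than 500 KB
/// or smaller than the column itself; otherwise use the column.
fn pick_agg_path(term_dict_bytes: u64, column_bytes: u64) -> AggPath {
    if term_dict_bytes < TERM_DICT_LIMIT_BYTES || term_dict_bytes < column_bytes {
        AggPath::TermDictionary
    } else {
        AggPath::Column
    }
}

fn main() {
    assert_eq!(pick_agg_path(100 * 1024, 10 * 1024), AggPath::TermDictionary);
    assert_eq!(pick_agg_path(2_000_000, 1_000_000), AggPath::Column);
}
```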
========
Another optimization opportunity: it would be nice to have an optimization for aggregations over columns for all queries.
We could directly operate over values, instead of filling a doc id buffer, and fetching those.
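To illustrate the difference, here is a toy sketch (the `Column` struct is a stand-in, not tantivy's columnar type): for an AllQuery every doc matches, so the doc-id buffer is pure indirection.

```rust
/// Stand-in for a dense fast-field column: one value per doc id.
struct Column {
    values: Vec<u64>,
}

/// What the current path does conceptually: fill a doc-id buffer,
/// then fetch the value for each doc id.
fn sum_via_docid_buffer(col: &Column) -> u64 {
    let doc_ids: Vec<u32> = (0..col.values.len() as u32).collect();
    doc_ids.iter().map(|&doc| col.values[doc as usize]).sum()
}

/// The proposed shortcut for AllQuery: iterate the values directly.
fn sum_direct(col: &Column) -> u64 {
    col.values.iter().sum()
}

fn main() {
    let col = Column { values: vec![3, 1, 4, 1, 5] };
    assert_eq!(sum_via_docid_buffer(&col), sum_direct(&col));
    assert_eq!(sum_direct(&col), 14);
}
```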
`_count` with AllQuery works already. Uses only metadata.
Implemented here: https://github.com/quickwit-oss/quickwit/pull/4410
`_count` with AllQuery + top_n: does a full search currently.
To support `_count` + top_n, we could go for a hybrid approach: get the counts from the split metadata and the top n from a selected split.
The question would be which top n we should get for the AllQuery (it's sorted by the highest split id).
Implemented here: https://github.com/quickwit-oss/quickwit/pull/5075
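A rough sketch of the hybrid idea; `SplitMeta` and its fields are illustrative stand-ins, not quickwit's actual metadata types:

```rust
struct SplitMeta {
    split_id: u64,
    num_docs: u64,
}

/// Returns the total `_count` (pure metadata, no search needed) and the id
/// of the split selected to serve the top-n hits -- here the highest split
/// id, matching the sort order mentioned in the discussion. Only that split
/// would actually be searched for the n documents.
fn hybrid_count_top_n(splits: &[SplitMeta]) -> (u64, Option<u64>) {
    let total: u64 = splits.iter().map(|s| s.num_docs).sum();
    let top_n_split = splits.iter().map(|s| s.split_id).max();
    (total, top_n_split)
}

fn main() {
    let splits = [
        SplitMeta { split_id: 1, num_docs: 100 },
        SplitMeta { split_id: 7, num_docs: 250 },
    ];
    assert_eq!(hybrid_count_top_n(&splits), (350, Some(7)));
}
```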
This optimization could also be extended to `_count` with AllQuery + top_n, sorted by timestamp.
Identify splits which can't contain data from the top n, using the `time_range` metadata:

```rust
/// If a timestamp field is available, the min / max timestamp in
/// the split, expressed in seconds.
pub time_range: Option<RangeInclusive<i64>>,
```
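A sketch of what the pruning check could look like for descending timestamp sort (the `SplitMeta` type and `can_contain_top_n` name are hypothetical; only the `time_range` field is from the metadata quoted above):

```rust
use std::ops::RangeInclusive;

struct SplitMeta {
    /// Min / max timestamp in the split, in seconds.
    time_range: Option<RangeInclusive<i64>>,
}

/// When sorting by timestamp descending, a split whose newest document is
/// older than the current top-n threshold cannot contribute any hit and
/// can be skipped entirely.
fn can_contain_top_n(split: &SplitMeta, threshold_ts: i64) -> bool {
    match &split.time_range {
        Some(range) => *range.end() >= threshold_ts,
        // No timestamp metadata: we cannot prune, must search the split.
        None => true,
    }
}

fn main() {
    let old = SplitMeta { time_range: Some(100..=200) };
    let new = SplitMeta { time_range: Some(500..=900) };
    assert!(!can_contain_top_n(&old, 300));
    assert!(can_contain_top_n(&new, 300));
    assert!(can_contain_top_n(&SplitMeta { time_range: None }, 300));
}
```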
Implemented here: https://github.com/quickwit-oss/quickwit/pull/5048 (passing the threshold between splits during search). We could also do some early detection of which splits will deliver enough results by extending https://github.com/quickwit-oss/quickwit/pull/5075 (detect which splits will deliver the results).
We don't have metadata for other fields in the split metadata, so there are two options:
- Extend the metadata beyond `time_range`; this could also be automatic for all defined numeric fields.
- With a `CanSplitBeBetter` check or similar, we could do some early pruning.

Not implemented.
I could not tell if your text is describing the current state or not. For count all, we already have an optimization in main... but one thing to consider is COUNT + TIME RANGE QUERY.
In that case, the optimization in main won't kick in. The one in this PR will.
The idea is that if a split is entirely within the range query, we transform the range query into an all query.
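As a toy sketch of that rewrite (the `Query` enum and `rewrite_for_split` are stand-ins, not quickwit's types): if the split's [min, max] timestamp range lies entirely inside the query range, every document in the split matches, so the range query degenerates to an AllQuery and the count-all optimization can kick in.

```rust
use std::ops::RangeInclusive;

enum Query {
    All,
    TimestampRange(RangeInclusive<i64>),
}

/// Per-split rewrite: a range query covering the whole split
/// is equivalent to an AllQuery on that split.
fn rewrite_for_split(query: Query, split_range: &RangeInclusive<i64>) -> Query {
    match query {
        Query::TimestampRange(q)
            if q.contains(split_range.start()) && q.contains(split_range.end()) =>
        {
            Query::All
        }
        other => other,
    }
}

fn main() {
    let split = 100..=200;
    // Query range fully covers the split: becomes AllQuery.
    assert!(matches!(
        rewrite_for_split(Query::TimestampRange(0..=1000), &split),
        Query::All
    ));
    // Query range only partially overlaps the split: unchanged.
    assert!(matches!(
        rewrite_for_split(Query::TimestampRange(150..=1000), &split),
        Query::TimestampRange(_)
    ));
}
```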
I updated the comment to make the currently implemented state clearer.
On using the term dictionary for aggregations: this would only work if the same tokenizer is used (which isn't the case by default). In the fast field dictionary we don't currently store counts. It's also quite special, e.g. it would not support any nested aggregations.
Given how special it is and how easily cacheable the query is, I'm not sure the payoff is high enough to add a new code path in the aggregations for this.
> We could directly operate over values, instead of filling a doc id buffer, and fetching those.
I think we could detect in `collect_block` whether an incoming block of doc ids is contiguous and do some optimized fetching of the values. In that case we would just require the start and end index.
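A minimal sketch of that detection, assuming the block of doc ids is sorted and deduplicated (the function name is hypothetical, not a tantivy API):

```rust
/// If `docs` is a contiguous run of doc ids, return the half-open
/// (start, end) range; otherwise None. For sorted, deduplicated ids,
/// contiguity is equivalent to this simple length check.
fn contiguous_range(docs: &[u32]) -> Option<(u32, u32)> {
    let (first, last) = (*docs.first()?, *docs.last()?);
    if last - first + 1 == docs.len() as u32 {
        Some((first, last + 1)) // a single ranged value fetch suffices
    } else {
        None // fall back to per-doc-id value lookups
    }
}

fn main() {
    assert_eq!(contiguous_range(&[4, 5, 6, 7]), Some((4, 8)));
    assert_eq!(contiguous_range(&[4, 6, 7]), None);
    assert_eq!(contiguous_range(&[]), None);
}
```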
As observed on airmail, we are slower than OpenSearch on:
- `_count` on AllQuery
- `_count` queries on term queries
- queries on AllQuery

All of the above can be addressed just by looking at the split metadata and/or the term dictionary.