Open sandeshkr419 opened 7 months ago
Starting this thread to discuss ideas for optimizing multi-terms aggregation.

Sample query:
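A minimal illustrative `multi_terms` request (the index and field names are placeholders, not from the original issue):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "by_two_fields": {
      "multi_terms": {
        "terms": [
          { "field": "field1" },
          { "field": "field2" }
        ]
      }
    }
  }
}
```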
Current flow overview: for each matching document, read the value of every aggregated field and increment the count of the composite bucket keyed on the tuple of those values.
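A rough sketch of that flow in Java (illustrative only, not the actual `MultiTermsAggregator` internals; the doc-id array and per-field lookups stand in for the collector and doc-values readers):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

/** Illustrative doc-at-a-time bucketing: one pass, one increment per matching doc. */
class DocAtATimeSketch {
    private final Map<List<String>, Long> bucketCounts = new HashMap<>();

    void collect(int[] matchingDocs, List<IntFunction<String>> fieldValueLookups) {
        for (int doc : matchingDocs) {
            // Compose the bucket key from this doc's value in each aggregated field.
            List<String> key = fieldValueLookups.stream().map(f -> f.apply(doc)).toList();
            bucketCounts.merge(key, 1L, Long::sum);
        }
    }
}
```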
Initial ideas for optimization: check whether, for certain scenarios, it makes sense to start the execution from the postings data instead. For example, enumerate the possible buckets and then intersect the postings of the corresponding terms to find the document set of each bucket. Finding doc intersections across different fields is something we can experiment with to see whether it offers any performance advantage over the current workflow.
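A minimal sketch of that postings-driven alternative, assuming postings are exposed as sorted doc-id arrays (Lucene's actual `PostingsEnum` would be walked with `nextDoc()`/`advance()` instead):

```java
/** Two-pointer intersection over sorted doc-id lists, one call per candidate term pair. */
final class PostingsIntersectionSketch {
    static long intersectionCount(int[] postingsA, int[] postingsB) {
        long count = 0;
        int i = 0, j = 0;
        while (i < postingsA.length && j < postingsB.length) {
            if (postingsA[i] == postingsB[j]) { count++; i++; j++; } // doc has both terms
            else if (postingsA[i] < postingsB[j]) i++;               // advance the smaller doc id
            else j++;
        }
        return count;
    }
    // Bucket (t1, t2) would get intersectionCount(postings(field1, t1), postings(field2, t2)),
    // evaluated for every candidate term pair.
}
```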
Linking https://github.com/opensearch-project/OpenSearch/pull/14993 here as it contributes to improving multi-terms aggregation.

So I started thinking of some ideas.

One idea was to take the intersection of the postings document sets for the two fields (in case two fields are involved in the multi-terms aggregation). But when doing some basic math around time complexity, it turns out that the resulting cost can be greater than the present approach of iterating through all documents in the match set. Also, intersecting two postings lists is only correct for a top-level match-all query with no document deletes.
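A rough back-of-envelope for that comparison, assuming single-valued fields, $N$ matching documents, and per-field term cardinalities $V_1$ and $V_2$: a two-pointer intersection of two postings lists costs $O(\mathrm{df}(t_1) + \mathrm{df}(t_2))$, and every candidate bucket is a term pair, so the total work is

$$
\sum_{t_1 \in T_1} \sum_{t_2 \in T_2} \big(\mathrm{df}(t_1) + \mathrm{df}(t_2)\big) \;=\; V_2 \sum_{t_1} \mathrm{df}(t_1) + V_1 \sum_{t_2} \mathrm{df}(t_2) \;=\; (V_1 + V_2)\,N,
$$

versus roughly $2N$ value lookups for the current single pass, i.e. worse whenever $V_1 + V_2 > 2$. Skip-based advancing shrinks the constants but not the $V_1 \cdot V_2$ pair enumeration.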
As an extension of the above strategy, we also thought about whether we could cut some of the intersections short by looking at term frequencies. The idea was to get rid of buckets with low cardinality early, but the problem is that those quick terminations can only be made at the segment level: if the field values are not uniformly distributed, we might discard buckets that have high cardinality in other segments.
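A sketch of that segment-level cutoff using Lucene's `TermsEnum` (the `minDocFreq` threshold and field name are made up for illustration); `docFreq()` bounds any bucket containing the term within the current segment only, which is exactly why the termination is unsafe globally:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/** Skip terms whose segment-level docFreq is below a cutoff before intersecting. */
final class DocFreqPruningSketch {
    static void pruneAndIntersect(LeafReader segmentReader, String field, int minDocFreq) throws IOException {
        Terms terms = segmentReader.terms(field);
        if (terms == null) return;
        TermsEnum termsEnum = terms.iterator();
        for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
            // docFreq() bounds the bucket size in this segment only; a term that is
            // rare here may still be frequent in other segments, so skipping it can
            // drop buckets that are globally large.
            if (termsEnum.docFreq() < minDocFreq) continue;
            // ... otherwise intersect this term's postings with the other field's candidate terms
        }
    }
}
```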
Let me see if I can find more possible optimizations.