
Multi-terms Aggregation Performance Optimization #13120

Open sandeshkr419 opened 3 months ago

sandeshkr419 commented 3 months ago

Starting this thread to discuss ideas for optimizing multi-terms aggregation.

Sample query:

```json
{
  "size": 0,
  "aggs": {
    "important_terms": {
      "multi_terms": {
        "terms": [
          {
            "field": "process.name"
          },
          {
            "field": "cloud.region"
          }
        ]
      }
    }
  }
}
```

Current flow overview: for each matching document, read the values of the requested fields, compose a bucket key from them, and increment the count of that composite bucket.
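
For reference, a minimal sketch of that flow; the class and method shapes here are purely illustrative, not the actual MultiTermsAggregator code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the shape of the current flow, not the actual
// MultiTermsAggregator code. Each matching document contributes one
// increment to the bucket keyed by its combination of field values.
public class CurrentFlowSketch {
    static Map<List<String>, Long> aggregate(Iterable<Map<String, String>> matchingDocs, List<String> fields) {
        Map<List<String>, Long> bucketCounts = new HashMap<>();
        for (Map<String, String> doc : matchingDocs) {
            // Compose the bucket key, e.g. (process.name, cloud.region).
            List<String> key = fields.stream().map(doc::get).toList();
            bucketCounts.merge(key, 1L, Long::sum); // increment the composite bucket
        }
        return bucketCounts;
    }
}
```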

Initial ideas for optimization: explore whether, in certain scenarios, it makes sense to start execution from the postings data instead. For example, enumerate the possible buckets and then intersect the postings lists of different terms to find the documents belonging to each composite bucket. Experimenting with doc-set intersections across fields would tell us whether this offers any performance advantage over the current workflow.
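
As a rough sketch of the postings-driven alternative, written against Lucene's `PostingsEnum` API rather than OpenSearch internals: for one candidate bucket (one term from each field), leapfrog the two sorted postings lists and count the documents they share.

```java
import java.io.IOException;

import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.search.DocIdSetIterator;

public class PostingsIntersection {
    // Leapfrog intersection of two sorted postings lists: counts the documents
    // containing both terms, i.e. the size of one composite bucket. Only valid
    // as-is for a match-all top-level query with no deleted documents.
    static long intersectCount(PostingsEnum a, PostingsEnum b) throws IOException {
        long count = 0;
        int docA = a.nextDoc();
        int docB = b.nextDoc();
        while (docA != DocIdSetIterator.NO_MORE_DOCS && docB != DocIdSetIterator.NO_MORE_DOCS) {
            if (docA == docB) {
                count++;                // doc contains both terms -> in this bucket
                docA = a.nextDoc();
                docB = b.nextDoc();
            } else if (docA < docB) {
                docA = a.advance(docB); // jump ahead to at least docB
            } else {
                docB = b.advance(docA); // jump ahead to at least docA
            }
        }
        return count;
    }
}
```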

sandeshkr419 commented 2 months ago

So I started thinking of some ideas.

One idea was to take the intersection of the postings document sets for the two fields (in the case of a multi-terms aggregation over two fields). However, some basic math around time complexity suggests that the resulting complexity might be greater than the present approach of iterating through all documents in the match set. Also, intersecting two postings lists only works for a match-all top-level query with no document deletes.

Some math I brainstormed with @msfroh offline:

- Assume `D` documents and two fields, `field1` and `field2`, each with cardinality `n` for simplicity, with values uniformly distributed. Then each term's postings list contains `D/n` documents. (1)
- A multi-terms aggregation on `field1` and `field2` yields `n^2` candidate buckets. (2)
- Since postings lists are sorted by doc ID, each bucket intersection is a linear merge costing O(D/n). From (1) and (2), the total cost is `n^2 * O(D/n)` = O(n * D). (3)
- If we instead use the value source (as the present code does), the complexity is (cost of fetching the value source) * O(D). (4)

It seems that (4) < (3), since I would guess the cost of fetching the value source does not exceed `n`. If we could make each bucket intersection substantially cheaper than O(D/n), we might have a chance. The two time complexities are not an entirely apples-to-apples comparison, but it does look like this approach might not work; then again, there may be gaps we have not yet discovered.
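
To put rough numbers on (3) versus (4), here is a throwaway back-of-the-envelope calculation; the values of `D`, `n`, and the value-source cost `c` are made up purely for illustration:

```java
// Back-of-the-envelope comparison of (3) and (4); every number is assumed.
public class CostModel {
    public static void main(String[] args) {
        long D = 10_000_000L; // documents in the match set (assumed)
        long n = 1_000L;      // cardinality of each field (assumed)
        long c = 10L;         // assumed per-doc cost of fetching the value source, c <= n

        long postingsCost = n * n * (D / n); // (3): n^2 buckets, O(D/n) merge each => n * D
        long valueSourceCost = c * D;        // (4): one pass over the match set

        System.out.println("postings approach     ~ " + postingsCost);    // 10,000,000,000
        System.out.println("value-source approach ~ " + valueSourceCost); // 100,000,000
    }
}
```

With these made-up numbers the postings approach does about 100x more work, and the gap only closes if the per-bucket intersection cost can be driven well below O(D/n).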

As an extension of the above strategy, we also considered whether we could cut short some of the intersections by looking at term frequencies. The idea was to skip low-frequency buckets early, but the problem is that such early terminations can only be made at the segment level: if field values are not uniformly distributed, we might discard buckets that have high counts in other segments. A sketch of this pruning follows below.
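
A sketch of what that segment-level pruning might look like, using Lucene's per-segment `TermsEnum.docFreq()`; the `minDocFreq` threshold and the `frequentTerms` helper are invented for illustration:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class SegmentPruningSketch {
    // Hypothetical segment-level pruning: collect only the terms of a field
    // whose doc frequency in this segment reaches minDocFreq, so that
    // low-frequency terms never enter the bucket-intersection phase. Caveat
    // from the discussion above: a term pruned here may still be frequent in
    // other segments, so this is only safe with a cross-segment strategy.
    static List<BytesRef> frequentTerms(LeafReader reader, String field, int minDocFreq) throws IOException {
        List<BytesRef> result = new ArrayList<>();
        Terms terms = reader.terms(field);
        if (terms == null) {
            return result; // field not indexed in this segment
        }
        TermsEnum te = terms.iterator();
        for (BytesRef term = te.next(); term != null; term = te.next()) {
            if (te.docFreq() >= minDocFreq) {
                result.add(BytesRef.deepCopyOf(term)); // TermsEnum reuses the BytesRef
            }
        }
        return result;
    }
}
```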

Let me see if I can find other possible optimizations.