Rare Terms Aggregation Performance Optimization

opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.

https://opensearch.org/docs/latest/opensearch/index/

Apache License 2.0

Rare Terms Aggregation Performance Optimization #13122

Open sandeshkr419 opened 3 months ago

sandeshkr419 commented 3 months ago

I'm unsure how the Rare Terms aggregation currently performs, but from a high-level look at the code, it appears that this aggregation also iterates through each document.

The idea is to use the term frequencies from Lucene, similar to https://github.com/opensearch-project/OpenSearch/pull/11643, and avoid iterating through individual documents.
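To make the idea concrete, here is a minimal sketch in plain Python rather than the actual OpenSearch or Lucene code. It models each segment as a list of documents and compares counting by document iteration with summing per-segment document frequencies that were already recorded at index time. All function names and the sample data are hypothetical. One assumption worth stating: stored document frequencies know nothing about the query or about deleted documents, so a shortcut like this would presumably only apply when every live document in the segment matches.

```python
from collections import Counter

def doc_counts_by_iteration(segments):
    """Baseline: visit every document and count the terms it contains."""
    counts = Counter()
    for segment in segments:
        for doc_terms in segment:
            counts.update(doc_terms)
    return counts

def build_term_dictionary(segment):
    """Stand-in for the per-segment term dictionary built at index time.

    Maps each term to the number of documents in the segment containing it.
    """
    doc_freq = Counter()
    for doc_terms in segment:
        doc_freq.update(doc_terms)
    return dict(doc_freq)

def doc_counts_from_dictionaries(dictionaries):
    """Proposed shortcut: sum stored document frequencies across segments.

    Work is proportional to the number of unique terms, not documents.
    """
    counts = Counter()
    for dictionary in dictionaries:
        for term, doc_freq in dictionary.items():
            counts[term] += doc_freq
    return counts

def rare_terms(counts, max_doc_count=1):
    """Keep only terms that appear in at most max_doc_count documents."""
    return sorted(term for term, count in counts.items() if count <= max_doc_count)

# Hypothetical data: two segments, each document is the set of terms in one field.
segments = [
    [{"a", "b"}, {"a"}],
    [{"c"}, {"a", "d"}],
]
dictionaries = [build_term_dictionary(segment) for segment in segments]

by_docs = doc_counts_by_iteration(segments)
by_dicts = doc_counts_from_dictionaries(dictionaries)
rare = rare_terms(by_dicts, max_doc_count=1)
```

Both paths give the same counts here; the difference is that the second one never touches individual documents at query time, which is where the savings would come from if the real aggregator can do the same.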

Next Steps:

Measure the current performance of the rare terms aggregation
Improve the implementation if the idea above turns out to be feasible
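For the measurement step, a rough sketch of what a baseline harness might look like, assuming latencies are read from the `took` field (milliseconds) of search responses. The field name `genre`, the aggregation name, and the helper names are hypothetical, and the sample responses are placeholders used only to exercise the helper, not real measurements.

```python
import json
import statistics

def rare_terms_request(field, max_doc_count=1, agg_name="rare_values"):
    """Build a search body that runs only a rare_terms aggregation.

    "size": 0 suppresses hits so the timing mostly reflects the aggregation.
    """
    return {
        "size": 0,
        "aggs": {
            agg_name: {
                "rare_terms": {"field": field, "max_doc_count": max_doc_count}
            }
        },
    }

def summarize_took(responses):
    """Summarize the server-side "took" values from a list of search responses."""
    took = [response["took"] for response in responses]
    return {
        "runs": len(took),
        "median_ms": statistics.median(took),
        "max_ms": max(took),
    }

body = rare_terms_request("genre")  # "genre" is a hypothetical keyword field
payload = json.dumps(body)          # what would be sent to the _search endpoint

# Placeholder responses, NOT real measurements.
sample_responses = [{"took": 10}, {"took": 20}, {"took": 30}]
summary = summarize_took(sample_responses)
```

Running the same body before and after any change, on the same data set and with a warm cache, should give comparable numbers for the first next step.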

peternied commented 2 months ago

[Triage - attendees 1 2 3 4 5 6]

This looks like a duplicate of https://github.com/opensearch-project/OpenSearch/issues/13124

@sandeshkr419 Let's make these issues distinct if they need to be tracked separately, but overall, capturing ideas around aggregation performance seems like a single topic.

sandeshkr419 commented 2 months ago

Hi @peternied - I'm keeping these issues separate since the underlying search operations, their code flows, and the ideas to optimize them will be different. They do fall under the aggregation category, and there is a probability that they will share some optimization ideas, but for now let's track each of them separately without one being influenced by the other.