[RFC] Query insights framework

Is your feature request related to a problem? Please describe. OpenSearch stands as a versatile, scalable, open-source solution designed for diverse data exploration needs, ranging from interactive log analytics to real-time application monitoring. Despite its capabilities, OpenSearch users and administrators often encounter challenges in ensuring optimal search performance due to limited expertise or OpenSearch's current constraints in providing comprehensive data points on query executions. Common questions include:

Identification of top queries within a specific timeframe (“what are the top queries in the last 1 hour”).
Profiling users with the highest search query volumes (“how do I associate queries to users”).
Concerns about slow search queries (“why my search queries are so slow”).
Spikes in query latency (“why there was a spike in my search latency chart”).

The overarching objective of the Query Insights initiative is to address these issues by building frameworks, APIs, and dashboards, with minimal performance impact, to offer profound insights, metrics and recommendations into query executions, empowering users to better understand search query characteristics, patterns, and system behavior during query execution stages. Query Insights will facilitates enhanced detection, diagnosis, and prevension of query performance issues, ultimately improving query processing performance, user experience, and overall system resilience.

Let's discuss the scope and components of the framework!

Describe the solution you'd like

As we briefly discussed in this RFC, We want to design and build a robust framework that efficiently handles data collection, storage, processing, and export for query insights data. We need to build this framework in a resource efficient manner to minimize the impact on search performance. Also, we need to focus on the extensibility of the framework to ensure new metrics and the associated analysis and insights associated can be added easily.

The framework should have these main components: data collection, data storage and process, recommendation engine, and data export.

Collectors: Within OpenSearch, these components gather performance-related data points at various stages of search query executions.
Processors: Built in the Query Insights Plugin, these components perform lightweight aggregation and processing on data collected by the collectors.
Recommendation Engines: These components generate recommendations based on point-in-time query insights data within a cluster.
Customer experience: Various customer touch points, such as APIs, dashboards, metrics, and exporters, facilitate the presentation of insights and recommendations to customers.

The interactions between these components are illustrated in the chart below.

Data collection workflow, executed by request listeners, span listeners, or other components, channels information to one or more in-memory storage units for further analysis and post-processing. Subsequently, asynchronous processors kick in and analyze the data, generate insights and results (potentially utilizing stored historical data) - the query insights dashboard will also be using the analyzed and aggregated data to display the query insights charts. After that, the results will be handled by certain asynchronous exporters to export to different sinks.

Describe alternatives you've considered As discussed in this comment of the Top N query RFC, we can potentially leverage the OPTL collector when it becomes available and migrate certain aggregation logic from the query insights components to OPTL collectors outside of OpenSearch process. With this approach, we can send traces/spans to OPTL collectors, where the collector takes responsibility for necessary calculations, aggregations and export. This strategy could further reduce the impact on the OpenSearch process.

Additional context Some interesting discussions around this topic in the comments of: https://github.com/opensearch-project/OpenSearch/issues/11186

I want to further elaborate on the collectors and processors concepts in Query Insights scope.

Collectors

Data collectors retrieve performance-related data during different phases of search query execution. While there are various types of collectors based on feature requirements, they should all belong to three major types:

Search Request Listener-based Collectors:

Utilizing the recently introduced SearchRequestOperationsListener and the support for dynamically adding the listeners, We are now able to get information in each search phases and surface it to different workflows. Depending on use cases, listeners with corresponding metrics collection workflows will be implemented for Query Insights. At the end of each phase or each search query, the listeners will forward the collected data to one or more in-memory storage units asynchonously for further analysis and post-processing.

Alternatively, collected data can be stored in a Response Context, utilizing a global search pipeline processor to asynchronously forward all data to query insights processors. This minimizes duplication of metrics collection in different listeners and addresses challenges associated with reusing metrics by decoupling listeners with specific features.

OpenTelemetry-based Collectors:

The OpenTelemetry span listeners can also be used as data collectors for query insights features. Leveraging the resource tracking framework, we are able to capture request-level resource usages (e.g. CPU and heap usage). Similar to the Search request listener-based collectors, The data collected by the span listeners will also be forwarded to the processors asynchonously.

OpenTelemetry Metrics

OPTL Metrics also serve as a crucial data source that Query Insights features can depend on. The metrics are collected within OpenSearch and forwarded to OPTL collectors. Query Insights features can leverage those OpenTelemetry Metrics to generate valuable insights and recommendations as well. To learn more about OpenTelemetry and distributed tracing in OpenSearch, please refer to this issue.

Performance Analyzer collectors:

As an OpenSearch plugin, the open-source Performance Analyzer (PA) is one of the key autotune components that collects fine-grained system and service level metrics (see exhaustive metrics list here) from the OpenSearch cluster. Integrating PA metrics into the Query Insights and recommendations generation workflow involves correlating each request with its corresponding PA metrics datapoints. This correlation provides access to detailed thread-level resource metrics, significantly enhancing the depth and precision of insights delivered to users.

Processors

As part of the query insights plugin, the processors perform lightweight aggregation and processing on data gathered by collectors. Backed by the OpenSearch internal thread pool, these processors operate asynchronously in nature. They store and analyze point-in-time data and generate point-in-time insights. Specific processors will be implemented based on the needs of different query insights features. To illustrate, consider the Top-N queries query insights feature: a dedicated processor is implemented to efficiently store query data with latency and resource usage information in a priority queue (up to N query data points).

The detailed interactions between collectors and processors are visually depicted in the diagram below.

Alternatively, As discussed in this comment of the Top N query RFC, we can potentially build a customized OpenTelemetry collector and implement certain aggregation and processing logic in the OpenTelemetry collector outside of OpenSearch process. With this approach, we can send traces/spans to OPTL collectors, where the collector takes responsibility for necessary calculations, aggregations and export. This strategy could further reduce the performance impact on the OpenSearch process. However, this approach introduces certain overhead in creating a customized OpenTelemetry collector, and also reduces the ability to expose the calculated insights with an API or dashboard. Therefore, a thorough evaluation of factors like feature availability, recommendation SLA, and cost is very important when determining the preferred insights processors.

On the recommendation side, we can potentially utilize Performance Analyzer RCA to offer simple rule-based recommendations. An example scenario is identifying an unbalanced query due to the inadvertent use of the "_routing" parameter. As Illustrated in the above architecture diagram, The RCA agent will be responsible for reading and interpreting query insights data, integrating with PA metrics, and generating recommendations. These recommendations can be write back to the cluster through the query insights plugin, making them accessible through the dashboard. Or, they can be simply exposed through an API of the RCA agent.

Alternatively, we can embed recommendation rules directly within the Query Insights Plugin. While this streamlines the point-in-time recommendation generation workflow, it introduces a performance impact on the OpenSearch process. But ideally, the core OpenSearch process should only handle metric instrumentation and lightweight post-processing, with more resource-intensive tasks like data analysis, correlation, and recommendation generation moved outside the core, minimizing impact on critical processes. We can implement certain simple rule-based, query specific recommendations for only the top queries, and benchmark the performance impact there.

Choosing between these approaches also requires careful consideration of recommendation SLA, cost, and availability. A thorough evaluation of trade-offs is essential to determine the most suitable approach :).

This is a phenomenal effort that will greatly contribute to understanding and improving query processing performance!

I wanted to mention here that there is a complementary effort underway to understand the behavior of users of search -- especially what they do with the search results after they are returned. How often do they not click on anything? How often do they click on result #3? Do some kinds of queries perform better than others? This data is extremely valuable for tuning search ranking, either manually or using machine learning (e.g. Learn To Rank).

The two kinds of information -- server-side and user-side -- can often be useful together, so we need to be able to correlate/join front-end and back-end processing. Your comments at User Behavior Logging and Insights #12084 will be most welcome.

Quick question here - why are we creating plugins inside the project? is there a plan to do either of the following:

extract this plugin to a separate repo
introduce the code to core by default

Quick question here - why are we creating plugins inside the project? is there a plan to do either of the following:

extract this plugin to a separate repo

introduce the code to core by default ?

We may do number 1, but unlikely to do number 2. (But maybe we'll do number 2? The future is uncertain.)

If we want to do number 1, the plan is essentially just "Move the source to a separate repo". Functionally, a plugin in the OpenSearch repo is no different from a plugin in a separate repo. The real distinction tends to be more bureaucratic (who has maintainer rights over the repo?) and more build-related (it's yet another cat that needs to be herded as part of a release). This could also be a totally reversible decision (i.e. a plugin can move either way between the core repo and a separate repo without impacting the user experience -- either way, it's a plugin that you need to opt-in to).

We're probably less likely to do number 2 (I think -- but may be wrong). If we did, it would more likely be a move from /plugins/ to /modules/, where the distinction is that modules are all loaded by default. The modularity -- query insights depends on core, but core doesn't know about query insights -- is (IMO) a good thing that we wouldn't want to lose by adding query insights into the giant ball of wax that is the /server/ directory. Arguably, if we go into a module, it's a less reversible decision (once something is loaded by default, it's harder to take it away without breaking people).

I'm of the opinion that the current "neither here nor there" approach is "fine" and it's really easy (mostly file-moving) to go with one of the other approaches down the line, if it makes sense.

@AmiStrn I'm curious to hear if you have an opinion here that motivated your question? From my perspective, I agree with what @msfroh said above and my general thought here is that we should work towards reducing the cat-herding overhead so that the external repo is the right choice for most features.

I am very much in favor of moving it outside the core. I am unfamiliar with this particular plugin, and I wanted to give the discussion some space in case it was something that should have been a core feature.

adding @reta's commented on a different place i posted (wasn't sure where I would get an answer for this one).

@msfroh

The real distinction tends to be more bureaucratic (who has maintainer rights over the repo?)

I am not underestimating that at all. we want to encourage people to become maintainers, if they want to write a plugin and maintain it that makes sense. It makes less sense that the core maintainers will have to take responsibility for this feature, or that the person who wrote it becomes a core maintainer even though they wrote code that is not considered "core".

This is as bureaucratic as it is technical, having some standards makes navigating the project (more) predictable and easy.

@andrross

we should work towards reducing the cat-herding overhead

I agree, though the pain of having the core tests pipeline fail due to a plugin vs having the plugin fail on its pipeline and have dedicated maintainers handle it may outweigh the building overhead.

@AmiStrn thanks for the input. Like msfroh mentioned, the decision to implement query insights in core was mostly related to build and release. There's no dependency on this plugin from core and it's a 2 way door. I definitely agree with you on the benefits of moving plugins to their own repo. Let me create a issue to move query insights out of core.

opensearch-project / OpenSearch

[RFC] Query insights framework #11429

Collectors

Processors