[RFC] High Level Vision for Core Search in OpenSearch

jainankitk commented 1 year ago

Introduction

This issue proposes high level vision for core search within Opensearch. While the initial common themes are identified from customer feedback for Amazon Opensearch Service, any comments about the overall direction are welcome for making it applicable to more diverse use cases

Background

Search component forms the core of applications based on Opensearch. The ability to search data efficiently is critical enough to influence the data storage format during indexing keeping search first. Compared to traditional search engines, Opensearch supports insightful aggregations in addition to filtering allowing customers to do much more and visualize their data better

Common Themes

Resiliency - The search resiliency is one of the key pain points for customers and can result in low availability due to single/few search queries
Performance - The performance aspect can be categorized into single query execution latency and overall throughput of cluster (ability to handle more load with less resources)
Visibility - Customers have limited to no visibility into query execution today. Root causing issues related to query latency are very difficult and time consuming
Control & Management - It is hard to manage resources for search alone as it is strongly coupled with indexing, merge, snapshot components. These competing resources can impact each other adversely limiting our ability to make better backpressure / admission control decisions
New Features - While continuing to improve the performance / resiliency of existing product, need to offer more features like Query Prioritization & Scheduling

Proposal

I am capturing below some of the projects that can be prioritized or are being worked upon in other RFCs. Feel free to link any other ideas / projects already in flight for Core Search

Resiliency - The cluster can go down even with single query today, which should not be allowed. Recovery from node failures in OpenSearch is an expensive process and can lead to cascading failure. Hence, it makes sense to cancel/kill aggressively and what not to ensure the node stays up. Search back pressure is good first step in direction of making it more robust. The key component is the lightweight resource tracking that can be leveraged to do:

Query Sandboxing - We have been discussing sandboxing for restricting the resource foot print in context of single query, but sandboxing makes more sense at bucket level. We can define per bucket thresholds configurable as a cluster setting to preemptively kill a query from the bucket for which threshold is hit. Every query belongs to the default bucket and can belong to one of the customer defined custom bucket. This will be particularly helpful for the multi-tenant/multi-user workloads where rogue queries from one customer can impact the other customer. The work is being tracked as part of RFC - https://github.com/opensearch-project/OpenSearch/issues/11061
Hard query cancellation - The current implementation of search backpressure relies on the acknowledgement of cancellation signal by problematic query. However, in many cases that does not happen. Hence, we need better query cancellation mechanism. The plan is to first track how many such cases are there using search backpressure as reference and track the common code paths where query cancellation gets stuck. The work is being tracked as part of RFC - https://github.com/opensearch-project/OpenSearch/issues/6953
Query Admission control - We have basic admission control in terms of search queue on every node which is at shard level and fixed size or at cluster level request when node is under duress. We need to improve the mechanism as single query can hit 10 / 100 / 1000 shards. The admission control mechanism needs to be adaptive accounting for query cost

Performance - We need to look at ways of inherently improving the query performance. This becomes even more critical with the decoupling of storage and compute, remote storage performance is nowhere close to the hot counterpart

Disk based caching - Opensearch has multiple caches for query, fielddata readily available for quick access. But, these caches reside in JVM heap due to which the space is limited and starts eviction process. Instead, we can overflow the cache to disk which is really useful for timeseries workload as old data seldom changes. Work is being tracked as part of https://github.com/opensearch-project/OpenSearch/issues/9001
Query rewriting - The query performance can be significantly improved by providing query execution hints to prefer some data structures over the other. For example - fetching few fields from doc values instead of stored fields or _source is much more efficient by reading lesser data from disk and avoiding expensive data decompression. Or using map execution_mode instead of global ordinals. RFC - https://github.com/opensearch-project/OpenSearch/issues/12390
Concurrent segment search - Every shard can have multiple segments and they are searched linearly by default. Lucene recently added support for searching those segments concurrently. That should give significant boost for individual queries and work is being tracked as part of https://github.com/opensearch-project/OpenSearch/issues/6798

Visibility - Customers have very limited to no visibility into query execution today. Root causing issues related to query latencies are very difficult and time consuming

Coordinator slow logs - The slow logs is currently at the shard level which does not necessarily (mostly) translate to end customer request latency. It does not differentiate at all between single query hitting 10 shards vs 1000 shards. That makes it really hard for customers to come up with right thresholds for enabling and using the slow logs. Ideally, they should be able to directly correlate the thresholds with their SLAs and figure out the offenders. This feature along with query tracing can be very insightful for our customers. The work is being tracked as part of https://github.com/opensearch-project/OpenSearch/issues/7334
Query Tracing - One of the most common challenges for customers is to identify why only few queries ran with higher latencies. The aggregated node level metrics do not reveal much when a fraction of queries show higher latencies. Customer should be able to define varying degrees of tracing, to see if there are latency issues within specific clauses of query execution, and apply the optimizations at a query level
Latency breakdown metrics - The latency metrics exposed at node level today are just indexing and search. The increase in search latency hard gives any clue about what the culprit might be (prefilter, query, fetch and aggregation). It also fails to give any clue about the kind of workload it is. Hence, proposal it is to break down single monolithic took_time into individual time taken across multiple phases. The work is being tracked as part of https://github.com/opensearch-project/OpenSearch/issues/7334
Query profiling - The query profiling is powerful tool giving useful insights into execution plan of query, breakdown of took time across different clauses of the query. However, it only works well for query phase of execution and provides very limited insights into fetch and aggregation phase. Enhancing this framework allows users to get customer insights into the problematic query execution and associated costs. https://github.com/opensearch-project/OpenSearch/issues/1764
Query Planning - The planner provides peek into the query execution plan along with estimated costs. The cost can include both time and resources like CPU/Memory/IO. Hence, the preferred plan after semantic analysis of the query can be optimized for time or even CPU/Memory/IO

New Features

Query Prioritization - Different search queries have different priorities and implications. Search query triggered from customer facing interface is much more critical than the one triggered by an operator sitting at the backend. Currently the queries are processed FIFO, which is not ideal when the cluster is under load and queues are backing up. Opensearch has notion of priority for cluster state update tasks. We can use something similar at coordinator and node level to treat query as high/medium/low priority and process accordingly. Customer can set cluster default and override the default for specific set of queries. https://github.com/opensearch-project/OpenSearch/issues/1017
Query scheduling - Not all queries are same. Some queries are expensive but don’t need result immediately. Customers might want to schedule some of their expensive queries during low traffic time and look at the results later. We need to build some logic for scheduling the queries and probably can leverage mechanism similar to async search for saving the results

Next Steps We will incorporate the feedback from this RFC into more detailed proposal / high level design for different themes. We will then create meta issues to go in more depth for the detailed design

Bukhtawar commented 1 year ago

Thanks @jainankitk sounds exciting. Some items like Query prioritisation and scheduling #1017 are themselves pending prioritisation 😎

jainankitk commented 1 year ago

Thanks @Bukhtawar for sharing that issue. Linking the issue in overview.

Some items like Query prioritisation and scheduling are themselves pending prioritisation 😎

Created this issue for holistic view of ideas in search space (any existing or new issue) and prioritize accordingly

dblock commented 1 year ago

This is quite interesting. I think we may be missing Query Planning, which may change how we do, for example, sandboxing, prioritization and scheduling.

andrross commented 1 year ago

Thanks @jainankitk! What about cost as a theme? A common problem users run into is that it can be quite expensive to have a large dataset that is searchable even if it is not actively being searched at a high rate. I know this bleeds more into separating compute and storage but it might be worth a mention in this holistic view as there might be multiple ways to attack the problem.

sohami commented 1 year ago

Thanks for putting this together.

I think we may be missing Query Planning, which may change how we do, for example, sandboxing, prioritization and scheduling.

+1

We have been discussing sandboxing for restricting the resource foot print in context of single query, but sandboxing makes more sense at bucket level. We can define per bucket thresholds configurable as a cluster setting to preemptively kill a query from the bucket for which threshold is hit.

To achieve bucket level sandboxing I think we will still need to have query level sandboxing in place such that each query in the bucket plays well with the resources given to it. With reactive cancellation mechanism we will still run into similar problems as today where oversubscription of the resources happen in the concurrent environment and triggering cancellation may be too late. One can think of the CircuitBreakers present today as a catch all bucket

kiranprakash154 commented 1 year ago

I created an RFC for Disk based caching - https://github.com/opensearch-project/OpenSearch/issues/9001

jainankitk commented 1 year ago

I think we may be missing Query Planning, which may change how we do, for example, sandboxing, prioritization and scheduling.

Good catch! The planner, in addition to providing much needed visibility, helps us make important decisions across resiliency and performance

jainankitk commented 1 year ago

What about cost as a theme?

As you rightfully mentioned, it makes more sense for compute/storage separation.

there might be multiple ways to attack the problem.

Do you have any specific problem in mind from search perspective? Maybe, improving the ratio of searchable data per compute/memory unit?

jainankitk commented 1 year ago

To achieve bucket level sandboxing I think we will still need to have query level sandboxing in place such that each query in the bucket plays well with the resources given to it.

IMO bucket level sandboxing is more generic version allowing us not only tackle query level sandboxing (every query is unique bucket), it also allows for managing resources across multiple tenants on single cluster.

With reactive cancellation mechanism we will still run into similar problems as today where oversubscription of the resources happen in the concurrent environment and triggering cancellation may be too late. One can think of the CircuitBreakers present today as a catch all bucket

The reactive mechanism today has tricky decision making component. For query sandboxing, it is simpler as what to cancel is quite clear. If we can harden the cancellation mechanism, reactive should work pretty well. Although, there are interesting scenarios where you want to serialize the query execution instead of parallelization to stay within the resource constraint bounds. For example - instead of picking 2 shards in parallel on node, do them one after the other to stay within resource bounds.

sohami commented 1 year ago

To achieve bucket level sandboxing I think we will still need to have query level sandboxing in place such that each query in the bucket plays well with the resources given to it.

IMO bucket level sandboxing is more generic version allowing us not only tackle query level sandboxing (every query is unique bucket), it also allows for managing resources across multiple tenants on single cluster.

I am visualizing buckets as a group where multiple queries can execute. If you are considering 1 bucket per query than we are talking about same thing to enforce the sandboxing essentially at query level. What I was calling out, we need to have mechanism to sandbox per query within a bucket where a bucket is a group (which I think how buckets will be used ?) instead of 1:1 mapping with a query . Like Bucket_1 (IT dept), Bucket_2 (Security dept). I want to split the buckets such that I allow resource split of 40/60. So each bucket need to ensure that when multiple queries are running in each bucket they remain under the resource constraint of each bucket.

The reactive mechanism today has tricky decision making component. For query sandboxing, it is simpler as what to cancel is quite clear..

My thought was we will basically use similar mechanism but instead of node level it will be done at bucket level. But may be will wait to see the details on this to understand better :)

If we can harden the cancellation mechanism, reactive should work pretty well

Yes, but should we always fail the request as part of sandboxing ? Can be a first step towards it, but we should also think around how we can let the query run with constrained resource and complete may be taking more time but not necessarily failing it. Reactive is one approach other can be based on cost based planning.

jainankitk commented 1 year ago

I am visualizing buckets as a group where multiple queries can execute. If you are considering 1 bucket per query than we are talking about same thing to enforce the sandboxing essentially at query level. What I was calling out, we need to have mechanism to sandbox per query within a bucket where a bucket is a group (which I think how buckets will be used ?) instead of 1:1 mapping with a query . Like Bucket_1 (IT dept), Bucket_2 (Security dept). I want to split the buckets such that I allow resource split of 40/60. So each bucket need to ensure that when multiple queries are running in each bucket they remain under the resource constraint of each bucket.

Your understanding of the bucket is correct, I was just saying that bucketing is more generic and can be used for limiting individual queries as well. The bucket level decision is slightly simpler due to two reasons 1/ The value is configured by the customer, so we don't need to determine whether node is under duress 2/ Individual bucket should have more uniformity (due to which they should be bucketed together in the first place)

Yes, but should we always fail the request as part of sandboxing ? Can be a first step towards it, but we should also think around how we can let the query run with constrained resource and complete may be taking more time but not necessarily failing it. Reactive is one approach other can be based on cost based planning.

That's a great point and I also kind of talked about this in previous comment - There are interesting scenarios where you want to serialize the query execution instead of parallelization to stay within the resource constraint bounds. For example - instead of picking 2 shards in parallel on node, do them one after the other to stay within resource bounds. That being said, we will iterate in following manner 1/ Reactive cancellation 2/ Proactive cancellation based on historic latency/cancellation metrics and query cost estimation using planner 3/ Serialize across shards to not exceed resource bounds 4/ Limit the resources within individual shard search execution

ansjcy commented 9 months ago

Thanks @jainankitk for the high level vision! Wanted to add a few points on the visibility front. We are working on a set of Query Insights features to improve the visibility: https://github.com/opensearch-project/OpenSearch/issues/11522

We are working on query categorization, which will provide users OTel metrics on summaries of search workload types. Also as shown in the meta issue, we will also work on the Query Insights plugin with Top N Queries (based on resource usages) and in the future implement the Query Insights Dashboard (Admin Dashboard) to provide even more visibility!

msfroh commented 1 month ago

Can we rename this RFC? It's focused on visibility and resiliency.

We have a lot more feature-oriented RFCs that are on the roadmap for search, and the title suggests that this is the singular vision for search.

smacrakis commented 1 month ago

Agree with Froh. How about renaming it to Performance and Resiliency: Overview?

opensearch-project / OpenSearch