[RFC] Configurable Staleness for Search queries

Current Behavior:

Currently, when a document in a shard is updated and a refresh occurs, all request cache entries for that shard are invalidated. This ensures that a user never receives a stale response, whether the request is served from the cache or goes directly to the shards.

Challenge:

Even with Tiered Caching, where we can cache more heavily, cache lifetime is still bounded by the refresh interval of the underlying index. If some requests can tolerate a certain level of staleness, there is currently no way to serve them from the cache across refreshes without creating a separate index, which adds cost and complexity.

Use Case:

I am an OpenSearch admin in a cloud SaaS enterprise that manages customer support data. I provide OpenSearch as a service to my engineering teams. There is an index called "Tickets" that contains all the relevant data related to customer support tickets.

All the data in my OpenSearch cluster is multi-tenant, with an account_id field in every document that is used to filter queries.

There are two services making search queries to this index:

The "Views Search" service needs the most current data possible.

The "Dashboarding" service is okay with data that is up to 1 minute stale, as per their SLA, but they want the searches to be very fast.

Proposed Solution:

What if we expose a parameter called stale= (e.g., stale=30s) that a customer can include in their search query? This would signal to us that the data can be cached for the specified time period, regardless of any refreshes or document updates that may occur.
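As a rough illustration of how a value such as stale=30s might be interpreted, here is a minimal Python sketch. The function name parse_stale_ms and the accepted unit set are assumptions made for this example; the RFC does not specify how the value would be parsed, and this is not existing OpenSearch code.

```python
import re

# Hypothetical unit table (milliseconds per unit). The suffixes mirror the
# common time-value style (e.g. "30s", "1m"); the exact set is an assumption.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}

_STALE_PATTERN = re.compile(r"^(\d+)(ms|s|m|h|d)$")


def parse_stale_ms(value: str) -> int:
    """Convert a staleness value such as '30s' into milliseconds."""
    match = _STALE_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"invalid staleness value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_MS[unit]
```

Under this sketch, the Dashboarding service from the use case would pass stale=1m, while the Views Search service would simply omit the parameter and keep today's behavior.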

On each refresh, we would check whether each cached query has exceeded its specified staleness threshold. If so, we would delete the entry from the cache and recompute the query, caching the new results for the next staleness period.
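The refresh-time check described above could look roughly like the following Python sketch. All names here (StaleTolerantCache, on_refresh, the injectable clock) are hypothetical and chosen for illustration; the real request cache is implemented differently. One simplification to note: this sketch evicts expired entries on refresh and recomputes them lazily on the next request, rather than recomputing eagerly.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class _Entry:
    value: Any
    cached_at: float       # clock reading when the result was cached
    stale_seconds: float   # staleness the caller said it can tolerate


class StaleTolerantCache:
    """Toy request cache whose entries survive refreshes until they are too old."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, _Entry] = {}

    def get(self, key: str, compute: Callable[[], Any],
            stale_seconds: float = 0.0) -> Any:
        entry = self._entries.get(key)
        if entry is not None:
            return entry.value  # cache hit, possibly stale within the allowed window
        value = compute()
        self._entries[key] = _Entry(value, self._clock(), stale_seconds)
        return value

    def on_refresh(self) -> int:
        """Evict entries older than their staleness threshold; return the count."""
        now = self._clock()
        expired = [key for key, entry in self._entries.items()
                   if now - entry.cached_at >= entry.stale_seconds]
        for key in expired:
            del self._entries[key]
        return len(expired)
```

With stale_seconds left at 0, every refresh evicts the entry, which matches the current behavior. An entry cached with stale_seconds=30 survives any refresh that happens within 30 seconds of caching and is dropped by the first refresh after that.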

Benefits:

Cost Savings: Users can avoid creating multiple indices or setting up a separate data store (like Redis) for caching.

Improved Performance: More cache hits would lead to reduced compute requirements.

Flexibility: Customers can choose the appropriate staleness level for their use case, effectively prioritizing their queries.

Increased Product Stickiness: This solution would increase the reliance on OpenSearch, making it a more integral part of the overall system.

Related component:

Search:Performance

opensearch-project / OpenSearch