opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

Search Memory Tracking - track memory used during a shard search #1009

Open malpani opened 3 years ago

malpani commented 3 years ago

Is your feature request related to a problem? Please describe. There is limited visibility into how much memory is consumed by a query. In an ideal world, resource consumption details would be abstracted away from users and everything would auto-tune/auto-reject. But we are not there (yet!), and with every query treated equally, certain memory-heavy queries can end up tripping the memory breakers for all requests. It will be helpful to track and surface the memory consumed by a query. This visibility can help users tune their queries better.

Describe the solution you'd like The plan is to make this generic and expose these stats via the tasks framework. The tasks framework already tracks latency and has some context about the query/work being done. The idea is to enhance it to track additional stats for memory and CPU consumed per task. As tasks have a nice parent → child hierarchy, this mechanism will allow tracking the cluster-wide resource consumption of a query. So the plan is to update Task to track additional context + stats. When a task completes, its task info will be pushed to a sink; the sink can be logs or a system index to enable additional insights.
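The parent → child rollup described above can be sketched in plain Java. This is an illustrative model only — the class and field names below are hypothetical and not the actual OpenSearch task API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each completed task records its own memory usage plus
// its parent task id, so per-shard child-task usage can be rolled up into the
// root (coordinator) task for a cluster-wide view of a query's footprint.
public class TaskResourceStats {

    // Illustrative stand-in for per-task resource info (not the real TaskInfo).
    record TaskUsage(String taskId, String parentId, long memoryBytes, long cpuNanos) {}

    // Sums memory of every task under each root task by walking parent links.
    static Map<String, Long> rollUpMemoryByRoot(List<TaskUsage> completed) {
        Map<String, String> parent = new HashMap<>();
        for (TaskUsage t : completed) {
            parent.put(t.taskId(), t.parentId()); // null parentId marks a root task
        }
        Map<String, Long> byRoot = new HashMap<>();
        for (TaskUsage t : completed) {
            String root = t.taskId();
            while (parent.get(root) != null) {
                root = parent.get(root); // climb to the root of the hierarchy
            }
            byRoot.merge(root, t.memoryBytes(), Long::sum);
        }
        return byRoot;
    }

    public static void main(String[] args) {
        List<TaskUsage> done = List.of(
                new TaskUsage("search-1", null, 1_000, 5),
                new TaskUsage("shard-1a", "search-1", 4_000, 9),
                new TaskUsage("shard-1b", "search-1", 2_500, 7));
        System.out.println(rollUpMemoryByRoot(done)); // search-1 accumulates 7500
    }
}
```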

For search-side tracking of stats, the proposed solution is to leverage the single-threaded nature of searching within a shard. I plan to use [ThreadMXBean.getCurrentThreadAllocatedBytes](<https://docs.oracle.com/en/java/javase/14/docs/api/jdk.management/com/sun/management/ThreadMXBean.html#getCurrentThreadAllocatedBytes()>) for tracking the memory consumption and exposing this in 2 forms
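Because a shard search runs on a single thread, the measurement can be a simple before/after diff of that thread's allocated bytes. A minimal sketch (the wrapper class is illustrative, not the proposed OpenSearch change; it assumes a HotSpot JVM where `com.sun.management.ThreadMXBean` is available, JDK 14+ for `getCurrentThreadAllocatedBytes`):

```java
import java.lang.management.ManagementFactory;

// Sketch: bound a shard search's memory footprint by diffing the bytes
// allocated by the current (single search) thread before and after the work.
public class ShardSearchMemoryTracker {

    // HotSpot's ThreadMXBean implements the com.sun.management extension.
    private static final com.sun.management.ThreadMXBean THREAD_MX_BEAN =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    /** Runs the search body on the current thread; returns bytes it allocated. */
    public static long measureAllocatedBytes(Runnable shardSearchBody) {
        long before = THREAD_MX_BEAN.getCurrentThreadAllocatedBytes();
        shardSearchBody.run();
        return THREAD_MX_BEAN.getCurrentThreadAllocatedBytes() - before;
    }

    public static void main(String[] args) {
        long[] witness = new long[1];
        long used = measureAllocatedBytes(() -> {
            byte[] scratch = new byte[1 << 20]; // stand-in for per-shard search work
            witness[0] = scratch.length;        // keep the allocation observable
        });
        System.out.println("allocated " + used + " bytes for a "
                + witness[0] + "-byte working set");
    }
}
```

Note the diff also counts the tracker's own small allocations, so it is an upper bound rather than an exact figure.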

Based on some initial Rally benchmarks on a POC, the overhead does not look high. Having said that, my plan is to gate this under a cluster setting `search.track_resources` that defaults to false (disabled)
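The gating could look roughly like the following setting registration. This is a non-runnable sketch against OpenSearch's `Setting` API; the exact property set chosen here (dynamic, node-scoped) is an assumption, not part of the proposal:

```java
// Sketch (assumes org.opensearch.common.settings.Setting):
// an opt-in cluster setting, disabled by default as proposed.
public static final Setting<Boolean> SEARCH_TRACK_RESOURCES =
        Setting.boolSetting(
                "search.track_resources",   // setting key from the proposal
                false,                      // defaults to disabled
                Setting.Property.NodeScope,
                Setting.Property.Dynamic);  // assumed: toggleable at runtime
```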

Describe alternatives you've considered

Planning

AmiStrn commented 3 years ago

How about having a way to stop/deprioritise memory-heavy queries, kind of like the way a query timeout works?

This is different from the observability issue, but it makes sense to prevent these really intensive queries to begin with. (In addition to tracking, not instead of it...)

Bukhtawar commented 3 years ago

Nice proposal. Maybe we need an extension for aggregation reduce phases on the coordinator as well (major contributors to memory), while also being cautious about deserialisation overhead.

@AmiStrn maybe we need special handling for query prioritization; for instance, async searches should have a different priority than a usual search #1017. Also, we might need to track/estimate memory prior to the allocation in order for a query to be terminated early. I guess both of the above can be tracked separately. Thoughts?

malpani commented 3 years ago

@AmiStrn Today a query execution can be stopped in scenarios like hitting the bucket limit or the parent breakers. There is value in adding some notion of a memory sandbox and preempting the query on hitting a 'per query memory limit' as the next phase, and eventually improving the memory estimation (prior to executing).

@Bukhtawar good point. This approach will not capture the reduce-phase overhead; I will explore that as a follow-up.

malpani commented 2 years ago

Finally got some time to explore this more; here are some thoughts:

  1. The utility of exposing a top N via a new search_stats section in the /_nodes/stats API (returning the N most expensive queries) is limited: it may not help answer questions like "What queries between October 4 and 5 were most expensive in terms of their memory footprint?", as the N most expensive queries might have run 60 days ago.
  2. Implementing this via the tasks framework provides a hook to track on parent task ids rather than restricting to isolated shard-level memory tracking (thanks @sohami for the idea). It also allows other actions (not just search, if they choose to) to track memory usage. The existing tasks API already tracks latency, and adding memory consumption could be useful.
  3. On completion of a task, the task info, which will include memory used (for search tasks), can be dumped into a sink. The sink could be configurable: a simple log file or a system index for further analysis.
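The configurable-sink idea in point 3 could be sketched as a small interface. All names here are hypothetical, not the eventual OpenSearch API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: on task completion, the task framework hands the
// finished task's resource info to whichever sink is configured.
public class TaskStatsSinkDemo {

    interface TaskStatsSink {
        void onTaskCompleted(String taskId, long memoryBytes);
    }

    // Log-style sink: emits one line per completed task.
    static class LogSink implements TaskStatsSink {
        public void onTaskCompleted(String taskId, long memoryBytes) {
            System.out.println("task=" + taskId + " memory_bytes=" + memoryBytes);
        }
    }

    // Stand-in for a system-index sink: buffers JSON records for later analysis.
    static class IndexSink implements TaskStatsSink {
        final List<String> indexed = new ArrayList<>();
        public void onTaskCompleted(String taskId, long memoryBytes) {
            indexed.add("{\"task\":\"" + taskId + "\",\"memory_bytes\":" + memoryBytes + "}");
        }
    }

    public static void main(String[] args) {
        IndexSink sink = new IndexSink();
        sink.onTaskCompleted("search-task-1", 7_500);
        System.out.println(sink.indexed.get(0));
    }
}
```

Swapping the sink is then a configuration decision rather than a code change, which matches the "log file or system index" framing above.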