kaushalmahi12 opened this issue 12 months ago
Hi @kaushalmahi12, thanks for sharing the detailed design proposal. A few initial comments/queries/doubts:
we want negligible overhead on retrieving the sandboxes since this code logic will come directly on search path
This is a bit confusing. So far, what I understood is that we will use the sandboxes for resource tracking (a periodic and async task, more like a separate executor service) to calculate resource consumption and cancel tasks from the sandboxes, based on priority, where limits are being breached. But when you say it will come directly on the search path, do you mean that on every query request we will check whether the sandbox this query lands on has enough resources to process it, and cancel it otherwise?
Will pollute the cluster state
Also, if we only need the sandbox info in an async and periodic task, can we avoid using cluster state and store it in a system index instead? To avoid replication-lag issues, we can always route requests to the primary shards.
Thanks Harish for putting up the questions!
We can add some use cases for this feature to understand the motivation better.
I think the RFC (https://github.com/opensearch-project/OpenSearch/issues/11061) covers the use cases, as this proposal is about how we are planning to achieve the feature.
Do we need to make sure that no two sandboxes have the same priority?
Yes, at sandbox creation time we will add a validation check.
How can we determine a good active sandbox count for a cluster?
To answer this question better, we will need to run OpenSearch benchmarks with this feature to arrive at the right number.
I can set it to (90%, 50%) with priority 1? In case of high CPU issues, will we cancel tasks running on s1?
Yes
This is a bit confusing. So far, what I understood is that we will use the sandboxes for resource tracking (a periodic and async task, more like a separate executor service) to calculate resource consumption and cancel tasks from the sandboxes, based on priority, where limits are being breached. But when you say it will come directly on the search path, do you mean that on every query request we will check whether the sandbox this query lands on has enough resources to process it, and cancel it otherwise?
There are two parts to this
The separate periodic resource tracking thread comes into the picture only for the second part. For the first part: when a request lands on a node, how do we determine which sandbox it falls into? To determine that we need access to the Sandbox objects, and the classification of an in-flight request into a sandbox happens on the hot path.
Thanks @kaushalmahi12! okay, if we need to map the incoming requests to sandboxes available in the cluster. then it makes sense to keep the sandbox info in cluster state itself.
@kaushalmahi12 - Thank you for documenting this proposal. Few comments below:
@jainankitk Thanks for going through this and providing the suggestions here. The regex will always be prefix-based and length-limited.
I like your idea about low and high limit per sandbox to provide some room for the running queries to complete. This will help us maximise efficacy of the system resources. I will capture these details here and in the LLD.
Sorry, it's so far unclear to me how sandbox accounting is done differently than task execution today. Curious how we are planning to terminate a wildcard/regex query that is running complex automaton state transitions at the Lucene layer even after the query has timed out. This, to me, has been one of the major gaps in OpenSearch being able to effectively cancel high resource-consuming tasks.
Just one suggestion: instead of calling this sandbox, should we refer to it as query_units or query_groups as applicable? It's easy to confuse sandbox with something experimental.
Hi @Bukhtawar, the sandbox accounting is done by accumulating the system resource usage per task, for all the tasks running in association with the sandbox. I agree that even though the tasks are cancelled at the OpenSearch level they might still be running in Lucene, which is something we are targeting as part of hard cancellation of such tasks (a separate project).
I also agree that Sandbox in a computing context does mean something experimental; I am open to suggestions on naming the construct which can track and enforce the resource limits, e.g., ResourceGovernor, ResourceRegulator.
Thanks @Bukhtawar for reviewing and providing your feedback.
Sorry, it's so far unclear to me how sandbox accounting is done differently than task execution today. Curious how we are planning to terminate a wildcard/regex query that is running complex automaton state transitions at the Lucene layer even after the query has timed out. This, to me, has been one of the major gaps in OpenSearch being able to effectively cancel high resource-consuming tasks.
The primary objective of query sandboxing is to expose the right interfaces for creation, accounting, and mapping of resource usage into different sandboxes. Although inefficient, it is also possible to create a sandbox per query, which is different from the single sandbox (root-level sandbox) being tracked today. The only caveat is that the tracking is node level instead of cluster level, which also makes more sense to me, since 5% utilization at the cluster level could mean 100% utilization on a single node. Individual nodes are the atomic units of availability that we need to prevent from going down.
The problem of not acknowledging the cancellation signal is separate. The first phase of tracking various types of non-cancellable queries is already in, and will be used to extend the soft cancellation framework. More details in this GitHub issue: https://github.com/opensearch-project/OpenSearch/issues/6953
Thank you for the proposal @kaushalmahi12
I'd like to propose an alternative approach to handling requests that exceed the low watermark threshold. Instead of immediately dropping these requests, have we explored the possibility of implementing a queuing system?
Additionally, I've been thinking about our current method of assigning priorities using integers. Given the potential complexity when dealing with a large number of sandboxes, this system might become challenging for customers to manage, especially when it comes to reprioritizing. To address this, I suggest we consider alternative methods for assigning priorities. Have we looked into options like tag-based prioritization or assigning priorities based on user groups? These methods could offer a more streamlined and intuitive way to manage sandbox priorities, enhancing user experience and operational efficiency.
Thanks for going through this and providing your feedback @kiranprakash154! Yes, we have considered the async completion of cancellable tasks in the future using mechanisms like spill-to-disk.
Since the sandboxes are already prioritised based on user groups, that already disambiguates priority conflicts. There will never be 100s or 1000s of sandboxes in the system, as we will limit the number of active sandboxes using a setting. Even from a user's point of view it doesn't make sense to monitor 1000s of sandboxes for anomalies. If we want a certain user group to have higher/lower priority we can always create a new sandbox.
@kaushalmahi12 -- I didn't see anything about how to "bind" requests to a given sandbox. One possible option you may want to consider is @peternied's proposal to support views: https://github.com/opensearch-project/OpenSearch/issues/6181
It feels like associating a sandbox with a view would be a great way to restrict specific users/roles to that sandbox.
@msfroh Thanks for going through it. We will bind the requests to these sandboxes based on request attributes. A request will bind to a sandbox only if all the sandbox attributes are a subset of the request attributes.
For example, let's say we have { username: ["dev*", "oncall*"], indices: ["logs*", "payments*"] } and the incoming request is from dev_froh and for index "logs_q1_24"; then this request will match the sandbox and will be tracked against it.
@kaushalmahi12 -- Have you reviewed this approach with experts on the security plugin? We should make sure that whatever approach we take is aligned with their plans going forward.
I think conceptually "request attributes" might deserve their own conversation: how they are extracted, how performant that is, and how they are made accessible to OpenSearch and its plugins. What do you think about starting up another RFC/design around this topic?
As a maintainer of security, when considering user identity: in your example you use 'username', but identities aren't expressed consistently across systems, some have UUIDs, some email addresses, some distinguished names. With the security plugin we have a couple of concepts that we advocate using: backend roles (from the IdP) and roles (assigned by the Security Plugin via role_mappings). However, these are hard to validate; if you don't have access to the credentials of a user with specific roles, how can you be sure it worked correctly?
As for why request attributes are of interest to me beyond sandboxing: we have many different 'resources' in OpenSearch that frankly should have a consistent permission model, but we don't have a way to reason about which set of APIs is tied to a logical resource. If we were trying to capture and communicate aspects of a request, this could be reusable in many situations. I've done some limited prototyping and thinking in this issue [1].
Thanks for the callout @msfroh. I'd much rather centralize management of the search experience in singular places that are easy for admins to inspect and easy for that admin to validate, such as with views [2], which is in review.
RFC Link: https://github.com/opensearch-project/OpenSearch/issues/11061
High Level Approach
Tracked System Resources per Sandbox
We are going to track the following resources per sandbox.
These resources will be tracked per task and then accumulated per sandbox
Approaches to calculate the resources and measurement units
CPU
Measurement in terms of percentage: We will measure CPU usage as a percentage, and the CPU usage percentage can be calculated as follows.
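As a rough sketch (assuming per-thread CPU time is read via ThreadMXBean and that both the CPU time and the elapsed time are in nanoseconds; the class and method names here are illustrative, not the actual implementation):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch only: per-thread CPU usage, relative to a single core, over the task's
// elapsed wall-clock time. Both inputs are in nanoseconds.
public final class TaskCpuUsage {

    private static final ThreadMXBean THREAD_MX_BEAN = ManagementFactory.getThreadMXBean();

    // CPU time consumed so far by the calling task thread
    public static long currentThreadCpuTimeNanos() {
        return THREAD_MX_BEAN.getCurrentThreadCpuTime();
    }

    // (1) cpu usage per core
    public static float cpuUsagePerCore(long cpuTime, long elapsedTime) {
        return cpuTime * 1.0f / elapsedTime;
    }
}
```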
The above code snippet can only tell us a thread's CPU usage relative to a single core, as a thread maps to a core, while all the tasks associated with a sandbox can consume up to all the available cores. Hence, to calculate absolute CPU utilisation we will need to tweak the calculation logic from
cpuTime * 1.0f / elapsedTime; ------- (1) cpu usage per core
to
cpuTime * 1.0f / (elapsedTime * Runtime.getRuntime().getAvailableProcessors()) -------- (2) absolute CPU Usage on a machine
This way, if we want, we can also easily map this to a vCPU measurement by accumulating the CPU usage of all tasks within a sandbox and dividing by 100.
JVM Usage (Memory)
It is not possible to track the exact heap usage of a thread, as threads share the JVM memory within a process; all we can get is the number of bytes allocated by a thread, using the com.sun.management.ThreadMXBean.getThreadAllocatedBytes method. We will use percent as the measurement unit for the limits. There could be cases where indexing + search traffic shoots beyond 100, but for that we are assuming admission control will help us.
We can sum up the allocated bytes per task for the sandbox and calculate the percentage.
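A minimal sketch of that accumulation, assuming the HotSpot-specific com.sun.management.ThreadMXBean is available (the class and method names here are illustrative, not the actual implementation):

```java
import java.lang.management.ManagementFactory;

// Sketch only: total bytes allocated by a sandbox's task threads, expressed as a
// percentage of the configured max heap. Assumes the JVM exposes
// com.sun.management.ThreadMXBean (true on HotSpot-based JDKs).
public final class SandboxHeapUsage {

    private static final com.sun.management.ThreadMXBean THREAD_MX_BEAN =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    public static double heapUsagePercent(long[] sandboxTaskThreadIds) {
        long allocatedBytes = 0;
        for (long threadId : sandboxTaskThreadIds) {
            allocatedBytes += THREAD_MX_BEAN.getThreadAllocatedBytes(threadId);
        }
        return allocatedBytes * 100.0 / Runtime.getRuntime().maxMemory();
    }
}
```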
IO Throughput
We are not going ahead with IO as a resource because the only way to get thread-level IO stats is via /proc/OS_PID/tasks/taskId. Since this is a pseudo file system, the information is generated on the fly by the kernel from kernel data structures. If the tasks number in the thousands or hundreds of thousands, reading these files can impose a significant performance overhead (the Performance Analyzer plugin has already faced this issue).
Determining the request priority and mapping it to a Sandbox
It is crucial to have the priority information in the request itself to map the tasks to the corresponding sandbox. To make the priority concept clear: priority here helps with cancellation, not with scheduling. We are not scheduling tasks into different sandboxes; rather we associate tasks with them, and when contention for resources arises we pick the lowest-priority sandbox to cancel from.
There are two ways to extract the priority information for each request
Explicit priority will be honored in case of conflicts where both index and explicit priorities are present.
Attributes used for selecting sandboxes for a request
The index name is already part of the search request, and the security layer propagates user/auth related information via thread contexts. We can use this information to map requests correctly to sandboxes.
Sandbox selection approach
Sandbox Topology
With the preferred resource consumption enforcement model we will start with a tree-like topology with a depth of 1 initially, although we will keep the structures flexible enough to support a multilevel tree as well. This will help in simplifying the implementation of the resource usage enforcement.
Query Association
Since we want to limit the number of sandboxes in the system, it is natural to have the attributes as regexes. Because multiple regexes can match a string, a single query can resolve to multiple sandboxes. That is fine as long as we merge the sandboxes properly across all attributes, which we can address using a conjunction operator.
Sandbox Schema
Before we start talking about the approach for resolving sandboxes for a request, we first want to make some key assumptions which will help us understand the approach.
Now, given that we have multiple attributes to select a sandbox from, a single request can resolve to multiple sandboxes against each attribute. This leaves us with multiple lists of sandboxes, and since regexes are involved there could be plenty of matches; to narrow this down we take the intersection of these lists. If the intersection is empty, the request will be associated with a catch-all sandbox, which is nothing but the root-level sandbox.
We can explain this with the following pseudo code
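A rough Java-flavored sketch of that intersection (Sandbox, Sandbox.ROOT and matches() are illustrative names, not the actual implementation):

```java
// Pseudocode sketch: resolve a request into sandboxes by intersecting per-attribute matches.
Set<Sandbox> resolveSandboxes(Map<String, String> requestAttributes, List<Sandbox> allSandboxes) {
    Set<Sandbox> resolved = null;
    for (Map.Entry<String, String> attribute : requestAttributes.entrySet()) {
        // sandboxes whose pattern for this attribute matches the request's value
        Set<Sandbox> matchesForAttribute = new HashSet<>();
        for (Sandbox sandbox : allSandboxes) {
            if (sandbox.matches(attribute.getKey(), attribute.getValue())) {
                matchesForAttribute.add(sandbox);
            }
        }
        // conjunction: intersect the per-attribute lists
        if (resolved == null) {
            resolved = matchesForAttribute;
        } else {
            resolved.retainAll(matchesForAttribute);
        }
    }
    // empty intersection -> fall back to the catch-all (root level) sandbox
    return (resolved == null || resolved.isEmpty()) ? Set.of(Sandbox.ROOT) : resolved;
}
```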
And(Conjunction) vs Or(Disjunction) for sandbox selection
Before coming to a conclusion, let's consider what information will be available at the request level: a single user and possibly multiple indices. The conjunction operator by nature filters more aggressively than disjunction. With conjunction a single query will be associated with fewer sandboxes, while with disjunction it could be associated with many more. It makes more sense to employ a conjunction-based filtering mechanism here, as disjunction would make the query more susceptible to cancellation since it would be running under multiple sandboxes.
Approaches to enforcing Limits on Resource Consumption
There are basically two ways to enforce the resource consumption limits. The first focuses on allocating or reserving a fixed amount of resources for a sandbox, while the second can be made flexible to make optimum use of the available resources.
Sandbox Lifecycle APIs
We want to enable cluster admin users to create/update/delete sandboxes at runtime. Since this new entity can be created on demand, it becomes crucial to limit the number of active sandboxes for performance reasons. We can easily enforce that limit using a cluster-level setting, something like
node.active_sandbox_count
Sample endpoints
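For illustration only (the exact paths and payloads are hypothetical, not a finalized API): a PUT _sandbox call to create a sandbox with its attributes, resource limits and priority; GET _sandbox/{name} to fetch one; PUT _sandbox/{name} to update it; and DELETE _sandbox/{name} to remove it.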
Sandbox storage options
Since this feature will work as a core resiliency feature, it is important to make the sandboxes durable and fault tolerant. The next obvious question is how and where we store these sandboxes. They will be very limited in number, will not take more than a few KB per node overall, and we want negligible overhead on retrieving them since this logic sits directly on the search path. We have the following choices for storage:
System level index - System indices are used for storing system-wide information, but this will not suit our use case because the same data might not be visible on all nodes at the same time for multiple reasons, e.g., primary-replica lag or slower recovery of shards when the cluster is under duress.
Cluster state metadata - This is a more suitable approach for our use case, as it reaches the other nodes in the cluster very quickly (30s to commit a new state, per cluster.publish.timeout). Having chosen this as our option to store the sandboxes, all the APIs will execute on cluster manager nodes, since the cluster manager maintains the cluster state. The cluster state is partially durable in that only the metadata inside it is persisted; this metadata survives B/G deployments or even complete failures of a cluster.
There is one more issue left undiscussed: how to safeguard the sandbox count in the face of concurrent create requests. This can easily be taken care of using a compare-and-swap (CAS) mechanism over a counter that reflects the currently created plus in-flight sandbox count.
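A minimal in-memory sketch of that check (illustrative only; the real guard would be applied in the cluster-state update path, and maxCount would come from the cluster setting):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: guard the sandbox count against concurrent create requests with compare-and-swap.
public final class SandboxCountGuard {

    private final AtomicInteger createdPlusInFlight = new AtomicInteger(0);
    private final int maxCount;

    public SandboxCountGuard(int maxCount) {
        this.maxCount = maxCount;
    }

    public boolean tryReserveSlot() {
        while (true) {
            int current = createdPlusInFlight.get();
            if (current >= maxCount) {
                return false; // reject: the active sandbox limit is already reached
            }
            if (createdPlusInFlight.compareAndSet(current, current + 1)) {
                return true;  // slot reserved for the new sandbox
            }
        }
    }

    public void releaseSlot() {
        createdPlusInFlight.decrementAndGet(); // e.g. when a create request fails or a sandbox is deleted
    }
}
```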
Resource Tracking
We want to keep the sandboxes in a tree-like topology to simplify the individual and cumulative resource consumption calculation. Since the overall number of tasks we will track is not going to differ from what SearchBackpressure already tracks, this should not introduce any additional performance overhead.
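A minimal sketch of how such a tree node could roll up usage (the class and method names here are illustrative, not the actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a sandbox node whose usage is its own tasks' usage plus that of its
// children, so the root of the tree yields the cumulative node-level consumption.
class SandboxNode {

    // illustrative stand-in for the per-task usage provided by the task resource tracking framework
    interface Task {
        double cpuUsage();
    }

    private final List<SandboxNode> children = new ArrayList<>();
    private final List<Task> trackedTasks = new ArrayList<>();

    double cpuUsage() {
        double usage = 0;
        for (Task task : trackedTasks) {
            usage += task.cpuUsage();   // usage of tasks directly associated with this sandbox
        }
        for (SandboxNode child : children) {
            usage += child.cpuUsage();  // cumulative usage of the subtree
        }
        return usage;
    }
}
```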
This resource consumption calculation can use either of the following
Cancellation Mechanisms
There are two scenarios in which a request does not complete.
The whole purpose of this exercise is to be able to shed some load from the sandboxes when the system is under duress or a sandbox is trying to oversubscribe system resources. Here we can leverage the priority to cancel tasks. This raises the problem of over-cancellation from a low-priority sandbox. But isn't this what we want? To some extent, yes, but suppose a Kibana/Dashboards user is firing these queries and getting throttled on every one of them.
To avoid over-cancellation from a low-priority sandbox, we can temporarily boost the sandbox's priority and relax it back over time.
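One possible shape of that temporary boost, purely as an illustration (the boost amount and decay policy are open questions, and the names below are hypothetical):

```java
// Sketch only: temporarily boost a low-priority sandbox's priority after it has been
// cancelled from, and relax the boost back once the window expires.
class EffectivePriority {

    private final int basePriority;
    private int boost;
    private long boostExpiryMillis;

    EffectivePriority(int basePriority) {
        this.basePriority = basePriority;
    }

    void boostFor(int amount, long durationMillis) {
        this.boost = amount;
        this.boostExpiryMillis = System.currentTimeMillis() + durationMillis;
    }

    int current() {
        // boosted while within the window, back to the base priority afterwards
        return System.currentTimeMillis() < boostExpiryMillis ? basePriority + boost : basePriority;
    }
}
```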
Though priority helps us pick a sandbox, what about selecting queries from the picked sandbox? We face the following challenges there
Future Work
Although this feature is great for enforcing the assigned limits, it is reactive in nature. It will only kick in when there is already some fire in the system, and will try its best to bring the system back to normal (at the mercy of task cancellation, which does not abort running tasks immediately). We would ideally like to have some sort of proactive mechanism in place to cancel most rogue search queries upfront, which is only possible if we have a robust Search Query Cost Estimation framework.
Query Cost Estimation
Query cost depends on multiple factors apart from the actual query, such as current CPU load, current IO queue saturation, data distribution, etc. Collecting thread-level metrics (e.g., network IO, disk IO, and others) incurs a significant performance overhead when there are a lot of threads in the system. We could have a separate sidecar component collect the required metrics for offline processing and improve the cost estimation over time.
This needs a deeper dive of its own, hence we are leaving it for the future as of now.
QSB Milestones
We are planning to achieve this feature in 3 milestones.
Milestone 1 - First week of Jan
Milestone 2 - Last week of Jan
Milestone 3 - Last week of Feb