
[RFC] High level Vision for Shard Management #10554

Open himshikha opened 1 year ago

himshikha commented 1 year ago

Problem Statement

Shard management in OpenSearch is currently sub-optimal in terms of allocation and overall shard count/size handling. Shard allocation and rebalancing currently try to balance the number of shards on each node (in addition to other deciders), but do not take into account shard sizes, hot shards, or node performance metrics like disk and CPU usage. This often leads to shard skew, with some nodes getting overloaded by large shards or by high request rates.

Shard sizing and replica counts are also an issue: users often exceed the optimal shard size/count limits, which degrades cluster performance and slows down cluster updates. Not having enough replicas can hurt search performance under heavy workloads.

Additionally, since all allocation logic resides in core right now, it is difficult for newcomers to understand where to begin contributing to a particular logical piece. The logic is also getting tangled up with cluster manager code, making it difficult to work on either component and creating unnecessary dependencies in the codebase.

Goal

We want to optimize shard placement and rebalancing to improve cluster performance, and to time shard migrations so that they have negligible impact on the cluster when they happen. The end goal is for OpenSearch to auto-manage most aspects of shard management, while providing pluggable solutions that users can customize to their requirements. We want shard allocation to take performance metrics at the node, shard, and index level into account and produce the most optimal placement for each shard; this would solve shard skew and hot node/hot shard issues and improve cluster performance. For shard sizes, we can provide pluggable solutions where users don't have to care about these aspects and OpenSearch auto-manages shard sizes and replicas. We can also provide adaptive solutions for faster shard relocations/recoveries, which will help with cluster updates and shard rebalancing.


Proposal

We are proposing pluggable components for overall shard management in OpenSearch. There can be multiple components, depending on how far we want to break it down. We will extract the shard allocation logic (deciders, balancers) into a separate component, making it easier to extend and to add new allocation deciders as needed. Deciders can be further broken down by the resources they decide on.

Allocation logic will be improved by adding new deciders which use shard/node heat, shard sizes, disk throughput, CPU, and other metrics to reach an optimal allocation decision. New metrics can be onboarded to the existing MetricsCollector as required.
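To make the decider idea concrete, here is a minimal sketch of a heat-aware decider written against the existing AllocationDecider extension point. HeatMapService, the heat scale (0.0 to 1.0), and the threshold are invented for illustration; none of these names are committed APIs.

```java
import org.opensearch.cluster.routing.RoutingNode;
import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.cluster.routing.allocation.RoutingAllocation;
import org.opensearch.cluster.routing.allocation.decider.AllocationDecider;
import org.opensearch.cluster.routing.allocation.decider.Decision;

/**
 * Illustrative decider that vetoes allocation to nodes whose heat score
 * exceeds a threshold. HeatMapService is a placeholder for whatever the
 * proposed HeatMapManager eventually exposes.
 */
public class HeatAwareAllocationDecider extends AllocationDecider {
    private static final String NAME = "heat_aware";

    private final HeatMapService heatMap; // hypothetical heat source
    private final double maxNodeHeat;     // hypothetical threshold in [0, 1]

    public HeatAwareAllocationDecider(HeatMapService heatMap, double maxNodeHeat) {
        this.heatMap = heatMap;
        this.maxNodeHeat = maxNodeHeat;
    }

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        double nodeHeat = heatMap.nodeHeat(node.nodeId());
        if (nodeHeat > maxNodeHeat) {
            return allocation.decision(Decision.NO, NAME,
                "node heat [%s] exceeds threshold [%s]", nodeHeat, maxNodeHeat);
        }
        return allocation.decision(Decision.YES, NAME,
            "node heat [%s] is within threshold [%s]", nodeHeat, maxNodeHeat);
    }

    /** Placeholder interface; the RFC deliberately leaves this component's API open. */
    public interface HeatMapService {
        double nodeHeat(String nodeId);
    }
}
```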

Moving these pieces out of core as separate components also contributes to the broader effort to modularize OpenSearch. This will make the shard allocation logic easier to maintain and allow easy onboarding and contributions from developers, with limited feature scope.

ShardManager

This component will be responsible for the overall management of shards, including shard allocation and rebalancing decisions. It will host the allocation deciders and balancers as a separate component (these can be removed from core after this change). Allocation will be enhanced with new deciders, which use different resource-utilization metrics as decision criteria to give an optimal allocation. Since this will be a pluggable component, users can add their own deciders as their requirements dictate.

This can be extended to improve cluster performance by auto-tuning shard sizes and replica counts based on indexing and search workloads. Eventually, users will only need to be aware of indexes, and everything else gets handled by OpenSearch automatically. Shard relocation can be improved by auto-tuning recovery settings using cluster metrics, which will help speed up cluster updates that require shard migrations.
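As one hedged example of what "auto-tuning recovery settings" could mean in practice, the sketch below adjusts the existing indices.recovery.max_bytes_per_sec setting based on an observed disk-utilization fraction. The tuner class, its input, and the throttle bands are made up for illustration.

```java
import org.opensearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.opensearch.client.Client;
import org.opensearch.common.settings.Settings;

/**
 * Illustrative auto-tuner: allows faster recoveries when disk headroom is
 * plentiful and backs off under pressure. The utilization input and the
 * chosen bands are placeholders, not part of the proposal.
 */
public class RecoveryThrottleTuner {
    private final Client client;

    public RecoveryThrottleTuner(Client client) {
        this.client = client;
    }

    public void tune(double diskUtilization) { // fraction in [0, 1]
        String maxBytesPerSec = diskUtilization < 0.5 ? "200mb" : "40mb";
        ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest()
            .transientSettings(Settings.builder()
                .put("indices.recovery.max_bytes_per_sec", maxBytesPerSec)
                .build());
        client.admin().cluster().updateSettings(request).actionGet();
    }
}
```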

These functionalities can be part of the ShardManager component or can be split into multiple modules for allocation, auto-tuning, etc.

HeatMapManager

This will be a new component responsible for creating a HeatMap at the node, shard, and index level using the metrics available from the metrics reader (this could be the PA reader). The MetricsCollector (this can be the PA collector) will be enhanced to emit new metrics at the shard and node levels if they don't already exist. It will also be aware of the cluster metadata required to map shards to nodes and indexes, so it can aggregate metrics at the required levels. HeatMap generation will use shard-level metrics like shard count, shard size, request latency, rejections, and disk throughput used, and node-level metrics like CPU, JVM, disk space, overall disk throughput, and network bandwidth. It will also be capable of maintaining historical heat data for better decision making.

The ShardManager will then use this heat data to include shard/node heat as a criterion in allocation and rebalancing decisions.
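The RFC leaves the heat-score model open. Purely for illustration, one naive formulation is a weighted sum of normalized metrics; the metric keys and weights below are invented, and a real implementation would presumably be pluggable as described above.

```java
import java.util.Map;

/**
 * Naive heat score: a weighted sum of normalized (0.0-1.0) shard metrics.
 * Metric keys and weights are illustrative only.
 */
public final class ShardHeatScore {
    private static final Map<String, Double> WEIGHTS = Map.of(
        "cpu_fraction", 0.35,
        "disk_throughput_fraction", 0.25,
        "search_latency_fraction", 0.20,
        "rejection_rate_fraction", 0.20
    );

    /** Returns a heat score in [0, 1] given normalized metric readings. */
    public static double score(Map<String, Double> normalizedMetrics) {
        double heat = 0.0;
        for (Map.Entry<String, Double> weight : WEIGHTS.entrySet()) {
            heat += weight.getValue() * normalizedMetrics.getOrDefault(weight.getKey(), 0.0);
        }
        return Math.min(1.0, heat);
    }
}
```

Historical heat could then be folded in, for example as an exponentially weighted moving average over successive scores.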


Phases

We will take a phased approach to accomplish this. For each phase we will publish more detailed RFC/design docs.

Phase 1: ShardManager and HeatMapManager as pluggable components with a default implementation, and a new decider that consumes the heatmap in allocation decisions.

Phase 2: ShardManager extension with all deciders/balancers moved out of core. Additional features will be delivered in subsequent phases.

Next Steps

The next step will be to come up with a high-level design based on this vision and the feedback on this proposal. We will then create meta issues to explore the components involved in more depth and continue the detailed design.

gashutos commented 1 year ago

Great vision! A few inputs on the goals; let me know your thoughts. @himshikha

Goal

Would it make sense to add auto-tuning of shard splits, along with new shard allocation, to this proposed goal? Search performance gets slower as shard size grows, and for an OpenSearch index we can't change the number of primary shards after the index is created. I would propose taking on an additional goal here: splitting existing shards based on HeatMapManager metrics. We might want to introduce an offline shard split that minimizes the index-open downtime. This would benefit existing domains a lot, along with new primary shard allocation.

ShardManager

Will it be a separate node type? Or will it be part of the master only?

HeatMapManager

What's the motive behind making this a separate component? Can it be included in ShardManager itself? The reason is that metrics are already emitted by other components and stored by other components. Edit: simpler overall design for OpenSearch with fewer components :)

himshikha commented 1 year ago

Thanks for sharing your thoughts @gashutos

Would it make sense to add auto-tuning of shard splits, along with new shard allocation, to this proposed goal?

That is one of the goals we are targeting when we say auto-manage shard sizes. This will include auto-split or auto-merge based on different metrics.

Will it be a separate node type? Or will it be part of the master only?

We are envisioning a separate component; it may not be a separate node type/role, but rather a plugin/module. Details are yet to be defined.

What's the motive behind making this a separate component? Can it be included in ShardManager itself? The reason is that metrics are already emitted by other components and stored by other components.

The motive is to keep separate pluggable components which are easy to extend. This component will consume metrics from existing components and come up with something like a heat score. How we calculate this heat score could vary by use case, and having a separate component with clearly defined boundaries will let us swap the heat-score logic without having to worry about how allocation uses it. That being said, nothing is set in stone right now, and what the final design looks like might change based on discussion and feedback.

Gaganjuneja commented 1 year ago

@himshikha Thanks for putting this up. Could you please elaborate a bit on how ShardManager will coordinate/connect with other nodes and collect the metrics data (through node stats?)?

dblock commented 12 months ago

This is a good proposal. I'd be interested in reading what "optimal shard allocation" may look like. I think it's important to assert some kind of measurable thesis against which we can compare various shard allocation strategies, and eventually settle on new defaults.

ViggoC commented 12 months ago

I am very excited to see this proposal. The balancer based on shard count often leads to hotspot problems in our cluster.

I think a shard manager based on real-time shard/node load is a direction worthy of in-depth research. The core problem is how to describe the load of a shard. One idea is to build a model with machine learning, but model generalization might be a big problem when applying the model to clusters serving very different businesses. It'd be better to provide an API to fine-tune the shard load model with real-time metrics.

ViggoC commented 12 months ago

I think we also need to consider the performance during rebalancing. Currently, all metadata changes trigger rerouting in the cluster, which is slow in a large-scale cluster and can block subsequent master operations, affecting service stability.

himshikha commented 11 months ago

@dblock Agreed that we need a measure of success showing that the proposed changes improve cluster performance. Shard/node heat itself could be a metric to track, in addition to cluster performance metrics like CPU, disk IOPS, JVM, etc. We can define a series of search/indexing load-inducing tests and track whether the cluster remains in a more optimal state throughout and is able to rebalance itself with the new allocation strategies.

himshikha commented 11 months ago

@Gaganjuneja The "how" part of this is not yet frozen. At a high level, we will try to use existing metric emission/consumption mechanisms and build on those, rather than creating something new, wherever possible.

ViggoC commented 11 months ago

I am not sure whether the scope of this proposal includes the allocation of newly created shards. For indexes that do not exist yet, it is difficult to evaluate shard load without historical data.

shwetathareja commented 11 months ago

Thanks @himshikha for the proposal.

Heat-based shard allocation would be a great improvement over the current shard allocation, which is mostly based on counts and doesn't consider shards' resource footprints.

We will extract the shard allocation logic (deciders, balancers) into a separate component, making it easier to extend and to add new allocation deciders as needed. Deciders can be further broken down by the resources they decide on.

Have you looked into ClusterPlugin? The balancer and deciders are already configurable:

https://github.com/opensearch-project/OpenSearch/blob/84be8c9207cf1153b2eb8dfaf77cf737959781cc/server/src/main/java/org/opensearch/plugins/ClusterPlugin.java#L61-L85
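For reference, a minimal sketch of contributing a decider through that existing hook; the no-op anonymous decider (which inherits the default "always allow" behavior) is only there to keep the example self-contained.

```java
import java.util.Collection;
import java.util.List;

import org.opensearch.cluster.routing.allocation.decider.AllocationDecider;
import org.opensearch.common.settings.ClusterSettings;
import org.opensearch.common.settings.Settings;
import org.opensearch.plugins.ClusterPlugin;
import org.opensearch.plugins.Plugin;

/** Sketch: a plugin registering a custom AllocationDecider via ClusterPlugin. */
public class CustomAllocationPlugin extends Plugin implements ClusterPlugin {
    @Override
    public Collection<AllocationDecider> createAllocationDeciders(Settings settings, ClusterSettings clusterSettings) {
        // Replace the anonymous no-op decider with a real implementation.
        return List.of(new AllocationDecider() { });
    }
}
```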

shwetathareja commented 11 months ago

This can be extended to improve cluster performance by auto-tuning shard sizes and replica counts based on indexing and search workloads.

I am not clear: are you proposing this to be part of core, or would this be done via ISM policies?

himshikha commented 11 months ago

@shwetathareja This would either be part of the ShardManager component, or we can create separate pluggable components for these functionalities. The idea is to let users choose whether to enable a feature based on the trade-off between performance improvement and resource utilization.

shwetathareja commented 11 months ago

@himshikha why do we need to create a new component? Is there a reason users can't configure this via ISM?

ViggoC commented 11 months ago

There is a Performance Analyzer plugin which collects CPU usage and heap usage at the shard level. What do you think of building on the data collected by this plugin?

khushbr commented 10 months ago

Thank you for the proposal @himshikha; an adaptive, resource-aware shard management layer will add to data availability and resiliency in OpenSearch.

  1. Shard allocation and balancing is an optimization problem. Given that the workload in OpenSearch can be highly dynamic (with microbursts present), how does the algorithm handle the trade-off between frequent shard movement and optimal performance?
  2. There can be multiple levels of performance hot spots in OpenSearch: node, index, shard, and doc. Can you elaborate more on how this framework will solve these bottlenecks?