[RFC]: Scale-Up Improvements on Single Load Generation Host

IanHoang commented 4 months ago

Synopsis

This RFC is intended to address OpenSearch Benchmark’s (OSB) limitations pertaining to operation at large scale. Several users have reported that OSB performance tests do not scale when a large number of client threads is specified. Overcoming this limitation is crucial for the OpenSearch community as it will unblock these users and other stakeholders and potentially lead to the development of new features within OpenSearch. This RFC proposes an investigation into the scaling limitations and subsequent options to overcome them, which may include modifications to OSB’s client architecture.

Motivation

By specifying certain workload parameters, OSB users can alter characteristics of a benchmark. OSB has a clients parameter that allows users to specify the number of parallel threads to perform a task or operation. This parameter can be specified by setting bulk_indexing_clients and search_clients; these simulate the number of clients that issue indexing and search requests respectively. By default, workloads packaged with OSB have bulk_indexing_clients set to 8 and search_clients set to 1 unless the user specifies otherwise. These clients all run in parallel on the load generation host where OSB is installed and invoked. OSB achieves this by leveraging the Thespianpy library, an actor model framework.
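As an illustration of how these overrides are usually written, the sketch below renders the two client parameters as comma-separated key:value pairs, the form that OSB's --workload-params option accepts. The helper name and the client counts are hypothetical and chosen only for illustration.

```python
def format_workload_params(params: dict) -> str:
    """Render a dict as comma-separated key:value pairs."""
    return ",".join(f"{key}:{value}" for key, value in params.items())


# Hypothetical overrides; the packaged defaults are 8 and 1 respectively.
overrides = {"bulk_indexing_clients": 16, "search_clients": 4}
workload_params = format_workload_params(overrides)
print(workload_params)
```

The resulting string would be passed as the value of --workload-params when invoking OSB.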

When a user wants to increase the intensity of load imposed on the target system-under-test (an OpenSearch cluster or a Serverless implementation), the natural technique is to increase the number of clients by utilizing one of the parameters above. By doing this, users can simulate the traffic patterns seen in their production environments and better understand their cluster’s limitations.

However, many users have reported that OSB encounters scaling limitations when the number of clients is increased beyond a certain level. Such bottlenecks may result from OSB’s client architecture design, such as this one, which details how OSB cannot scale out the number of clients or achieve certain throughput levels due to design constraints. Users have claimed they can scale up to 16 clients successfully but OSB performance begins to degrade once they go beyond 32 clients. There have also been reports that the current client architecture might not use the load generation host’s resources effectively. To work around these pain points, some users have found makeshift solutions and discovered that such workarounds can lead to better resource utilization and workload performance.

Evidently, even a highly efficient application running on a single load generation host will cease to scale beyond a certain point, when the workload has consumed all available resources. At that point, it becomes necessary to scale up by using a larger instance, or to scale out by adding load generation hosts that operate in parallel. With regard to the latter, Distributed Workload Generation (DWG) is a feature that comes with OSB and coordinates a group of load generation hosts to drive load to the OpenSearch cluster. This feature, which uses a scaled-out number of hosts, is intended for use in the scenario described above. However, the feature has not been thoroughly tested and the exact scenarios in which it should be used are not well understood.

This RFC was inspired by these user pain points and focuses on understanding which OSB components are involved in simulating clients, what the limitations are with these components, and which changes can be made to remove such limitations, thereby making the use of a single load generation host’s resources more efficient.

This RFC proposes two phases of scale testing. The first phase will focus on making OSB scale as well as possible on a single load generation host; the second will focus on DWG and will be covered by a separate RFC. The first phase will consist of identifying limitations, verifying workarounds, tracking down bottlenecks or causes of the limitations, overcoming bottlenecks, and publishing info on discoveries and actions taken.

This RFC also provides an opportunity to determine whether OSB should support other language clients for OpenSearch. Since OSB is primarily written in Python, there have been questions on whether the Python GIL, which prevents multiple threads within one process from executing Python bytecode in parallel, or Python’s asyncio library limits OSB’s scalability in search clients. If Python is a limitation, OSB might need to be rearchitected to become more modular and be able to use other OpenSearch language clients (such as Go, Rust, and Java).
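To make the asyncio question concrete, here is a minimal sketch of several simulated clients sharing one event loop. It is not OSB’s client code: the function names are hypothetical and asyncio.sleep stands in for a network round trip to the cluster. All coroutines run on a single thread, so any CPU-bound work done between awaits (for example request serialization or metrics bookkeeping) is serialized across clients, which is the kind of limit the investigation would need to measure.

```python
import asyncio


async def simulated_client(client_id: int, requests: int, latency_s: float) -> int:
    """Issue `requests` fake requests; asyncio.sleep stands in for network I/O."""
    completed = 0
    for _ in range(requests):
        # Yields control to the event loop while "waiting" on the cluster.
        await asyncio.sleep(latency_s)
        completed += 1
    return completed


async def run_clients(num_clients: int, requests: int, latency_s: float) -> list:
    """Run all simulated clients concurrently on one event loop (one thread)."""
    tasks = [simulated_client(i, requests, latency_s) for i in range(num_clients)]
    return await asyncio.gather(*tasks)


# Hypothetical settings: 8 clients, 5 requests each, 1 ms simulated latency.
results = asyncio.run(run_clients(num_clients=8, requests=5, latency_s=0.001))
print(sum(results))  # total requests completed across all clients
```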

Areas of Interest

Since this RFC is focused on enhancing OSB performance at scale, our areas of interest will be the Worker Coordinator Actor and the Worker Actor(s), since they are primarily involved in scaling out clients, which consume the load generation host’s resources. Specifically, we’re interested in analyzing the components’ code, stress testing them, and seeing how they perform under various conditions to identify any shortcomings. For more information on why we chose these areas of interest, see the Appendix.

Stakeholders

OpenSearch users who want to test large data corpora at high load intensities, and commercial serverless offerings of OpenSearch that are interested in simulating high-intensity loads to test their service’s efficacy in scaling out
OpenSearch users and developers who want to simulate clients seen in their production environment or performance test new features
Commercial managed services that track OpenSearch performance for the releases they offer against large clusters
Corporate benchmarking teams, who may be interested in benchmarking their use-cases with OpenSearch against other options

Proposed Priorities

This RFC proposes a separation of associated activities into two sequential investigations:

The first will focus on the scaling aspects of OSB on a single load generation host.
The second will focus on DWG improvements in a distributed load generation deployment, and will have its own RFC.

Community engagement is invited and will help with both phases. Scale testing OSB’s client architecture will need to be thorough to ensure we are covering enough scope and feedback on this front will be helpful. More details for the first phase are explored in the following section.

Requirements

Investigating and improving OSB’s client architecture can be broken down into several steps:

Identify current limits: The OSB community is aware that there are limitations in terms of scaling clients within OSB, but is unsure of what those exact limitations are. A majority of the time, OSB is used as a single load generation host to emulate the performance of a fleet of nodes. Therefore, a performance comparison between a cluster of nodes, each with OSB set to a single client, and a single node with OSB set to several clients will help uncover what those exact limitations are.
Identify workarounds if possible: After understanding the limitations, we will determine if there are any quick workarounds that users can apply to alleviate scaling limitations while work progresses on long-term solutions.
Investigate bottlenecks (or causes of limitations): For the limitations discovered in step 1, we will need to investigate the bottlenecks in more depth and identify causes.
Overcome bottlenecks: Identify and implement appropriate solutions that resolve the bottlenecks and remove the limitations.
Publish info on discoveries and actions taken: After all the work has been done, we should summarize our findings and solutions and ensure that OSB has been appropriately updated to handle scaling better.
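The comparison in the first step can be summarized as a scaling efficiency: the throughput measured with N clients divided by N times the single-client throughput. The sketch below is one hedged way to compute it and to flag the client count at which efficiency drops below a chosen threshold. The function names, the 0.8 threshold, and the throughput figures are placeholders invented for illustration, not measurements.

```python
def scaling_efficiency(single_client_throughput: float,
                       num_clients: int,
                       measured_throughput: float) -> float:
    """Ratio of measured throughput to ideal linear scaling (1.0 = perfect)."""
    ideal = single_client_throughput * num_clients
    return measured_throughput / ideal


def first_degraded_client_count(measurements: dict, threshold: float):
    """Return the smallest client count whose efficiency is below `threshold`.

    `measurements` maps client count -> throughput and must contain key 1.
    Returns None if every measured point meets the threshold.
    """
    baseline = measurements[1]
    for num_clients in sorted(measurements):
        efficiency = scaling_efficiency(baseline, num_clients, measurements[num_clients])
        if efficiency < threshold:
            return num_clients
    return None


# Placeholder numbers for illustration only (ops/s); NOT real OSB results.
toy_measurements = {1: 100.0, 2: 200.0, 4: 380.0, 8: 600.0}
print(first_degraded_client_count(toy_measurements, threshold=0.8))
```

With real data, the single-client baseline would come from the fleet of single-client hosts and the other points from the single host running several clients.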

Subsequent issues will be created to address these requirements and elaborate on implementation details.

Use Cases

As an OpenSearch user and developer, I want to be able to accurately simulate the number of clients I expect to see or have seen in my production environment
As an OpenSearch user, I would like to test large workloads at high intensities
As an OpenSearch user, I would like to have reliable metrics when simulating clients
As an OpenSearch user and developer, I want to be able to scale out the number of clients or have OSB auto-scale its clients to reach a specific throughput level
As an OpenSearch developer, I would like to perform scale tests to determine how my OpenSearch cluster performs at various scales

How Can You Help?

Any general comments about the overall direction are welcome.
Indicating whether the areas identified above for investigating and improving OSB’s client architecture and performance include your scenarios and use-cases will be helpful in prioritizing them.
Provide early feedback when analyzing results from the investigation and testing out new features to enhance OSB’s client architecture as soon as they become available.
Help out on the implementation! Check out the issues page for work that is ready to be picked up.

Next Steps

We will incorporate feedback and add more details on design, implementation and prototypes as they become available.

Appendix

Benchmarking Process Under the Hood

OSB uses a group of actors that are based on thespian.actors from the Thespianpy library, an actor model framework available in Python. These actors coordinate with one another and can be viewed as the components that make up OSB’s benchmarking process.

Each actor has its own responsibility. For example, the Benchmark Actor starts the overall benchmarking process and calls upon the Builder Actor to determine if there is a provisioned OpenSearch cluster. The Worker Coordinator Actor is called upon to prepare the benchmark by communicating with the Workload Preparation Actor. To supply load to the OpenSearch cluster, the Worker Coordinator Actor provisions a number of Worker Actors based on the number of CPU cores in the load generation host. Based on the number of clients (such as bulk_indexing_clients and search_clients) set in the workload or specified by the user, each Worker Actor is allocated a number of clients and steps (also known as tasks or operations in workloads) to execute. The workers split up the work, and each simulates some number of clients. Once the workers and their respective clients have finished executing a step, they wait at a joinpoint until the Worker Coordinator Actor informs them to proceed to the next step.
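The allocation described above can be sketched as follows. This is not OSB’s implementation: the text only says that the worker count is based on the number of CPU cores, so capping workers at min(clients, cores) and assigning clients round-robin are assumptions, and all function names are hypothetical. The actors_per_core parameter is likewise a hypothetical knob for experimenting with more than one worker per core.

```python
import os


def allocate_clients(num_clients: int, num_workers: int) -> list:
    """Assign client ids to workers round-robin (assumed policy)."""
    workers = [[] for _ in range(num_workers)]
    for client_id in range(num_clients):
        workers[client_id % num_workers].append(client_id)
    return workers


def plan_workers(num_clients: int, cpu_cores=None, actors_per_core: int = 1) -> list:
    """Choose a worker count from the core count, then spread clients over it."""
    cores = cpu_cores or os.cpu_count() or 1
    # Assumption: never start more workers than there are clients to simulate.
    num_workers = max(1, min(num_clients, cores * actors_per_core))
    return allocate_clients(num_clients, num_workers)


# Hypothetical example: 10 clients on a 4-core load generation host.
print(plan_workers(10, cpu_cores=4))
# -> [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

With fewer clients than cores, each worker simulates a single client and the remaining cores are left without a worker, which is one place where host resources could go unused.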

gkamat commented 3 months ago

This is a timely RFC. Understanding how well OSB scales is essential to carrying out accurate and meaningful performance tests. A few comments:

This investigation should not be restricted to search clients. Ingestion is just as important a phase and for implementations like Serverless, ingestion rate is an important factor in determining when to scale out the compute resources.
The ASG strategy for comparison is a good technique to evaluate the efficacy of OSB scaling. It will be helpful to lay out an overview of the modus operandi in the synopsis, rather than in the description of the experiments. Ideally, the detailed description should be moved out into the child tasks rather than being included here (there appears to be some duplication currently.)
Would it make sense to use OSB with a single client on each ASG node rather than using an opensearch-py client? This may be helpful for ingestion and perhaps even for queries, rather than setting up the tooling for a script and an ingestion corpus.
With regard to worker actors, there is also the question of the right number of actors per core. Perhaps more than 1 actor per core may result in better utilization and improved scaling. This is an area that deserves some attention.
Milestones should be moved out of the RFC into child tasks. The meta issue for this project should be noted at the beginning of the RFC for easy reference.
Effort and duration should be tracked separately, since schedules can change and there may be multiple participants working on this effort. A general high-level estimate should suffice, even in the child tasks, rather than precise dates.
The concepts of "validation" and "limitations" can do with some elaboration. In some sense, all the questions are interlinked. If, for "validation", the intent is to use opensearch-py clients and compare against OSB clients (to estimate the overhead of OSB), that should be indicated accordingly. As noted above, setting up the comparative testbeds in this case will be more involved. Most likely, the first set of experiments will provide guidance on when OSB running with multiple clients fails to scale and the second set will provide insight into resource requirements.

IanHoang commented 3 months ago

I have updated the RFC to be more high-level and concise based on the feedback received.
