opensearch-project / opensearch-benchmark

OpenSearch Benchmark - a community driven, open source project to run performance tests for OpenSearch
https://opensearch.org/docs/latest/benchmark/
Apache License 2.0

[RFC] Enhancements for OSB Workloads #253

Open gkamat opened 1 year ago

gkamat commented 1 year ago

[RFC] Enhancements for OSB Workloads

Synopsis

This is an RFC for a proposal to enhance the workloads available with OpenSearch Benchmark. It encompasses both adding new workloads and extending the capabilities of the existing ones.

OpenSearch Benchmark currently ships with a small set of workloads that are used to evaluate the ingestion and query performance of OpenSearch clusters. The limited number of workloads, the small size of the associated data corpora, and the incomplete coverage of OpenSearch capabilities by the supplied queries are a hindrance to effective performance, scale, and longevity testing.

This RFC outlines options to address some of these issues.

Motivation

OpenSearch Benchmark (OSB) is a performance testing framework intended for evaluating the performance of OpenSearch. OSB is a workload generator that ships with a set of workloads, which are the scenarios that will be executed by the generator against a target cluster. It is a fork of ESRally and can be used against OpenSearch clusters as well as Elasticsearch clusters at version 7.10 and earlier. When OSB was forked, most of the workloads that ship with Rally were also forked.

A common use of OSB is to benchmark different OpenSearch versions and compare them. For example, different versions can be installed on the same hardware configuration and the same workloads run against each. The performance of these versions can then be compared (for example, as latency vs. throughput curves) to see where one version does better than another, or whether the performance of a newer version has regressed in some regard. OSB’s capabilities include running benchmarks, recording the results, tracking system metrics on the target cluster, and comparing tests.

Workloads have associated test scenarios (termed test_procedures) that comprise a set of operations such as creating and deleting indices, checking the health of the cluster, merging segments, ingesting data, and running queries. From the perspective of OpenSearch performance, the latter two are evidently the ones of most interest.
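For context, a test_procedure is declared in the workload's JSON files as a named schedule of such operations. The following is a minimal sketch of that shape, shown as a Python literal that mirrors the JSON; the operation names and parameter values are illustrative rather than copied from a specific shipped workload.

```python
# Illustrative shape of an OSB test_procedure (the workload JSON shown as a
# Python literal). Operation names and parameters are examples only.
test_procedure = {
    "name": "append-no-conflicts",
    "schedule": [
        {"operation": {"operation-type": "delete-index"}},
        {"operation": {"operation-type": "create-index"}},
        {"operation": {"operation-type": "cluster-health",
                       "request-params": {"wait_for_status": "green"}}},
        # Ingestion phase: bulk-index the corpus with several clients.
        {"operation": {"operation-type": "bulk", "bulk-size": 5000}, "clients": 8},
        # Query phase: run a search at a fixed target throughput.
        {"operation": {"operation-type": "search",
                       "body": {"query": {"match_all": {}}}},
         "clients": 4, "warmup-iterations": 100, "iterations": 1000,
         "target-throughput": 100},
    ],
}
```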

For a user, the notion of “performance” exists in a certain context. Most users and organizations are interested in how their cluster performs with regard to their own workload, which could be quite different from the workload run by a different user. Organizations are usually reluctant to share their workloads, which often contain proprietary data. Therefore, the task of coming up with a representative set of workloads for OpenSearch in general is not a trivial one.

The workloads that ship with OSB contain an assorted set of scenarios from various domains, including search and log analytics, with a range of document types, mappings, and queries that the authors of Rally put together from publicly available data corpora. They exercise various use cases and do provide good insight into OpenSearch performance. However, there are areas that are not covered well, leading to performance regressions such as this one. This RFC focuses on the creation of additional workloads and on enhancements to the currently included ones so that they exercise additional scenarios of interest in measuring OpenSearch performance.

Areas for Enhancement

There are a few major areas of improvement with regard to the workloads available with OSB. Some of them are enumerated below, not necessarily in priority order:

Stakeholders

Proposed Priorities

As indicated above, enhancing the OSB workloads would be a multifaceted, complex, long-term, and ongoing endeavor. It will need to be approached in prioritized phases. Each of the areas outlined above will need in-depth research and analysis before it can be embarked upon. Community engagement will help expedite the process, and indeed is crucial to enlarging the domains covered by the workloads.

With that in mind, this RFC suggests that a couple of the most pressing items in the list be addressed first: providing a mechanism to increase the size of the data corpora and improving coverage of OpenSearch data types. The solutions proposed below would alleviate the issues substantially in the short term, but these capabilities will also need ongoing enhancement going forward.

The priorities of the items to be tackled subsequently still need to be evaluated, and community feedback will be helpful in this regard.

Requirements

Here are the anticipated requirements for the two proposed areas of focus:

Increasing the size of the data corpora

Better coverage of supported data types in OpenSearch

Use Cases

Implementation Notes

These rudimentary notes cover aspects relating to the first two proposed projects for now. More details will be added as the investigation proceeds.

Better coverage of supported data types in OpenSearch

Increasing the size of the data corpora

How Can You Help?

Next Steps

We will incorporate feedback and add more details on design, implementation and prototypes as they become available.

msfroh commented 1 year ago

Better coverage of supported data types in OpenSearch

The bullets in the implementation notes cover a lot of cases, which is great!

I just wanted to call out some issues that I've encountered recently (which I would love to see covered by future benchmarks):

  1. For some of the types, we should include cases where the corresponding field does or does not have doc values, or (where possible) where the field is or is not indexed. For example, there are some cases where there are optimizations based on IndexOrDocValuesQuery. Those optimizations will behave differently depending on whether the field is indexed, has doc values, or both. (A mapping sketch illustrating this follows this list.)
  2. I've stumbled across a couple of cases recently where performance is impacted by surprising combinations of "what's in the index", "what's in the query", and "what's in the index that matches the query". Usually, we expect that last category to have the most impact, but not always. For example, if you have a terms aggregation with an include clause with a lot of terms, and you run that aggregation on an index where that field has a lot of terms, it's going to take a long time -- regardless of whether the terms in the query match any of the terms in the index.
  3. I would like to see some cases where we have index sorting enabled, since that can have significant impact on performance.
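To make point 1 concrete, here is a minimal sketch of mapping variants a benchmark could exercise, using the opensearch-py client; the index names, the status field, and the range query are illustrative assumptions rather than an existing workload.

```python
# Sketch: three variants of the same numeric field so range queries can be
# benchmarked with "indexed only", "doc values only", and "indexed + doc values".
# Index and field names are illustrative; requires opensearch-py.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

variants = {
    "logs-indexed-only": {"type": "long", "index": True, "doc_values": False},
    "logs-docvalues-only": {"type": "long", "index": False, "doc_values": True},
    "logs-indexed-and-docvalues": {"type": "long", "index": True, "doc_values": True},
}

for index_name, status_mapping in variants.items():
    client.indices.create(
        index=index_name,
        body={"mappings": {"properties": {"status": status_mapping}}},
    )

# The same range query may be planned differently per variant (e.g. via
# IndexOrDocValuesQuery when both the index and doc values are available).
query = {"query": {"range": {"status": {"gte": 200, "lt": 300}}}}
for index_name in variants:
    client.search(index=index_name, body=query)
```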
gkamat commented 1 year ago

@msfroh, those are all very good points that will need to be kept in mind as the coverage of data types is improved. We'll initially focus on adding queries for the data types that are not currently being exercised, and then add others for cases like the ones you mention above.

nandi-github commented 1 year ago

Customer-representative workloads. The workloads included with OSB were inherited from ESRally. They were based on publicly available data corpora and sample sets of queries developed by the authors, to demonstrate how the tool might be used. They don’t necessarily represent real-life user queries.

The title and the last disclaimer seem to contradict each other. Can you clarify whether this represents customer workloads (representative, not necessarily the exact same queries), so we can estimate real-world performance?

bbarani commented 1 year ago

publicly-available data corpora

@nandi-github The current workloads are generated using public data but don't necessarily reflect customer workloads, as they are not generated from actual customer data. It's almost impossible to create one workload that is representative of all customers, but a combination of these generic workloads, along with other workloads focused on the additional areas mentioned in this RFC, would help cover the broader surface area.

gashutos commented 1 year ago

Increasing the size of the data corpora. The data sets associated with the current workloads tend to be rather small. They range from 100 MB to 75 GB, and most are under 10 GB. Such sizes are reasonable for single-node clusters based on small instance types, but are a hindrance when testing real-life multi-node clusters that utilize instances that have substantial memory and disk capacity. Providing a mechanism to increase the size of the corpora will help in this regard.

Some of the fields should have very high cardinality, while others should be repetitive. For example: http_status_code -> low cardinality; @timestamp -> high cardinality.
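As an illustration only (not OSB's actual corpus-expansion implementation), here is a minimal sketch of cardinality-aware synthetic generation along those lines; the field names mimic the http_logs style and the distributions are assumptions.

```python
# Sketch: generate http_logs-style documents where http_status_code is drawn
# from a small, skewed set (low cardinality) while @timestamp is effectively
# unique per document (high cardinality). Illustrative only.
import json
import random
from datetime import datetime, timedelta, timezone

STATUS_CODES = [200, 200, 200, 200, 304, 404, 500]  # skewed toward 200

def generate_docs(count, start=datetime(1998, 6, 1, tzinfo=timezone.utc)):
    for i in range(count):
        yield {
            "@timestamp": (start + timedelta(milliseconds=i)).isoformat(),
            "http_status_code": random.choice(STATUS_CODES),
            "request": f"GET /images/logo_{random.randint(0, 99)}.gif HTTP/1.0",
            "size": random.randint(100, 100_000),
        }

if __name__ == "__main__":
    # Write newline-delimited JSON suitable for bulk ingestion.
    with open("synthetic-http-logs.json", "w") as f:
        for doc in generate_docs(1_000_000):
            f.write(json.dumps(doc) + "\n")
```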

gkamat commented 1 year ago

@gashutos, the current implementation for http_logs maintains the cardinality of the existing data corpus it is derived from. As new capabilities are added to this feature (for instance, synthesized fields), cardinality will be one of the attributes considered, along with distribution and other characteristics.

msfroh commented 1 year ago

I remembered something from one of the benchmark systems that I worked with previously: It would let you measure both red-line QPS and latency under normal load.

Essentially, it would benchmark in two phases:

  1. Dial up the query traffic until you start to see rejections, then hold the pressure there for about a minute. That's your red-line QPS.
  2. Then run the full benchmark (which would usually take ~30-40 minutes), sending traffic at something like 60% of red-line QPS, to measure latency under "normal load".

Usually a change that reduced the work involved in running a query would help both the red-line QPS and the observed latencies. Something like concurrent search would help observed latency, while hurting red-line QPS (unless it falls back to single-threaded under heavy load).

I don't know enough about OSB to know if doing something like the above would require a special workload or changes to OSB itself.
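Purely to illustrate the idea, here is a rough sketch of how the two phases could be driven outside OSB with opensearch-py; the ramp factor, rejection threshold, hold durations, and the 60% figure are all assumptions, not an OSB feature.

```python
# Rough sketch of the two-phase approach described above, driven outside OSB.
# All thresholds and durations are illustrative assumptions.
import time
from opensearchpy import OpenSearch
from opensearchpy.exceptions import TransportError

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
QUERY = {"query": {"match_all": {}}}

def run_at_qps(target_qps, duration_s, index="logs"):
    """Issue queries at roughly target_qps; return the observed rejection rate."""
    sent = rejected = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        started = time.monotonic()
        try:
            client.search(index=index, body=QUERY)
        except TransportError as e:
            if e.status_code == 429:  # search thread pool rejection / backpressure
                rejected += 1
        sent += 1
        # Naive single-threaded pacing; a real driver would use concurrent clients.
        time.sleep(max(0.0, 1.0 / target_qps - (time.monotonic() - started)))
    return rejected / max(sent, 1)

# Phase 1: ramp until rejections appear and hold there; that is the red-line QPS.
qps = 10.0
while run_at_qps(qps, duration_s=60) < 0.01:
    qps *= 1.5
red_line_qps = qps

# Phase 2: measure latency at a fraction of red-line QPS ("normal load").
run_at_qps(0.6 * red_line_qps, duration_s=30 * 60)
```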

nandi-github commented 1 year ago

@msfroh Good suggestion. I agree with your comments. It still needs the query traffic pattern to be defined, otherwise it is too subjective. Once the query pattern is defined: 1) highest QPS with no rejections for 3 minutes (TBD); 2) highest QPS for an agreed latency.

bbarani commented 1 year ago

@msfroh Are you looking for a gradual ramp-up feature to identify the maximum threshold of a cluster, especially when running with multiple concurrent clients?

msfroh commented 1 year ago

Are you looking for a gradual ramp-up feature to identify the maximum threshold of a cluster, especially when running with multiple concurrent clients?

That's part of it. Step 1 is to identify that max QPS threshold (i.e. what is the capacity of the cluster?)

Then step 2 is to run traffic (with multiple clients) at some proportion of max QPS to simulate "reasonable load" (since a well-run system won't run at 100% all the time -- what happens when a data center goes down?), in order to measure latency under "normal conditions".

As @nandi-github mentioned above:

It still needs the query traffic pattern to be defined, otherwise it is too subjective.

In the project where we were doing this kind of benchmarking, we had hundreds of thousands of queries from our production logs. While different queries had different latency and put different load on the system, the overall pool was big enough that any random 10k queries probably had about the same distribution of costs.

gkamat commented 1 year ago

@msfroh, this is a feature other folks have been interested in as well, although the nuances of how it would work have been described in various ways. There is an issue that touches on adding the capability to OSB to auto-scale and ramp up the intensity of the workload. As you mentioned, this would need to take into account the differences between operations. For instance, a complex aggregation would behave quite differently than a term query. Scaling up will also need to take into account the capabilities of distributed workload generation.

Once the features requested in the issue above are implemented, it will be the right point to address your request.

jmazanec15 commented 12 months ago

Related to #199 and https://github.com/opensearch-project/neural-search/issues/430, I think we have recently been adding a lot more options in OpenSearch for improving search relevance. Users can use neural queries, neural sparse queries, custom re-rankers, etc., in order to achieve better-quality results. On top of this, on the ML side, they may have several models to choose from for each configuration.

That being said, with all of those options, it can be really hard for users to converge on an optimal configuration for search. While OSB allows users to get metrics around performance (latency, throughput, etc.), users are unable to determine the benefits in search relevance they are getting. I know that it may not make sense to measure relevance while running throughput/latency-focused steps due to the overhead, but I think it would make sense to incorporate steps dedicated to relevance in the dedicated search workloads. That way, a user only needs to run one workload in order to understand the trade-offs between performance and relevance.
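As one possible shape for such a relevance step, here is a small sketch that computes NDCG@10 from search responses against a set of graded relevance judgments; the judgment format, index, and field names are assumptions for illustration, not an existing OSB capability.

```python
# Sketch: compute NDCG@10 for judged queries, alongside whatever latency and
# throughput metrics a workload already records. Judgments and index/field
# names below are illustrative assumptions.
import math
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def dcg(relevances):
    # Standard discounted cumulative gain; rank is 0-based here.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(retrieved_ids, judgments, k=10):
    gains = [judgments.get(doc_id, 0) for doc_id in retrieved_ids[:k]]
    ideal_dcg = dcg(sorted(judgments.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# query text -> {document id: graded relevance}
judged_queries = {
    "wireless headphones": {"B001": 3, "B002": 2, "B007": 1},
}

for query_text, judgments in judged_queries.items():
    response = client.search(
        index="products",
        body={"query": {"match": {"title": query_text}}, "size": 10},
    )
    retrieved = [hit["_id"] for hit in response["hits"]["hits"]]
    print(query_text, ndcg_at_k(retrieved, judgments))
```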

Please let me know if it makes sense to discuss in a separate issue.

bbarani commented 11 months ago

Referencing issue related to autoscale feature - https://github.com/opensearch-project/opensearch-benchmark/issues/373