opensearch-project / opensearch-benchmark

OpenSearch Benchmark - a community driven, open source project to run performance tests for OpenSearch
https://opensearch.org/docs/latest/benchmark/
Apache License 2.0

[RFC] Enhancements for OSB Workloads #253

Open gkamat opened 1 year ago

gkamat commented 1 year ago

[RFC] Enhancements for OSB Workloads

Synopsis

This is an RFC for a proposal to enhance the workloads available with OpenSearch Benchmark. It encompasses both adding new workloads and extending the capabilities of the existing ones.

OpenSearch Benchmark currently ships with a small set of workloads that are used to evaluate the ingestion and query performance of OpenSearch clusters. The limited number of workloads, the small size of the associated data corpora, and the incomplete coverage of OpenSearch capabilities by the supplied queries are a hindrance to effective performance, scale, and longevity testing.

This RFC outlines options to address some of these issues.

Motivation

OpenSearch Benchmark (OSB) is a performance testing framework intended for evaluating the performance of OpenSearch. OSB is a workload generator that ships with a set of workloads, which are the scenarios that will be executed by the generator against a target cluster. It is a fork of ESRally and can be used against OpenSearch clusters as well as Elasticsearch clusters at version 7.10 and earlier. When OSB was forked, most of the workloads that ship with Rally were also forked.

A common use of OSB is to benchmark different OpenSearch versions and compare them. For example, different versions can be installed on the same hardware configuration and the same workloads run against each. The performance of these versions can then be compared (for example, as latency vs. throughput curves) to see where one version does better than another, or whether the performance of a newer version has regressed in some regard. OSB’s capabilities include running benchmarks, recording the results, tracking system metrics on the target cluster, and comparing tests.

Workloads have associated test scenarios (termed test_procedures) that comprise a set of operations such as creating and deleting indices, checking the health of the cluster, merging segments, ingesting data, and running queries. From the perspective of OpenSearch performance, the latter two are evidently the ones of most interest.
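For context, a test_procedure is declared in the workload's JSON files as a named schedule of such operations. The following is a minimal sketch of that shape, shown as a Python literal that mirrors the JSON; the operation names and parameter values are illustrative rather than copied from a specific shipped workload.

```python
# Illustrative shape of an OSB test_procedure (the workload JSON shown as a
# Python literal). Operation names and parameters are examples only.
test_procedure = {
    "name": "append-no-conflicts",
    "schedule": [
        {"operation": {"operation-type": "delete-index"}},
        {"operation": {"operation-type": "create-index"}},
        {"operation": {"operation-type": "cluster-health",
                       "request-params": {"wait_for_status": "green"}}},
        # Ingestion phase: bulk-index the corpus with several clients.
        {"operation": {"operation-type": "bulk", "bulk-size": 5000}, "clients": 8},
        # Query phase: run a search at a fixed target throughput.
        {"operation": {"operation-type": "search",
                       "body": {"query": {"match_all": {}}}},
         "clients": 4, "warmup-iterations": 100, "iterations": 1000,
         "target-throughput": 100},
    ],
}
```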

For a user, the notion of “performance” exists in a certain context. Most users and organizations are interested in how their cluster performs with regard to their own workload, which could be quite different from the workload run by a different user. Organizations are usually reluctant to share their workloads, which often contain proprietary data. Therefore, the task of coming up with a representative set of workloads for OpenSearch in general is not a trivial one.

The workloads that ship with OSB contain an assorted set of scenarios from various domains, including search and log analytics, with a range of document types, mappings, and queries that the authors of Rally put together from publicly available data corpora. They exercise various use cases and do provide good insight into OpenSearch performance. However, there are areas that are not covered well, leading to performance regressions such as this one. This RFC focuses on the creation of additional workloads and on enhancements to the currently included ones so that they exercise additional scenarios of interest in measuring OpenSearch performance.

Areas for Enhancement

There are a few major areas of improvement with regard to the workloads available with OSB. Some of them are enumerated below, not necessarily in priority order:

Stakeholders

Proposed Priorities

As indicated above, enhancing the OSB workloads would be a multifaceted, complex, long-term, and ongoing endeavor. It will need to be approached in prioritized phases. Each of the areas outlined above will need in-depth research and analysis before it can be embarked upon. Community engagement will help expedite the process, and indeed is crucial to enlarging the domains covered by the workloads.

With that in mind, this RFC suggests that a couple of the most pressing items in the list be addressed first: providing a mechanism to increase the size of the data corpora and improving coverage of OpenSearch data types. The solutions proposed below would alleviate the issues substantially in the short term, but these capabilities will also need ongoing enhancement going forward.

The priorities of the items to be tackled subsequently still need to be evaluated, and community feedback will be helpful in this regard.

Requirements

Here are the anticipated requirements for the two proposed areas of focus:

Increasing the size of the data corpora

Better coverage of supported data types in OpenSearch

Use Cases

Implementation Notes

These rudimentary notes cover aspects relating to the first two proposed projects for now. More details will be added as the investigation proceeds.

Better coverage of supported data types in OpenSearch

Increasing the size of the data corpora

How Can You Help?

Next Steps

We will incorporate feedback and add more details on design, implementation and prototypes as they become available.

msfroh commented 1 year ago

Better coverage of supported data types in OpenSearch

The bullets in the implementation notes cover a lot of cases, which is great!

I just wanted to call out some issues that I've encountered recently (which I would love to see covered by future benchmarks):

  1. For some of the types, we should include cases where the corresponding field does or does not have doc values, or (where possible) where the field is or is not indexed. For example, there are some cases where there are optimizations based on IndexOrDocValuesQuery. Those optimizations will behave differently depending on whether the field is indexed, has doc values, or both. (A mapping sketch illustrating this follows this list.)
  2. I've stumbled across a couple of cases recently where performance is impacted by surprising combinations of "what's in the index", "what's in the query", and "what's in the index that matches the query". Usually, we expect that last category to have the most impact, but not always. For example, if you have a terms aggregation with an include clause with a lot of terms, and you run that aggregation on an index where that field has a lot of terms, it's going to take a long time -- regardless of whether the terms in the query match any of the terms in the index.
  3. I would like to see some cases where we have index sorting enabled, since that can have significant impact on performance.
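To make point 1 concrete, here is a minimal sketch of mapping variants a benchmark could exercise, using the opensearch-py client; the index names, the status field, and the range query are illustrative assumptions rather than an existing workload.

```python
# Sketch: three variants of the same numeric field so range queries can be
# benchmarked with "indexed only", "doc values only", and "indexed + doc values".
# Index and field names are illustrative; requires opensearch-py.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

variants = {
    "logs-indexed-only": {"type": "long", "index": True, "doc_values": False},
    "logs-docvalues-only": {"type": "long", "index": False, "doc_values": True},
    "logs-indexed-and-docvalues": {"type": "long", "index": True, "doc_values": True},
}

for index_name, status_mapping in variants.items():
    client.indices.create(
        index=index_name,
        body={"mappings": {"properties": {"status": status_mapping}}},
    )

# The same range query may be planned differently per variant (e.g. via
# IndexOrDocValuesQuery when both the index and doc values are available).
query = {"query": {"range": {"status": {"gte": 200, "lt": 300}}}}
for index_name in variants:
    client.search(index=index_name, body=query)
```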
gkamat commented 1 year ago

@msfroh, those are all very good points that will need to be kept in mind as the coverage of data types is improved. We'll initially focus on adding queries for the data types that are not currently being exercised, and then add others for cases like the ones you mention above.

nandi-github commented 1 year ago

Customer-representative workloads. The workloads included with OSB were inherited from ESRally. They were based on publicly available data corpora and sample sets of queries developed by the authors, to demonstrate how the tool might be used. They don’t necessarily represent real-life user queries.

The title and the last disclaimer seem to contradict each other. Can you clarify whether this represents customer workloads (representative, not necessarily the exact same queries), so we can estimate real-world performance?

bbarani commented 1 year ago

publicly-available data corpora

@nandi-github The current workloads are generated using public data but don't necessarily reflect customer workloads, as they are not generated from actual customer data. It's almost impossible to create one workload that is representative of all customers, but a combination of these generic workloads, along with other workloads focused on the additional areas mentioned in this RFC, would help cover the broader surface area.

gashutos commented 1 year ago

Increasing the size of the data corpora. The data sets associated with the current workloads tend to be rather small. They range from 100 MB to 75 GB, and most are under 10 GB. Such sizes are reasonable for single-node clusters based on small instance types, but are a hindrance when testing real-life multi-node clusters that utilize instances that have substantial memory and disk capacity. Providing a mechanism to increase the size of the corpora will help in this regard.

Some of the fields should have very high cardinality, while others should be repetitive. For example: http_status_code -> low cardinality; @timestamp -> high cardinality.
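As an illustration only (not OSB's actual corpus-expansion implementation), here is a minimal sketch of cardinality-aware synthetic generation along those lines; the field names mimic the http_logs style and the distributions are assumptions.

```python
# Sketch: generate http_logs-style documents where http_status_code is drawn
# from a small, skewed set (low cardinality) while @timestamp is effectively
# unique per document (high cardinality). Illustrative only.
import json
import random
from datetime import datetime, timedelta, timezone

STATUS_CODES = [200, 200, 200, 200, 304, 404, 500]  # skewed toward 200

def generate_docs(count, start=datetime(1998, 6, 1, tzinfo=timezone.utc)):
    for i in range(count):
        yield {
            "@timestamp": (start + timedelta(milliseconds=i)).isoformat(),
            "http_status_code": random.choice(STATUS_CODES),
            "request": f"GET /images/logo_{random.randint(0, 99)}.gif HTTP/1.0",
            "size": random.randint(100, 100_000),
        }

if __name__ == "__main__":
    # Write newline-delimited JSON suitable for bulk ingestion.
    with open("synthetic-http-logs.json", "w") as f:
        for doc in generate_docs(1_000_000):
            f.write(json.dumps(doc) + "\n")
```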

gkamat commented 1 year ago

@gashutos, the current implementation for http_logs maintains the cardinality of the existing data corpus it is derived from. As new capabilities are added to this feature (for instance, synthesized fields), cardinality will be one of the attributes considered, along with distribution and other characteristics.

msfroh commented 1 year ago

I remembered something from one of the benchmark systems that I worked with previously: It would let you measure both red-line QPS and latency under normal load.

Essentially, it would benchmark in two phases:

  1. Dial up the query traffic until you start to see rejections, then hold the pressure there for about a minute. That's your red-line QPS.
  2. Then run the full benchmark (which would usually take ~30-40 minutes), sending traffic at something like 60% of red-line QPS, to measure latency under "normal load".

Usually a change that reduced the work involved in running a query would help both the red-line QPS and the observed latencies. Something like concurrent search would help observed latency, while hurting red-line QPS (unless it falls back to single-threaded under heavy load).

I don't know enough about OSB to know if doing something like the above would require a special workload or changes to OSB itself.
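Purely to illustrate the idea, here is a rough sketch of how the two phases could be driven outside OSB with opensearch-py; the ramp factor, rejection threshold, hold durations, and the 60% figure are all assumptions, not an OSB feature.

```python
# Rough sketch of the two-phase approach described above, driven outside OSB.
# All thresholds and durations are illustrative assumptions.
import time
from opensearchpy import OpenSearch
from opensearchpy.exceptions import TransportError

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
QUERY = {"query": {"match_all": {}}}

def run_at_qps(target_qps, duration_s, index="logs"):
    """Issue queries at roughly target_qps; return the observed rejection rate."""
    sent = rejected = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        started = time.monotonic()
        try:
            client.search(index=index, body=QUERY)
        except TransportError as e:
            if e.status_code == 429:  # search thread pool rejection / backpressure
                rejected += 1
        sent += 1
        # Naive single-threaded pacing; a real driver would use concurrent clients.
        time.sleep(max(0.0, 1.0 / target_qps - (time.monotonic() - started)))
    return rejected / max(sent, 1)

# Phase 1: ramp until rejections appear and hold there; that is the red-line QPS.
qps = 10.0
while run_at_qps(qps, duration_s=60) < 0.01:
    qps *= 1.5
red_line_qps = qps

# Phase 2: measure latency at a fraction of red-line QPS ("normal load").
run_at_qps(0.6 * red_line_qps, duration_s=30 * 60)
```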

nandi-github commented 1 year ago

@msfroh Good suggestion. I agree with your comments. It still needs the query traffic pattern to be defined, otherwise it is too subjective. Once the query pattern is defined: 1) highest QPS with no rejections for 3 minutes (TBD); 2) highest QPS for an agreed latency.

bbarani commented 1 year ago

@msfroh Are you looking for a gradual ramp-up feature to identify the maximum threshold of a cluster, especially when running with multiple concurrent clients?

msfroh commented 1 year ago

Are you looking for a gradual ramp-up feature to identify the maximum threshold of a cluster, especially when running with multiple concurrent clients?

That's part of it. Step 1 is to identify that max QPS threshold (i.e. what is the capacity of the cluster?)

Then step 2 is to run traffic (with multiple clients) at some proportion of max QPS to simulate "reasonable load" (since a well-run system won't run at 100% all the time -- what happens when a data center goes down?), in order to measure latency under "normal conditions".

As @nandi-github mentioned above:

It still needs the query traffic pattern to be defined, otherwise it is too subjective.

In the project where we were doing this kind of benchmarking, we had hundreds of thousands of queries from our production logs. While different queries had different latency and put different load on the system, the overall pool was big enough that any random 10k queries probably had about the same distribution of costs.

gkamat commented 1 year ago

@msfroh, this is a feature other folks have been interested in as well, although the nuances of how it would work have been described in various ways. There is an issue that touches on adding the capability to OSB to auto-scale and ramp up the intensity of the workload. As you mentioned, this would need to take into account the differences between operations. For instance, a complex aggregation would behave quite differently than a term query. Scaling up will also need to take into account the capabilities of distributed workload generation.

Once the features requested in the issue above are implemented, it will be the right point to address your request.

jmazanec15 commented 12 months ago

Related to #199 and https://github.com/opensearch-project/neural-search/issues/430, I think we have recently been adding a lot more options in OpenSearch for improving search relevance. Users can use neural queries, neural sparse queries, custom re-rankers, etc., in order to achieve better-quality results. On top of this, on the ML side, they may have several models to choose from for each configuration.

That being said, with all of those options, it can be really hard for users to converge on an optimal configuration for search. While OSB allows users to get metrics around performance (latency, throughput, etc.), users are unable to determine the benefits in search relevance they are getting. I know that it may not make sense to measure relevance while running throughput/latency-focused steps due to the overhead, but I think it would make sense to incorporate steps dedicated to relevance in the dedicated search workloads. That way, a user only needs to run one workload in order to understand the trade-offs between performance and relevance.
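As one possible shape for such a relevance step, here is a small sketch that computes NDCG@10 from search responses against a set of graded relevance judgments; the judgment format, index, and field names are assumptions for illustration, not an existing OSB capability.

```python
# Sketch: compute NDCG@10 for judged queries, alongside whatever latency and
# throughput metrics a workload already records. Judgments and index/field
# names below are illustrative assumptions.
import math
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def dcg(relevances):
    # Standard discounted cumulative gain; rank is 0-based here.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(retrieved_ids, judgments, k=10):
    gains = [judgments.get(doc_id, 0) for doc_id in retrieved_ids[:k]]
    ideal_dcg = dcg(sorted(judgments.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# query text -> {document id: graded relevance}
judged_queries = {
    "wireless headphones": {"B001": 3, "B002": 2, "B007": 1},
}

for query_text, judgments in judged_queries.items():
    response = client.search(
        index="products",
        body={"query": {"match": {"title": query_text}}, "size": 10},
    )
    retrieved = [hit["_id"] for hit in response["hits"]["hits"]]
    print(query_text, ndcg_at_k(retrieved, judgments))
```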

Please let me know if it makes sense to discuss in a separate issue.

bbarani commented 11 months ago

Referencing issue related to autoscale feature - https://github.com/opensearch-project/opensearch-benchmark/issues/373