opensearch-project / opensearch-benchmark

OpenSearch Benchmark - a community driven, open source project to run performance tests for OpenSearch
https://opensearch.org/docs/latest/benchmark/
Apache License 2.0

[RFC]: Introducing Aggregation and Enhanced Comparison for OSB #627

Open OVI3D0 opened 2 months ago

OVI3D0 commented 2 months ago

Synopsis

OpenSearch Benchmark (OSB) is a performance testing tool for OpenSearch, a community-driven, open source search and analytics suite. It allows users to benchmark various aspects of OpenSearch, such as indexing and querying, under different configurations and workloads. The Compare API is a feature in OSB that allows users to analyze and compare the performance differences between two benchmark test executions. While valuable, the current implementation has certain limitations. This RFC proposes enhancements to the Compare API that will improve how OSB analyzes and presents benchmark results, making it a more versatile tool for users in the OpenSearch community.

Motivation

Upon executing a test, OSB assigns a unique ID to each test execution result. The current implementation of the Compare API allows users to compare and analyze the results of two benchmark test executions by providing the ID of a test execution to serve as the baseline, along with the ID of a contender to be compared against the baseline. Users can obtain these test execution IDs using the opensearch-benchmark list test-executions command.

The following is an example of how the Compare API is invoked, along with its output.

$ opensearch-benchmark compare --baseline=729291a0-ee87-44e5-9b75-cc6d50c89702 --contender=a33845cc-c2e5-4488-a2db-b0670741ff9b
   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

Comparing baseline
  TestExecution ID: 729291a0-ee87-44e5-9b75-cc6d50c89702
  TestExecution timestamp: 2023-05-24 18:17:18 

with contender
  TestExecution ID: a33845cc-c2e5-4488-a2db-b0670741ff9b
  TestExecution timestamp: 2023-05-23 21:31:45

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
                                                  Metric    Baseline    Contender               Diff
--------------------------------------------------------  ----------  -----------  -----------------
                        Min Indexing Throughput [docs/s]       19501        19118  -383.00000
                     Median Indexing Throughput [docs/s]       20232      19927.5  -304.45833
                        Max Indexing Throughput [docs/s]       21172        20849  -323.00000
...
               Query latency term (50.0 percentile) [ms]     2.10049      2.15421    +0.05372
               Query latency term (90.0 percentile) [ms]     2.77537      2.84168    +0.06630
              Query latency term (100.0 percentile) [ms]     4.52081      5.15368    +0.63287

The comparison output shows each metric for the baseline and contender, along with the difference between the two tests. This is particularly useful when evaluating performance differences across test runs, OpenSearch versions, and configurations. The Compare API also provides additional command-line options, such as including specific percentiles in the comparison, exporting the comparison to different output formats, and appending the comparison to the results file.

However, the Compare API has limitations.

In performance testing, it is common practice to run the same test multiple times to account for variability and ensure more consistent results. This variability can arise from a number of factors, including random fluctuations in the test environment. By aggregating the results of multiple runs, users can obtain a more reliable and representative measure of performance, reducing the impact of outliers or random variations. Today, however, OSB offers no built-in way to aggregate results across test executions, and the Compare API only supports comparing two individual executions at a time.

Requirements

To address the limitations of the Compare API and to enhance the overall data processing experience in OSB, the following capabilities should be added.

Proposed Solutions

Introduce an aggregate command that combines the results of multiple test executions into a single aggregated result. For each metric, the aggregated value is computed as a weighted average, with each test execution weighted by its iteration count.

For example, if we have three test executions with the following median indexing throughput values and iteration counts:

- Test Execution 1: Median Indexing Throughput = 20,000 docs/s, Iterations = 1,000
- Test Execution 2: Median Indexing Throughput = 18,000 docs/s, Iterations = 2,000
- Test Execution 3: Median Indexing Throughput = 22,000 docs/s, Iterations = 1,500

The weighted average for median indexing throughput would be calculated as follows:

Weighted Sum = (20,000 * 1,000) + (18,000 * 2,000) + (22,000 * 1,500)
            = 20,000,000 + 36,000,000 + 33,000,000
            = 89,000,000

Total Iterations = 1,000 + 2,000 + 1,500 = 4,500

Weighted Average Median Indexing Throughput = Weighted Sum / Total Iterations
                                            = 89,000,000 / 4,500
                                            = 19,777.78 docs/s
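
As a quick sanity check, the same calculation can be sketched in a few lines of Python. The data structure and function below are purely illustrative and not part of OSB's codebase; they only mirror the arithmetic above.

# Illustrative sketch of the weighted-average aggregation described above.
# The shape of "executions" is hypothetical, not OSB's internal results format.
executions = [
    {"median_indexing_throughput": 20_000, "iterations": 1_000},
    {"median_indexing_throughput": 18_000, "iterations": 2_000},
    {"median_indexing_throughput": 22_000, "iterations": 1_500},
]

def weighted_average(executions, metric):
    # Weight each execution's metric value by its iteration count.
    weighted_sum = sum(e[metric] * e["iterations"] for e in executions)
    total_iterations = sum(e["iterations"] for e in executions)
    return weighted_sum / total_iterations

print(round(weighted_average(executions, "median_indexing_throughput"), 2))
# 19777.78 docs/s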

Example usage:

opensearch-benchmark aggregate --test-executions=<test_execution_id1>,<test_execution_id2>,...
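
For instance, reusing the two test-execution IDs from the comparison example above purely for illustration:

opensearch-benchmark aggregate --test-executions=729291a0-ee87-44e5-9b75-cc6d50c89702,a33845cc-c2e5-4488-a2db-b0670741ff9b

As with the Compare API, the IDs passed to --test-executions can be obtained via the opensearch-benchmark list test-executions command.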

Subsequent issues will be created to address these requirements further and elaborate on implementation details.

Stakeholders

Use Cases

How Can You Help?

Open Questions

  1. Are there any other output formats that would be useful besides Markdown, CSV, and JSON?
  2. Are there any other statistical metrics that would be valuable to include in the aggregated results?
  3. How should we handle potential inconsistencies in workload configurations when aggregating results from multiple test executions?

Next Steps

We will incorporate feedback and add more details on design, implementation and prototypes as they become available.

IanHoang commented 2 months ago

This will be a great addition to OpenSearch Benchmark, as it addresses several pain points that users have had for years. It will also diversify OSB's capabilities and open up new development opportunities.

To add to the second proposed priority: when validating whether the comparison can be performed, the compare feature should also determine whether the two IDs' test procedures (or scenarios) are different. Some things to also consider:

Overall, great RFC and am excited to see what comes out of this!