Spike - Indexer performance on different Data Persistence Model designs

AlexRuiz7 commented 3 months ago

Description

As part of the new Data Persistence Model to be implemented across Wazuh, we need to elaborate a performance analysis over different designs, in order to see how the indexer behaves on them.

The objective of this issue is to measure the performance of bulk requests for:

indexing
updates
deletions

on:

stateless stream indices, using rollover and alias. Stateless indices do not wait for complete (non-blocking request)
stateful indices, without rollover and alias. Stateful indices do wait for complete (blocking request)

given the following scenarios:

Single bulk request.
- 1x (big) bulk request with Stateless and Stateful data.

graph LR
    A[Server cluster] -->|Single bulk| B[Indexer cluster]

Per-module bulk requests.
- 1x (smaller) bulk request for Stateless data.
- 3x (smaller) bulk request for Stateful data (state_1, state_2, state_3).

graph LR
    A[Server cluster] -->|Stateless bulk| B[Indexer cluster]
    A[Server cluster] -->|state_1 bulk| B[Indexer cluster]
    A[Server cluster] -->|state_2 bulk| B[Indexer cluster]
    A[Server cluster] -->|state_3 bulk| B[Indexer cluster]

The goal is to discover which design performs better on a well-configured indexer cluster.

For the tests, we are considering mocking events for 5K agents, generating events of 1 KB maximum. The EPS for each of the indices is defined by the formula below:

n_agents = 5000
req_size = 1 KB
stateful = 1 EPS   * n_agents = 5000 EPS  (5 MB)
state_1  = 0.6 EPS * n_agents = 3000 EPS  (3 MB)
state_2  = 0.3 EPS * n_agents = 1500 EPS  (1.5 MB)
state_3  = 0.1 EPS * n_agents =  500 EPS  (0.5 MB)
                                          --------
Total / single bulk request                10 MB

Functional requirements

Measure the performance of bulk requests for indexing, updates and deletions on both scenarios.

Implementation restrictions

Both test scenario must run on:

3x Indexer nodes, having 8 CPUs, 16 GB of RAM and SSD storage each.
5,000 agents (mocked)
Bulk requests are performed every second.
EPS distribution of requests per type are:
- Stateless. 100 %
- 3 stateful
  - state_1 10 %
  - state_2 30 %
  - state_3 60 %
Each request weights 1 Kilobyte.
Indices refresh every 5 seconds.

Plan

Define measurement tooling (most likely OpenSearch Benchmark)
Define performance metrics.
Perform test on both scenarios (local).
Acquire the required infrastructure.
Perform tests on scenario A.
Perform tests on scenario B.
Results comparisons and conclusions.

AlexRuiz7 commented 3 months ago

OpenSearch Benchmark

Currently reading the docs to understand how OSB works. Some notes:

Running OSB on Docker has some limitations.
Seems like OSB is able to spawn OpenSearch nodes by its own.

from-sources: Builds and provisions OpenSearch, runs a benchmark, and then publishes the results. from-distribution: Downloads an OpenSearch distribution, provisions it, runs a benchmark, and then publishes the results. benchmark-only: The default pipeline. Assumes an already running OpenSearch instance, runs a benchmark on that instance, and then publishes the results.

I managed to run OSB locally using Pyenv.

Details

```console (.venv) @alex-GL66 ➜ opensearch-benchmark python3 -m venv (.venv) @alex-GL66 ➜ opensearch-benchmark pip install (.venv) @alex-GL66 ➜ opensearch-benchmark export JAVA17_HOME=/u (.venv) @alex-GL66 ➜ opensearch-benchmark opensearch-benchmark ____ _____ __ ____ / __ \____ ___ ____ / ___/___ ____ ___________/ /_ / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \ / /_/ / /_/ / __/ / / /__/ / __/ /_/ / / / /__/ / / / \____/ .___/\___/_/ /_/____/\___/\__,_/_/ \___/_/ /_/ /_/ [INFO] [Test Execution ID]: cf58479a-77c9-4694-8b88-bfee848cdfa6 [INFO] Preparing for test execution ... [INFO] Downloading OpenSearch 2.13.0 (844.4 MB total size) [INFO] Downloading workload data (191 bytes total size) [INFO] Decompressing workload data from [/home/alex/wazuh/opensear [INFO] Preparing file offset table for [/home/alex/wazuh/opensearc [INFO] Executing test with workload [percolator], test_procedure Running delete-index Running create-index Running check-cluster-health Running index Running refresh-after-index Running force-merge Running refresh-after-force-merge Running wait-until-merges-finish Running percolator_with_content_president_bush Running percolator_with_content_saddam_hussein Running percolator_with_content_hurricane_katrina Running percolator_with_content_google Running percolator_no_score_with_content_google Running percolator_with_highlighting Running percolator_with_content_ignore_me Running percolator_no_score_with_content_ignore_me ------------------------------------------------------ _______ __ _____ / ____(_)___ ____ _/ / / ___/_________ ________ / /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \ / __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/ /_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/ ------------------------------------------------------ | Metric | |---------------------------------------------------------------:| | Cumulative indexing time of primary shards | | Min cumulative indexing time across primary shards | | Median cumulative indexing time across primary shards | | Max cumulative indexing time across primary shards | | Cumulative indexing throttle time of primary shards | | Min cumulative indexing throttle time across primary shards | | Median cumulative indexing throttle time across primary shards | | Max cumulative indexing throttle time across primary shards | | Cumulative merge time of primary shards | | Cumulative merge count of primary shards | | Min cumulative merge time across primary shards | | Median cumulative merge time across primary shards | | Max cumulative merge time across primary shards | | Cumulative merge throttle time of primary shards | | Min cumulative merge throttle time across primary shards | | Median cumulative merge throttle time across primary shards | | Max cumulative merge throttle time across primary shards | | Cumulative refresh time of primary shards | | Cumulative refresh count of primary shards | | Min cumulative refresh time across primary shards | | Median cumulative refresh time across primary shards | | Max cumulative refresh time across primary shards | | Cumulative flush time of primary shards | | Cumulative flush count of primary shards | | Min cumulative flush time across primary shards | | Median cumulative flush time across primary shards | | Max cumulative flush time across primary shards | | Total Young Gen GC time | | Total Young Gen GC count | | Total Old Gen GC time | | Total Old Gen GC count | | Store size | | Translog size | | Heap used for segments | | Heap used for doc values | | Heap used for terms | | Heap used for norms | | Heap used for points | | Heap used for stored fields | | Segment count | | Min Throughput | | Mean Throughput | | Median Throughput | | Max Throughput | | 50th percentile latency | | 100th percentile latency | | 50th percentile service time | | 100th percentile service time | | error rate | | Min Throughput | | Mean Throughput | | Median Throughput | | Max Throughput | | 100th percentile latency | | 100th percentile service time | | error rate | | Min Throughput | | Mean Throughput | | Median Throughput | | Max Throughput | | 100th percentile latency | | 100th percentile service time | | error rate | | Min Throughput | | Mean Throughput | | Median Throughput | | Max Throughput | | 100th percentile latency | | 100th percentile service time | | error rate | | Min Throughput | | Mean Throughput | | Median Throughput | | Max Throughput | | 100th percentile latency | | 100th percentile service time | | error rate | | Min Throughput | | Mean Throughput | | Median Throughput | | Max Throughput | | 100th percentile latency | | 100th percentile service time | | error rate | | Min Throughput | | Mean Throughput | | Median Throughput | | Max Throughput | | 100th percentile latency | | 100th percentile service time | | error rate | | Min Throughput | | Mean Throughput | | Median Throughput | | Max Throughput | | 100th percentile latency | | 100th percentile service time | | error rate | | Min Throughput | | Mean Throughput | | Median Throughput | | Max Throughput | | 100th percentile latency | | 100th percentile service time | | error rate | | Min Throughput | Mean Throughput | Median Throughput | Max Throughput | 100th percentile | 100th percentile service | error -------------------------------- [INFO] SUCCESS (took 89 seconds) -------------------------------- ``` .venv; source .venv/bin/activate opensearch-benchmark sr/lib/jvm/temurin-17-jdk-amd64 execute-test --distribution-version=2.13.0 --workload percolator --test-mode __ __ / __ )___ ____ _____/ /_ ____ ___ ____ ______/ /__ / __ / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/ / /_/ / __/ / / / /__/ / / / / / / / / /_/ / / / ,< /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/ /_/|_| [100%] [100.0%] ch-benchmark/.benchmark/benchmarks/data/percolator/queries-2-1k.json.bz2] to [/home/alex/wazuh/opensearch-benchmark/.benchmark/benchmarks/data/percolator/queries-2-1k.json] ... [OK] h-benchmark/.benchmark/benchmarks/data/percolator/queries-2-1k.json] ... [OK] [append-no-conflicts] and provision_config_instance ['defaults'] with version [2.13.0]. [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] [100% done] Task | Value | Unit | -------------------------------------------:|------------:|-------:| | 0.0122667 | min | | 0 | min | | 0.00209167 | min | | 0.004 | min | | 0 | min | | 0 | min | | 0 | min | | 0 | min | | 0 | min | | 0 | | | 0 | min | | 0 | min | | 0 | min | | 0 | min | | 0 | min | | 0 | min | | 0 | min | | 0.00226667 | min | | 30 | | | 0 | min | | 0.000358333 | min | | 0.0007 | min | | 0 | min | | 0 | | | 0 | min | | 0 | min | | 0 | min | | 0 | s | | 0 | | | 0 | s | | 0 | | | 4.31528e-05 | GB | | 3.07336e-07 | GB | | 0 | MB | | 0 | MB | | 0 | MB | | 0 | MB | | 0 | MB | | 0 | MB | | 22 | | index | 10299.1 | docs/s | index | 10299.1 | docs/s | index | 10299.1 | docs/s | index | 10299.1 | docs/s | index | 81.7099 | ms | index | 91.5731 | ms | index | 81.7099 | ms | index | 91.5731 | ms | index | 0 | % | wait-until-merges-finish | 72.58 | ops/s | wait-until-merges-finish | 72.58 | ops/s | wait-until-merges-finish | 72.58 | ops/s | wait-until-merges-finish | 72.58 | ops/s | wait-until-merges-finish | 13.1389 | ms | wait-until-merges-finish | 13.1389 | ms | wait-until-merges-finish | 0 | % | percolator_with_content_president_bush | 32.24 | ops/s | percolator_with_content_president_bush | 32.24 | ops/s | percolator_with_content_president_bush | 32.24 | ops/s | percolator_with_content_president_bush | 32.24 | ops/s | percolator_with_content_president_bush | 37.6739 | ms | percolator_with_content_president_bush | 6.40732 | ms | percolator_with_content_president_bush | 0 | % | percolator_with_content_saddam_hussein | 115.68 | ops/s | percolator_with_content_saddam_hussein | 115.68 | ops/s | percolator_with_content_saddam_hussein | 115.68 | ops/s | percolator_with_content_saddam_hussein | 115.68 | ops/s | percolator_with_content_saddam_hussein | 14.9318 | ms | percolator_with_content_saddam_hussein | 5.95973 | ms | percolator_with_content_saddam_hussein | 0 | % | percolator_with_content_hurricane_katrina | 84.38 | ops/s | percolator_with_content_hurricane_katrina | 84.38 | ops/s | percolator_with_content_hurricane_katrina | 84.38 | ops/s | percolator_with_content_hurricane_katrina | 84.38 | ops/s | percolator_with_content_hurricane_katrina | 18.1493 | ms | percolator_with_content_hurricane_katrina | 5.96843 | ms | percolator_with_content_hurricane_katrina | 0 | % | percolator_with_content_google | 47.06 | ops/s | percolator_with_content_google | 47.06 | ops/s | percolator_with_content_google | 47.06 | ops/s | percolator_with_content_google | 47.06 | ops/s | percolator_with_content_google | 27.8973 | ms | percolator_with_content_google | 6.37702 | ms | percolator_with_content_google | 0 | % | percolator_no_score_with_content_google | 101.72 | ops/s | percolator_no_score_with_content_google | 101.72 | ops/s | percolator_no_score_with_content_google | 101.72 | ops/s | percolator_no_score_with_content_google | 101.72 | ops/s | percolator_no_score_with_content_google | 17.8059 | ms | percolator_no_score_with_content_google | 7.73091 | ms | percolator_no_score_with_content_google | 0 | % | percolator_with_highlighting | 81.3 | ops/s | percolator_with_highlighting | 81.3 | ops/s | percolator_with_highlighting | 81.3 | ops/s | percolator_with_highlighting | 81.3 | ops/s | percolator_with_highlighting | 20.5377 | ms | percolator_with_highlighting | 7.81483 | ms | percolator_with_highlighting | 0 | % | percolator_with_content_ignore_me | 17.47 | ops/s | percolator_with_content_ignore_me | 17.47 | ops/s | percolator_with_content_ignore_me | 17.47 | ops/s | percolator_with_content_ignore_me | 17.47 | ops/s | percolator_with_content_ignore_me | 85.7778 | ms | percolator_with_content_ignore_me | 28.0983 | ms | percolator_with_content_ignore_me | 0 | % | | percolator_no_score_with_content_ignore_me | 54.39 | ops/s | | percolator_no_score_with_content_ignore_me | 54.39 | ops/s | | percolator_no_score_with_content_ignore_me | 54.39 | ops/s | | percolator_no_score_with_content_ignore_me | 54.39 | ops/s | latency | percolator_no_score_with_content_ignore_me | 26.549 | ms | time | percolator_no_score_with_content_ignore_me | 7.92226 | ms | rate | percolator_no_score_with_content_ignore_me | 0 | % |

We'll work on creating a Vagrant environment with 3 OpenSearch nodes and OpenSearch Benchmark installed on each of them to perform the tests. See Running distributed loads.

AlexRuiz7 commented 3 months ago

Update

We have generated a Vagrant environment that sets up a cluster with 3 nodes of OpenSearch v2.14.0 + OpenSearch Benchmark 1.6.0. They are configured among themselves to make use of distributed load and load balancing.

The command in the OpenSearch documentation is not correct, but we have fixed it.

opensearch-benchmark execute-test --pipeline=benchmark-only --workload=eventdata --load-worker-coordinator-hosts=node-2,node-3 --target-hosts=node-1 --kill-running-processes

Even so, we have not managed to make it work, as it fails after the following error:

2024-06-13 10:13:11,377 -not-actor-/PID:3267 osbenchmark.benchmark ERROR Cannot run subcommand [execute-test].

We then realized this mode is useful for big cluster with massive loads that reach max CPU usage on the host running OSB (+3 nodes). This mode allows dedicating some nodes of the cluster to distribute the workload generation (not the processing). We fall back to regular mode. Fortunately, we can reuse the Vagrantfile.

https://github.com/opensearch-project/opensearch-benchmark/issues/258

AlexRuiz7 commented 3 months ago

Running a workload

With this command, we can run the default http_logs workload. This workload mixes ingest, update and search queries.

[!NOTE] This operation is time consuming.

opensearch-benchmark execute-test --pipeline=benchmark-only --workload=http_logs --target-host=https://localhost:9200 --client-options=basic_auth_user:admin,basic_auth_password:"${OPENSEARCH_INITIAL_ADMIN_PASSWORD}",verify_certs:false

Creating a custom workload

There are 2 ways of creating custom workloads:

From an existing cluster with indexed data: OSB can automatically generate a custom workload using indexed data (at least 1k documents). The workload has to be configured manually once created.

opensearch-benchmark create-workload \
--workload="<WORKLOAD NAME>" \
--target-hosts="<CLUSTER ENDPOINT>" \
--client-options="basic_auth_user:'<USERNAME>',basic_auth_password:'<PASSWORD>'" \
--indices="<INDEXES TO GENERATE WORKLOAD FROM>" \
--output-path="<LOCAL DIRECTORY PATH TO STORE WORKLOAD>"

From scratch. The workload generation and its configuration is manual. We have tried to follow the example in the docs without success.

opensearch-benchmark execute-test \
--pipeline="benchmark-only" \
--workload-path="./workload" \
--target-host="https://localhost:9200/" \
--client-options="basic_auth_user:'admin',basic_auth_password:'Bc9ZyWSBu19[BK#6MBgbJ98Tofv)Vsw',verify_certs:false"

f-galland commented 3 months ago

I tested creating a workload from an existing cluster, for which I used a test AIO deployment with real world data.

I used this command:

opensearch-benchmark create-workload \
--workload="wazuh-test" \
--target-hosts="https://localhost:9200" \
--client-options="basic_auth_user:'admin',basic_auth_password:'admin',verify_certs:false" \
--indices="wazuh-alerts-4.x-2024.04.22" \
--output-path="./wazuh-workload"

I then run the test:

opensearch-benchmark execute-test \
--pipeline="benchmark-only" \
--workload-path="./wazuh-workload/wazuh-test" \
--target-host="https://localhost:9200" \
--client-options="basic_auth_user:'admin',basic_auth_password:'admin',verify_certs:false"

Below is the result:

# ./run_custom_workload.sh 

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: 586e8225-db0d-4f26-bcb8-ce616f6b8ec6
[INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.
[INFO] Executing test with workload [wazuh-test], test_procedure [default-test-procedure] and provision_config_instance ['external'] with version [7.10.2].

[WARNING] merges_total_time is 93391 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 59258 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 559684 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 21547 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running cluster-health                                                         [100% done]
Running index-append                                                           [100% done]
Running refresh-after-index                                                    [100% done]
Running force-merge                                                            [100% done]
Running refresh-after-force-merge                                              [100% done]
Running wait-until-merges-finish                                               [100% done]
Running match-all                                                              [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |                     Task |      Value |   Unit |
|---------------------------------------------------------------:|-------------------------:|-----------:|-------:|
|                     Cumulative indexing time of primary shards |                          |   0.985733 |    min |
|             Min cumulative indexing time across primary shards |                          |          0 |    min |
|          Median cumulative indexing time across primary shards |                          |          0 |    min |
|             Max cumulative indexing time across primary shards |                          |   0.142783 |    min |
|            Cumulative indexing throttle time of primary shards |                          |          0 |    min |
|    Min cumulative indexing throttle time across primary shards |                          |          0 |    min |
| Median cumulative indexing throttle time across primary shards |                          |          0 |    min |
|    Max cumulative indexing throttle time across primary shards |                          |          0 |    min |
|                        Cumulative merge time of primary shards |                          |    1.56205 |    min |
|                       Cumulative merge count of primary shards |                          |       6115 |        |
|                Min cumulative merge time across primary shards |                          |          0 |    min |
|             Median cumulative merge time across primary shards |                          |          0 |    min |
|                Max cumulative merge time across primary shards |                          |   0.178017 |    min |
|               Cumulative merge throttle time of primary shards |                          |          0 |    min |
|       Min cumulative merge throttle time across primary shards |                          |          0 |    min |
|    Median cumulative merge throttle time across primary shards |                          |          0 |    min |
|       Max cumulative merge throttle time across primary shards |                          |          0 |    min |
|                      Cumulative refresh time of primary shards |                          |    9.35108 |    min |
|                     Cumulative refresh count of primary shards |                          |      57555 |        |
|              Min cumulative refresh time across primary shards |                          |          0 |    min |
|           Median cumulative refresh time across primary shards |                          |          0 |    min |
|              Max cumulative refresh time across primary shards |                          |    1.29032 |    min |
|                        Cumulative flush time of primary shards |                          |     0.3596 |    min |
|                       Cumulative flush count of primary shards |                          |        726 |        |
|                Min cumulative flush time across primary shards |                          |          0 |    min |
|             Median cumulative flush time across primary shards |                          |          0 |    min |
|                Max cumulative flush time across primary shards |                          |   0.122833 |    min |
|                                        Total Young Gen GC time |                          |      0.014 |      s |
|                                       Total Young Gen GC count |                          |          1 |        |
|                                          Total Old Gen GC time |                          |          0 |      s |
|                                         Total Old Gen GC count |                          |          0 |        |
|                                                     Store size |                          |   0.116531 |     GB |
|                                                  Translog size |                          | 0.00978717 |     GB |
|                                         Heap used for segments |                          |          0 |     MB |
|                                       Heap used for doc values |                          |          0 |     MB |
|                                            Heap used for terms |                          |          0 |     MB |
|                                            Heap used for norms |                          |          0 |     MB |
|                                           Heap used for points |                          |          0 |     MB |
|                                    Heap used for stored fields |                          |          0 |     MB |
|                                                  Segment count |                          |        722 |        |
|                                                 Min Throughput |             index-append |    6846.26 | docs/s |
|                                                Mean Throughput |             index-append |    6846.26 | docs/s |
|                                              Median Throughput |             index-append |    6846.26 | docs/s |
|                                                 Max Throughput |             index-append |    6846.26 | docs/s |
|                                        50th percentile latency |             index-append |    240.359 |     ms |
|                                       100th percentile latency |             index-append |    243.049 |     ms |
|                                   50th percentile service time |             index-append |    240.359 |     ms |
|                                  100th percentile service time |             index-append |    243.049 |     ms |
|                                                     error rate |             index-append |          0 |      % |
|                                                 Min Throughput | wait-until-merges-finish |      24.51 |  ops/s |
|                                                Mean Throughput | wait-until-merges-finish |      24.51 |  ops/s |
|                                              Median Throughput | wait-until-merges-finish |      24.51 |  ops/s |
|                                                 Max Throughput | wait-until-merges-finish |      24.51 |  ops/s |
|                                       100th percentile latency | wait-until-merges-finish |    37.1372 |     ms |
|                                  100th percentile service time | wait-until-merges-finish |    37.1372 |     ms |
|                                                     error rate | wait-until-merges-finish |          0 |      % |
|                                                 Min Throughput |                match-all |       3.02 |  ops/s |
|                                                Mean Throughput |                match-all |       3.03 |  ops/s |
|                                              Median Throughput |                match-all |       3.03 |  ops/s |
|                                                 Max Throughput |                match-all |       3.05 |  ops/s |
|                                        50th percentile latency |                match-all |    6.85162 |     ms |
|                                        90th percentile latency |                match-all |    7.55348 |     ms |
|                                        99th percentile latency |                match-all |    8.55737 |     ms |
|                                       100th percentile latency |                match-all |    9.84485 |     ms |
|                                   50th percentile service time |                match-all |    4.84304 |     ms |
|                                   90th percentile service time |                match-all |    5.46714 |     ms |
|                                   99th percentile service time |                match-all |    6.36302 |     ms |
|                                  100th percentile service time |                match-all |    7.92025 |     ms |
|                                                     error rate |                match-all |          0 |      % |

---------------------------------
[INFO] SUCCESS (took 109 seconds)
---------------------------------

f-galland commented 3 months ago

It looks like we can indeed run tasks concurrently, using the clients and parallel keywords.

f-galland commented 3 months ago

Using the method to create a custom workload described above, I created a workload with the following test procedures:

root@os-benchmarks:~# cat benchmarks/wazuh-alerts/test_procedures/default.json
{
  "name": "parallel-any",
  "description": "Workload completed-by property",
  "schedule": [
    {
      "parallel": {
        "tasks": [
          {
            "name": "parellel-task-1",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 1000
            },
            "clients": 100
          },
          {
            "name": "parellel-task-2",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 1000
            },
            "clients": 100
          }
        ]
      }
    }
  ]
}

This was run with the following docker environment:

services:

  opensearch-benchmark:
    image: opensearchproject/opensearch-benchmark:1.6.0
    hostname: opensearch-benchmark
    depends_on:
      opensearch-node1:
        condition: service_healthy
      permissions-setter:
        condition: service_completed_successfully
    container_name: opensearch-benchmark
    volumes:
      - ./benchmarks:/opensearch-benchmark/.benchmark
    environment:
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
        #command: execute-test --target-hosts https://opensearch-node1:9200 --pipeline benchmark-only --workload geonames --client-options basic_auth_user:admin,basic_auth_password:${OPENSEARCH_INITIAL_ADMIN_PASSWORD},verify_certs:false --test-mode
    command: execute-test --pipeline="benchmark-only" --workload-path="/opensearch-benchmark/.benchmark/wazuh-alerts" --target-host="https://opensearch-node1:9200" --client-options="basic_auth_user:admin,basic_auth_password:${OPENSEARCH_INITIAL_ADMIN_PASSWORD},verify_certs:false"

    networks:
      - opensearch-net

  opensearch-node1: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/)
    image: opensearchproject/opensearch:2.14.0
    container_name: opensearch-node1
    hostname: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node1 # Name the node that will run in this container
      - discovery.seed_hosts=opensearch-node1,opensearch-node2 # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2 # Nodes eligibile to serve as cluster manager
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD} # Sets the demo admin user password when using demo configuration (for OpenSearch 2.12 and later)
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
    healthcheck:
      test: curl -sku admin:${OPENSEARCH_INITIAL_ADMIN_PASSWORD} https://localhost:9200/_cat/health | grep -q opensearch-cluster
      start_period: 10s
      start_interval: 3s
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network

  opensearch-node2:
    image: opensearchproject/opensearch:2.14.0 # This should be the same image used for opensearch-node1 to avoid issues
    container_name: opensearch-node2
    hostname: opensearch-node2
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data
    networks:
      - opensearch-net

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.14.0 # Make sure the version of opensearch-dashboards matches the version of opensearch installed on other nodes
    container_name: opensearch-dashboards
    depends_on:
      opensearch-node1:
        condition: service_healthy
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch-node1:9200","https://opensearch-node2:9200"]' # Define the OpenSearch nodes that OpenSearch Dashboards will query
    networks:
      - opensearch-net

  permissions-setter:
    image: alpine:3.14
    container_name: permissions-setter
    volumes:
      - ./benchmarks:/benchmark
    entrypoint: /bin/sh
    command: >
      -c '
        chmod -R a+rw /benchmark
      '

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:

Below are the results of the test:

  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: 41430721-7ced-41b4-b363-8eaf19f73221
[INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.
[WARNING] refresh_total_time is 6 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running parellel-task-2,parellel-task-1                                        [100% done][INFO] Executing test with workload [wazuh-alerts], test_procedure [parallel-any] and provision_config_instance ['external'] with version [2.14.0].

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |            Task |       Value |   Unit |
|---------------------------------------------------------------:|----------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |                 |      0.0863 |    min |
|             Min cumulative indexing time across primary shards |                 |           0 |    min |
|          Median cumulative indexing time across primary shards |                 |           0 |    min |
|             Max cumulative indexing time across primary shards |                 |      0.0863 |    min |
|            Cumulative indexing throttle time of primary shards |                 |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |                 |           0 |    min |
| Median cumulative indexing throttle time across primary shards |                 |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |                 |           0 |    min |
|                        Cumulative merge time of primary shards |                 |           0 |    min |
|                       Cumulative merge count of primary shards |                 |           0 |        |
|                Min cumulative merge time across primary shards |                 |           0 |    min |
|             Median cumulative merge time across primary shards |                 |           0 |    min |
|                Max cumulative merge time across primary shards |                 |           0 |    min |
|               Cumulative merge throttle time of primary shards |                 |           0 |    min |
|       Min cumulative merge throttle time across primary shards |                 |           0 |    min |
|    Median cumulative merge throttle time across primary shards |                 |           0 |    min |
|       Max cumulative merge throttle time across primary shards |                 |           0 |    min |
|                      Cumulative refresh time of primary shards |                 |      0.0001 |    min |
|                     Cumulative refresh count of primary shards |                 |          77 |        |
|              Min cumulative refresh time across primary shards |                 |           0 |    min |
|           Median cumulative refresh time across primary shards |                 |           0 |    min |
|              Max cumulative refresh time across primary shards |                 | 8.33333e-05 |    min |
|                        Cumulative flush time of primary shards |                 |           0 |    min |
|                       Cumulative flush count of primary shards |                 |           0 |        |
|                Min cumulative flush time across primary shards |                 |           0 |    min |
|             Median cumulative flush time across primary shards |                 |           0 |    min |
|                Max cumulative flush time across primary shards |                 |           0 |    min |
|                                        Total Young Gen GC time |                 |       0.086 |      s |
|                                       Total Young Gen GC count |                 |           7 |        |
|                                          Total Old Gen GC time |                 |           0 |      s |
|                                         Total Old Gen GC count |                 |           0 |        |
|                                                     Store size |                 |   0.0571623 |     GB |
|                                                  Translog size |                 |   0.0342067 |     GB |
|                                         Heap used for segments |                 |           0 |     MB |
|                                       Heap used for doc values |                 |           0 |     MB |
|                                            Heap used for terms |                 |           0 |     MB |
|                                            Heap used for norms |                 |           0 |     MB |
|                                           Heap used for points |                 |           0 |     MB |
|                                    Heap used for stored fields |                 |           0 |     MB |
|                                                  Segment count |                 |          73 |        |
|                                                 Min Throughput | parellel-task-1 |      6930.8 | docs/s |
|                                                Mean Throughput | parellel-task-1 |      6930.8 | docs/s |
|                                              Median Throughput | parellel-task-1 |      6930.8 | docs/s |
|                                                 Max Throughput | parellel-task-1 |      6930.8 | docs/s |
|                                        50th percentile latency | parellel-task-1 |     963.212 |     ms |
|                                       100th percentile latency | parellel-task-1 |      1050.9 |     ms |
|                                   50th percentile service time | parellel-task-1 |     963.212 |     ms |
|                                  100th percentile service time | parellel-task-1 |      1050.9 |     ms |
|                                                     error rate | parellel-task-1 |           0 |      % |
|                                                 Min Throughput | parellel-task-2 |      752.03 | docs/s |
|                                                Mean Throughput | parellel-task-2 |      752.03 | docs/s |
|                                              Median Throughput | parellel-task-2 |      752.03 | docs/s |
|                                                 Max Throughput | parellel-task-2 |      752.03 | docs/s |
|                                        50th percentile latency | parellel-task-2 |     991.137 |     ms |
|                                       100th percentile latency | parellel-task-2 |     1094.09 |     ms |
|                                   50th percentile service time | parellel-task-2 |     991.137 |     ms |
|                                  100th percentile service time | parellel-task-2 |     1094.09 |     ms |
|                                                     error rate | parellel-task-2 |           0 |      % |

--------------------------------
[INFO] SUCCESS (took 16 seconds)
--------------------------------

The clients and bulk-size parameters seem to correlate with the actual amount of data being indexed:

root@os-benchmarks:~# curl -ku admin:Secret.Password.1234 https://localhost:9200/_cat/indices?s=store.size
green open .opensearch-observability    VEodRP5XRWCaUTyIxp947g 1 1     0 0    416b    208b
green open .ql-datasources              kKh4Hp4HQeaM17jF-h4ZFg 1 1     0 0    416b    208b
green open .plugins-ml-config           KqwdywM0QpGBHGDBgxTsPA 1 1     1 0   7.8kb   3.9kb
green open .kibana_92668751_admin_1     75yGMhz_S42b_lxukdV5zA 1 1     1 0  10.3kb   5.1kb
green open .kibana_1                    ehjT55saT0S_2ragBN9O_g 1 1     1 0  10.3kb   5.1kb
green open .opendistro_security         tHeF1aZ6SImXyyB_TGVeDA 1 1    10 0  97.8kb  48.9kb
green open security-auditlog-2024.06.12 iGBt812KR4CMYZR0WAlprA 1 1    55 0 143.3kb  63.9kb
green open queries                      WaVFFYN-QDyU4WCO-kDdPA 5 0  1000 0 196.2kb 196.2kb
green open security-auditlog-2024.06.25 9QI-2iC7QM2ocX6LvfHHBw 1 1   263 0 530.9kb 274.2kb
green open wazuh-alerts-4.x-2024.05     ji1Q8AvcQHePD2LeSSbRDg 1 1 31480 0  47.6mb  22.9mb

f-galland commented 3 months ago

An OpenSearch Benchmark workload can run various types of operations.

The bulk operation seems to be the only one at the document level, and I cannot find a mentions of it being capable of bulk-updating or bulk-deleting documents.

There still might be a way to achieve this since operations like force-merge are mentioned throughout the documentation, but there is no reference for it.

f-galland commented 3 months ago

The nyc taxis sample workload seems to include an update operation.

f-galland commented 3 months ago

New operation-types can be defined as functions in a workload.py file.

Here is an example of a reindex operation being referenced in a test procedure, and the corresponding definition of the custom operation:

f-galland commented 3 months ago

I tried creating a workload that uses the bulk operation-type with metadata indicating the action (indexing, deleting, updating documents).

My corpora looks as follows:

test.json

root@os-benchmarks:~/benchmarks/wazuh-alerts# cat test.json
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e5" } }
{"predecoder":{"hostname":"jenkins","program_name":"smbd","timestamp":"May  2 11:45:54"},"agent":{"ip":"192.168.56.1","name":"jenkins","id":"014"},"manager":{"name":"manager"},"data":{"dstuser":"nobody"},"rule":{"firedtimes":786,"mail":false,"level":3,"pci_dss":["10.2.5"],"hipaa":["164.312.b"],"tsc":["CC6.8","CC7.2","CC7.3"],"description":"PAM: Login session closed.","groups":["pam","syslog"],"id":"5502","nist_800_53":["AU.14","AC.7"],"gpg13":["7.8","7.9"],"gdpr":["IV_32.2"]},"decoder":{"parent":"pam","name":"pam"},"full_log":"May  2 11:45:54 jenkins smbd: pam_unix(samba:session): session closed for user nobody","input":{"type":"log"},"@timestamp":"2024-05-02T11:45:55.891-03:00","location":"/var/log/auth.log","id":"1714661155.429268","timestamp":"2024-05-02T11:45:55.891-0300"}
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e6" } }
{"predecoder":{"hostname":"jenkins","program_name":"smbd","timestamp":"May  2 11:45:56"},"agent":{"ip":"192.168.56.1","name":"jenkins","id":"014"},"manager":{"name":"manager"},"data":{"dstuser":"nobody"},"rule":{"firedtimes":787,"mail":false,"level":3,"pci_dss":["10.2.5"],"hipaa":["164.312.b"],"tsc":["CC6.8","CC7.2","CC7.3"],"description":"PAM: Login session closed.","groups":["pam","syslog"],"id":"5502","nist_800_53":["AU.14","AC.7"],"gpg13":["7.8","7.9"],"gdpr":["IV_32.2"]},"decoder":{"parent":"pam","name":"pam"},"full_log":"May  2 11:45:56 jenkins smbd: pam_unix(samba:session): session closed for user nobody","input":{"type":"log"},"@timestamp":"2024-05-02T11:45:57.894-03:00","location":"/var/log/auth.log","id":"1714661157.429646","timestamp":"2024-05-02T11:45:57.894-0300"}
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e7" } }
{"predecoder":{"hostname":"jenkins","program_name":"smbd","timestamp":"May  2 11:45:58"},"agent":{"ip":"192.168.56.1","name":"jenkins","id":"014"},"manager":{"name":"manager"},"data":{"dstuser":"nobody"},"rule":{"firedtimes":788,"mail":false,"level":3,"pci_dss":["10.2.5"],"hipaa":["164.312.b"],"tsc":["CC6.8","CC7.2","CC7.3"],"description":"PAM: Login session closed.","groups":["pam","syslog"],"id":"5502","nist_800_53":["AU.14","AC.7"],"gpg13":["7.8","7.9"],"gdpr":["IV_32.2"]},"decoder":{"parent":"pam","name":"pam"},"full_log":"May  2 11:45:58 jenkins smbd: pam_unix(samba:session): session closed for user nobody","input":{"type":"log"},"@timestamp":"2024-05-02T11:45:59.896-03:00","location":"/var/log/auth.log","id":"1714661159.430024","timestamp":"2024-05-02T11:45:59.896-0300"}
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e8" } }
{"predecoder":{"hostname":"jenkins","program_name":"smbd","timestamp":"May  2 11:46:13"},"agent":{"ip":"192.168.56.1","name":"jenkins","id":"014"},"manager":{"name":"manager"},"data":{"dstuser":"nobody"},"rule":{"firedtimes":795,"mail":false,"level":3,"pci_dss":["10.2.5"],"hipaa":["164.312.b"],"tsc":["CC6.8","CC7.2","CC7.3"],"description":"PAM: Login session closed.","groups":["pam","syslog"],"id":"5502","nist_800_53":["AU.14","AC.7"],"gpg13":["7.8","7.9"],"gdpr":["IV_32.2"]},"decoder":{"parent":"pam","name":"pam"},"full_log":"May  2 11:46:13 jenkins smbd: pam_unix(samba:session): session closed for user nobody","input":{"type":"log"},"@timestamp":"2024-05-02T11:46:13.912-03:00","location":"/var/log/auth.log","id":"1714661173.432670","timestamp":"2024-05-02T11:46:13.912-0300"}
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e9" } }
{"predecoder":{"hostname":"jenkins","program_name":"smbd","timestamp":"May  2 11:46:15"},"agent":{"ip":"192.168.56.1","name":"jenkins","id":"014"},"manager":{"name":"manager"},"data":{"dstuser":"nobody"},"rule":{"firedtimes":796,"mail":false,"level":3,"pci_dss":["10.2.5"],"hipaa":["164.312.b"],"tsc":["CC6.8","CC7.2","CC7.3"],"description":"PAM: Login session closed.","groups":["pam","syslog"],"id":"5502","nist_800_53":["AU.14","AC.7"],"gpg13":["7.8","7.9"],"gdpr":["IV_32.2"]},"decoder":{"parent":"pam","name":"pam"},"full_log":"May  2 11:46:15 jenkins smbd: pam_unix(samba:session): session closed for user nobody","input":{"type":"log"},"@timestamp":"2024-05-02T11:46:15.914-03:00","location":"/var/log/auth.log","id":"1714661175.433048","timestamp":"2024-05-02T11:46:15.914-0300"

workload.json

{% import "benchmark.helpers" as benchmark with context %}
{
  "version": 2,
  "description": "Tracker-generated workload for wazuh-alerts",
  "indices": [
    {
      "name": "wazuh-alerts-4.x-2024.05",
      "body": "wazuh-alerts-4.x-2024.05.json"
    }
  ],
  "corpora": [
    {
      "name": "wazuh-alerts-4.x-2024.05",
      "documents": [
        {
          "target-index": "wazuh-alerts-4.x-2024.05",
          "source-file": "test.json",
          "document-count": 10
        }
      ]
    }
  ],
  "operations": [
    {{ benchmark.collect(parts="operations/*.json") }}
  ],
  "test_procedures": [
    {{ benchmark.collect(parts="test_procedures/*.json") }}
  ]
}

test_procedures/default.json

{
    "name": "index-append",
    "operation-type": "bulk",
    "bulk-size": 5,
    "action-metadata-present": true,
    "ingest-percentage": 100
},
{
    "name": "wait-until-merges-finish",
    "operation-type": "index-stats",
    "index": "_all",
    "condition": {
      "path": "_all.total.merges.current",
      "expected-value": 0
    },
    "retry-until-success": true,
    "include-in-reporting": false
},
{
    "name": "match-all",
    "operation-type": "search",
    "index": "wazuh-alerts-4.x-2024.05",
    "body": {
        "size": 10,
        "query": {
            "match_all": {}
        }
    }
}

I set the action-metadata-present field to true in the above test procedure's bulk operation based on the following comment from the opensearch-benchmark code:

* ``action_metadata_present``: if ``True``, assume that an action and metadata line is present (meaning only half of the lines contain actual documents to index)

This resulted in the "metadata" lines of the test.json corpora to be indexed as if they were regular documents.

After closer inspection, I realized that the action_metadata_present flag doesn't really change the actual metadata used in the bulk operations in the code.

We need to determine whether we are to:

Modify opensearch-benchmark's code to add the feature
Create a workload.py in our benchmark that registers a runner that allows for mixed bulk operations (not just indexing)
Create our own benchmarking solution.

f-galland commented 3 months ago

I run two benchmarks on indexing operations only. These were run on the docker compose shared in a previous comment with a 3 node opensearch cluster.

The test_procedures were defined as follows:

{
  "name": "single-bulk",
  "description": "Customized test procedure with a single bulk request indexing 10k wazuh-alerts documents.",
  "schedule": [
    {
      "operation": {
        "name": "single-bulk-index-task",
        "operation-type": "bulk",
        "bulk-size": 10000
      }
    }
  ]
}

{
  "name": "parallel-any",
  "description": "Customized test procedure with a parallel bulk requests indexing 5k, 3k, 1.5k and 0.5k wazuh-alerts documents in parallel bulks.",
  "schedule": [
    {
      "parallel": {
        "tasks": [
          {
            "name": "5k-events-task",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 5000
            },
            "clients": 1
          },
          {
            "name": "3k-events-task",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 3000
            },
            "clients": 1
          },
          {
            "name": "1.5k-events-task",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 1500
            },
            "clients": 1
          },
          {
            "name": "0.5k-events-task",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 500
            },
            "clients": 1
          }
        ]
      }
    }
  ]
}

Results

Single 10k bulk:

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: 6e80dbbd-f96b-4fe9-b685-0b63710abb0e
[INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.
[WARNING] indexing_total_time is 42 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 339 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running single-bulk-index-task                                                 [100% done]
[INFO] Executing test with workload [wazuh-alerts-single-bulk], test_procedure [default-test-procedure] and provision_config_instance ['external'] with version [2.14.0].

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |                   Task |      Value |   Unit |
|---------------------------------------------------------------:|-----------------------:|-----------:|-------:|
|                     Cumulative indexing time of primary shards |                        |    0.03075 |    min |
|             Min cumulative indexing time across primary shards |                        |          0 |    min |
|          Median cumulative indexing time across primary shards |                        |    0.00035 |    min |
|             Max cumulative indexing time across primary shards |                        |    0.02785 |    min |
|            Cumulative indexing throttle time of primary shards |                        |          0 |    min |
|    Min cumulative indexing throttle time across primary shards |                        |          0 |    min |
| Median cumulative indexing throttle time across primary shards |                        |          0 |    min |
|    Max cumulative indexing throttle time across primary shards |                        |          0 |    min |
|                        Cumulative merge time of primary shards |                        | 0.00353333 |    min |
|                       Cumulative merge count of primary shards |                        |          5 |        |
|                Min cumulative merge time across primary shards |                        |          0 |    min |
|             Median cumulative merge time across primary shards |                        |          0 |    min |
|                Max cumulative merge time across primary shards |                        | 0.00353333 |    min |
|               Cumulative merge throttle time of primary shards |                        |          0 |    min |
|       Min cumulative merge throttle time across primary shards |                        |          0 |    min |
|    Median cumulative merge throttle time across primary shards |                        |          0 |    min |
|       Max cumulative merge throttle time across primary shards |                        |          0 |    min |
|                      Cumulative refresh time of primary shards |                        |     0.0314 |    min |
|                     Cumulative refresh count of primary shards |                        |        127 |        |
|              Min cumulative refresh time across primary shards |                        |          0 |    min |
|           Median cumulative refresh time across primary shards |                        |   0.002825 |    min |
|              Max cumulative refresh time across primary shards |                        |  0.0139333 |    min |
|                        Cumulative flush time of primary shards |                        |          0 |    min |
|                       Cumulative flush count of primary shards |                        |          0 |        |
|                Min cumulative flush time across primary shards |                        |          0 |    min |
|             Median cumulative flush time across primary shards |                        |          0 |    min |
|                Max cumulative flush time across primary shards |                        |          0 |    min |
|                                        Total Young Gen GC time |                        |      0.162 |      s |
|                                       Total Young Gen GC count |                        |         17 |        |
|                                          Total Old Gen GC time |                        |          0 |      s |
|                                         Total Old Gen GC count |                        |          0 |        |
|                                                     Store size |                        |  0.0294357 |     GB |
|                                                  Translog size |                        |  0.0379527 |     GB |
|                                         Heap used for segments |                        |          0 |     MB |
|                                       Heap used for doc values |                        |          0 |     MB |
|                                            Heap used for terms |                        |          0 |     MB |
|                                            Heap used for norms |                        |          0 |     MB |
|                                           Heap used for points |                        |          0 |     MB |
|                                    Heap used for stored fields |                        |          0 |     MB |
|                                                  Segment count |                        |         26 |        |
|                                                 Min Throughput | single-bulk-index-task |     940.61 | docs/s |
|                                                Mean Throughput | single-bulk-index-task |    1205.66 | docs/s |
|                                              Median Throughput | single-bulk-index-task |    1205.66 | docs/s |
|                                                 Max Throughput | single-bulk-index-task |    1470.72 | docs/s |
|                                        50th percentile latency | single-bulk-index-task |     3858.7 |     ms |
|                                       100th percentile latency | single-bulk-index-task |    6775.95 |     ms |
|                                   50th percentile service time | single-bulk-index-task |     3858.7 |     ms |
|                                  100th percentile service time | single-bulk-index-task |    6775.95 |     ms |
|                                                     error rate | single-bulk-index-task |          0 |      % |

--------------------------------
[INFO] SUCCESS (took 21 seconds)
--------------------------------

Parallel indexing

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: 63d78463-b816-44de-9a5f-16a08084a061
[INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.
[INFO] Preparing file offset table for [/opensearch-benchmark/.benchmark/wazuh-alerts-parallelized/wazuh-alerts-benchmark-data-documents.json] ... [OK]
[WARNING] indexing_total_time is 36 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 328 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running 3k-events-task,0.5k-events-task,5k-events-task,1.5k-events-task        [100% done][INFO] Executing test with workload [wazuh-alerts-parallelized], test_procedure [parallel-any] and provision_config_instance ['external'] with version [2.14.0].

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |             Task |      Value |   Unit |
|---------------------------------------------------------------:|-----------------:|-----------:|-------:|
|                     Cumulative indexing time of primary shards |                  |   0.270033 |    min |
|             Min cumulative indexing time across primary shards |                  |          0 |    min |
|          Median cumulative indexing time across primary shards |                  |     0.0003 |    min |
|             Max cumulative indexing time across primary shards |                  |   0.262817 |    min |
|            Cumulative indexing throttle time of primary shards |                  |          0 |    min |
|    Min cumulative indexing throttle time across primary shards |                  |          0 |    min |
| Median cumulative indexing throttle time across primary shards |                  |          0 |    min |
|    Max cumulative indexing throttle time across primary shards |                  |          0 |    min |
|                        Cumulative merge time of primary shards |                  |     0.0474 |    min |
|                       Cumulative merge count of primary shards |                  |         16 |        |
|                Min cumulative merge time across primary shards |                  |          0 |    min |
|             Median cumulative merge time across primary shards |                  |          0 |    min |
|                Max cumulative merge time across primary shards |                  |  0.0384833 |    min |
|               Cumulative merge throttle time of primary shards |                  |          0 |    min |
|       Min cumulative merge throttle time across primary shards |                  |          0 |    min |
|    Median cumulative merge throttle time across primary shards |                  |          0 |    min |
|       Max cumulative merge throttle time across primary shards |                  |          0 |    min |
|                      Cumulative refresh time of primary shards |                  |  0.0934167 |    min |
|                     Cumulative refresh count of primary shards |                  |        215 |        |
|              Min cumulative refresh time across primary shards |                  |          0 |    min |
|           Median cumulative refresh time across primary shards |                  | 0.00273333 |    min |
|              Max cumulative refresh time across primary shards |                  |  0.0502833 |    min |
|                        Cumulative flush time of primary shards |                  |          0 |    min |
|                       Cumulative flush count of primary shards |                  |          0 |        |
|                Min cumulative flush time across primary shards |                  |          0 |    min |
|             Median cumulative flush time across primary shards |                  |          0 |    min |
|                Max cumulative flush time across primary shards |                  |          0 |    min |
|                                        Total Young Gen GC time |                  |      0.654 |      s |
|                                       Total Young Gen GC count |                  |         65 |        |
|                                          Total Old Gen GC time |                  |          0 |      s |
|                                         Total Old Gen GC count |                  |          0 |        |
|                                                     Store size |                  |   0.111155 |     GB |
|                                                  Translog size |                  |   0.153145 |     GB |
|                                         Heap used for segments |                  |          0 |     MB |
|                                       Heap used for doc values |                  |          0 |     MB |
|                                            Heap used for terms |                  |          0 |     MB |
|                                            Heap used for norms |                  |          0 |     MB |
|                                           Heap used for points |                  |          0 |     MB |
|                                    Heap used for stored fields |                  |          0 |     MB |
|                                                  Segment count |                  |         24 |        |
|                                                 Min Throughput |   5k-events-task |        544 | docs/s |
|                                                Mean Throughput |   5k-events-task |     567.73 | docs/s |
|                                              Median Throughput |   5k-events-task |     544.24 | docs/s |
|                                                 Max Throughput |   5k-events-task |     614.94 | docs/s |
|                                        50th percentile latency |   5k-events-task |    2398.31 |     ms |
|                                       100th percentile latency |   5k-events-task |    9173.81 |     ms |
|                                   50th percentile service time |   5k-events-task |    2398.31 |     ms |
|                                  100th percentile service time |   5k-events-task |    9173.81 |     ms |
|                                                     error rate |   5k-events-task |          0 |      % |
|                                                 Min Throughput |   3k-events-task |     516.46 | docs/s |
|                                                Mean Throughput |   3k-events-task |        619 | docs/s |
|                                              Median Throughput |   3k-events-task |     608.69 | docs/s |
|                                                 Max Throughput |   3k-events-task |     732.44 | docs/s |
|                                        50th percentile latency |   3k-events-task |    1587.65 |     ms |
|                                       100th percentile latency |   3k-events-task |    5794.57 |     ms |
|                                   50th percentile service time |   3k-events-task |    1587.65 |     ms |
|                                  100th percentile service time |   3k-events-task |    5794.57 |     ms |
|                                                     error rate |   3k-events-task |          0 |      % |
|                                                 Min Throughput | 1.5k-events-task |     326.99 | docs/s |
|                                                Mean Throughput | 1.5k-events-task |     564.57 | docs/s |
|                                              Median Throughput | 1.5k-events-task |     585.95 | docs/s |
|                                                 Max Throughput | 1.5k-events-task |     780.17 | docs/s |
|                                        50th percentile latency | 1.5k-events-task |    1036.56 |     ms |
|                                       100th percentile latency | 1.5k-events-task |    4576.54 |     ms |
|                                   50th percentile service time | 1.5k-events-task |    1036.56 |     ms |
|                                  100th percentile service time | 1.5k-events-task |    4576.54 |     ms |
|                                                     error rate | 1.5k-events-task |          0 |      % |
|                                                 Min Throughput | 0.5k-events-task |     180.57 | docs/s |
|                                                Mean Throughput | 0.5k-events-task |     484.79 | docs/s |
|                                              Median Throughput | 0.5k-events-task |      520.7 | docs/s |
|                                                 Max Throughput | 0.5k-events-task |     851.66 | docs/s |
|                                        50th percentile latency | 0.5k-events-task |    364.129 |     ms |
|                                        90th percentile latency | 0.5k-events-task |    725.394 |     ms |
|                                       100th percentile latency | 0.5k-events-task |    2762.02 |     ms |
|                                   50th percentile service time | 0.5k-events-task |    364.129 |     ms |
|                                   90th percentile service time | 0.5k-events-task |    725.394 |     ms |
|                                  100th percentile service time | 0.5k-events-task |    2762.02 |     ms |
|                                                     error rate | 0.5k-events-task |          0 |      % |

--------------------------------
[INFO] SUCCESS (took 26 seconds)
--------------------------------

f-galland commented 3 months ago

It was determined that we need to test the optimal ingest bulk size in the 10-100MB range. For this, I set up a workload that progressively ramps the bulk size in 2MB intervals. The benchmark information will be stored to an opensearch cluster so we can plot it.

I'm currently working on setting up a benchmark as the one above on a 3 node cluster on top of 3 EC2 instances. Given the fact that we need the tests to be easily reproducible, I'm setting up dockerized nodes. This should allow bringing the node and its data down and back up in a few commands using docker contexts from an outside host.

The benchmark can be run locally from any terminal.

f-galland commented 3 months ago

I created a number of wazuh-alerts json files with file sizes ranging from 5MB through 100MB in 5MB increments.

root@os-benchmarks:~/benchmarks/wazuh-alerts# ls -lh
total 1.1G
-rwxrwxrwx 1 root root 1.1K Jul  1 20:05 generate_config.sh
-rwxrwxrwx 1 root root  270 Jul  1 19:14 generate_files.sh
drwxrwxrwx 2 root root 4.0K Jul  1 20:16 operations
drwxrwxrwx 2 root root 4.0K Jul  1 20:31 test_procedures
-rw-rw-rw- 1 root root  10M Jul  1 19:14 wazuh-alerts-10.json
-rw-rw-rw- 1 root root 100M Jul  1 19:06 wazuh-alerts-100.json
-rw-rw-rw- 1 root root  15M Jul  1 19:14 wazuh-alerts-15.json
-rw-rw-rw- 1 root root  20M Jul  1 19:14 wazuh-alerts-20.json
-rw-rw-rw- 1 root root  25M Jul  1 19:14 wazuh-alerts-25.json
-rw-rw-rw- 1 root root  30M Jul  1 19:14 wazuh-alerts-30.json
-rw-rw-rw- 1 root root  35M Jul  1 19:14 wazuh-alerts-35.json
-rw-rw-rw- 1 root root  40M Jul  1 19:14 wazuh-alerts-40.json
-rw-rw-rw- 1 root root  45M Jul  1 19:14 wazuh-alerts-45.json
-rw-rw-rw- 1 root root 5.0M Jul  1 19:14 wazuh-alerts-5.json
-rw-rw-rw- 1 root root  50M Jul  1 19:14 wazuh-alerts-50.json
-rw-rw-rw- 1 root root  55M Jul  1 19:14 wazuh-alerts-55.json
-rw-rw-rw- 1 root root  60M Jul  1 19:14 wazuh-alerts-60.json
-rw-rw-rw- 1 root root  65M Jul  1 19:14 wazuh-alerts-65.json
-rw-rw-rw- 1 root root  70M Jul  1 19:14 wazuh-alerts-70.json
-rw-rw-rw- 1 root root  75M Jul  1 19:14 wazuh-alerts-75.json
-rw-rw-rw- 1 root root  80M Jul  1 19:14 wazuh-alerts-80.json
-rw-rw-rw- 1 root root  85M Jul  1 19:14 wazuh-alerts-85.json
-rw-rw-rw- 1 root root  90M Jul  1 19:14 wazuh-alerts-90.json
-rw-rw-rw- 1 root root  95M Jul  1 19:14 wazuh-alerts-95.json
-rw-rw-rw- 1 root root 141K Jun 27 11:28 wazuh-alerts.json
-rw-rw-rw- 1 root root 6.2K Jul  1 20:36 workload.json

The workload.json and test_procedures/default.json files were updated to run bulk indexing tests for each of these files.

workload.json

```json {% import "benchmark.helpers" as benchmark with context %} { "version": 2, "description": "Tracker-generated workload for wazuh-alerts", "indices": [ { "name": "wazuh-alerts-5", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-10", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-15", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-20", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-25", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-30", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-35", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-40", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-45", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-50", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-55", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-60", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-65", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-70", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-75", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-80", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-85", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-90", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-95", "body": "wazuh-alerts.json" }, { "name": "wazuh-alerts-100", "body": "wazuh-alerts.json" } ], "corpora": [ { "name": "wazuh-alerts-5", "documents": [ { "target-index": "wazuh-alerts-5", "source-file": "wazuh-alerts-5.json", "document-count": 3426 } ] }, { "name": "wazuh-alerts-10", "documents": [ { "target-index": "wazuh-alerts-10", "source-file": "wazuh-alerts-10.json", "document-count": 6616 } ] }, { "name": "wazuh-alerts-15", "documents": [ { "target-index": "wazuh-alerts-15", "source-file": "wazuh-alerts-15.json", "document-count": 9958 } ] }, { "name": "wazuh-alerts-20", "documents": [ { "target-index": "wazuh-alerts-20", "source-file": "wazuh-alerts-20.json", "document-count": 13933 } ] }, { "name": "wazuh-alerts-25", "documents": [ { "target-index": "wazuh-alerts-25", "source-file": "wazuh-alerts-25.json", "document-count": 17180 } ] }, { "name": "wazuh-alerts-30", "documents": [ { "target-index": "wazuh-alerts-30", "source-file": "wazuh-alerts-30.json", "document-count": 20404 } ] }, { "name": "wazuh-alerts-35", "documents": [ { "target-index": "wazuh-alerts-35", "source-file": "wazuh-alerts-35.json", "document-count": 23737 } ] }, { "name": "wazuh-alerts-40", "documents": [ { "target-index": "wazuh-alerts-40", "source-file": "wazuh-alerts-40.json", "document-count": 27706 } ] }, { "name": "wazuh-alerts-45", "documents": [ { "target-index": "wazuh-alerts-45", "source-file": "wazuh-alerts-45.json", "document-count": 30998 } ] }, { "name": "wazuh-alerts-50", "documents": [ { "target-index": "wazuh-alerts-50", "source-file": "wazuh-alerts-50.json", "document-count": 34187 } ] }, { "name": "wazuh-alerts-55", "documents": [ { "target-index": "wazuh-alerts-55", "source-file": "wazuh-alerts-55.json", "document-count": 37774 } ] }, { "name": "wazuh-alerts-60", "documents": [ { "target-index": "wazuh-alerts-60", "source-file": "wazuh-alerts-60.json", "document-count": 41473 } ] }, { "name": "wazuh-alerts-65", "documents": [ { "target-index": "wazuh-alerts-65", "source-file": "wazuh-alerts-65.json", "document-count": 44729 } ] }, { "name": "wazuh-alerts-70", "documents": [ { "target-index": "wazuh-alerts-70", "source-file": "wazuh-alerts-70.json", "document-count": 47947 } ] }, { "name": "wazuh-alerts-75", "documents": [ { "target-index": "wazuh-alerts-75", "source-file": "wazuh-alerts-75.json", "document-count": 51993 } ] }, { "name": "wazuh-alerts-80", "documents": [ { "target-index": "wazuh-alerts-80", "source-file": "wazuh-alerts-80.json", "document-count": 55225 } ] }, { "name": "wazuh-alerts-85", "documents": [ { "target-index": "wazuh-alerts-85", "source-file": "wazuh-alerts-85.json", "document-count": 58442 } ] }, { "name": "wazuh-alerts-90", "documents": [ { "target-index": "wazuh-alerts-90", "source-file": "wazuh-alerts-90.json", "document-count": 61854 } ] }, { "name": "wazuh-alerts-95", "documents": [ { "target-index": "wazuh-alerts-95", "source-file": "wazuh-alerts-95.json", "document-count": 65786 } ] }, { "name": "wazuh-alerts-100", "documents": [ { "target-index": "wazuh-alerts-100", "source-file": "wazuh-alerts-100.json", "document-count": 69053 } ] } ], "test_procedures": [ {{ benchmark.collect(parts="test_procedures/*.json") }} ] } ```

test_procedures/default.json

```json { "name": "Wazuh Alerts Ingestion Test", "description": "Test ingestion in 5MB increments", "default": true, "schedule": [ { "operation": { "name": "bulk-index-5-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-5", "bulk-size": 3426 } }, { "operation": { "name": "bulk-index-10-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-10", "bulk-size": 6616 } }, { "operation": { "name": "bulk-index-15-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-15", "bulk-size": 9958 } }, { "operation": { "name": "bulk-index-20-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-20", "bulk-size": 13933 } }, { "operation": { "name": "bulk-index-25-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-25", "bulk-size": 17180 } }, { "operation": { "name": "bulk-index-30-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-30", "bulk-size": 20404 } }, { "operation": { "name": "bulk-index-35-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-35", "bulk-size": 23737 } }, { "operation": { "name": "bulk-index-40-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-40", "bulk-size": 27706 } }, { "operation": { "name": "bulk-index-45-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-45", "bulk-size": 30998 } }, { "operation": { "name": "bulk-index-50-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-50", "bulk-size": 34187 } }, { "operation": { "name": "bulk-index-55-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-55", "bulk-size": 37774 } }, { "operation": { "name": "bulk-index-60-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-60", "bulk-size": 41473 } }, { "operation": { "name": "bulk-index-65-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-65", "bulk-size": 44729 } }, { "operation": { "name": "bulk-index-70-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-70", "bulk-size": 47947 } }, { "operation": { "name": "bulk-index-75-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-75", "bulk-size": 51993 } }, { "operation": { "name": "bulk-index-80-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-80", "bulk-size": 55225 } }, { "operation": { "name": "bulk-index-85-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-85", "bulk-size": 58442 } }, { "operation": { "name": "bulk-index-90-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-90", "bulk-size": 61854 } }, { "operation": { "name": "bulk-index-95-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-95", "bulk-size": 65786 } }, { "operation": { "name": "bulk-index-100-mb", "operation-type": "bulk", "corpora": "wazuh-alerts-100", "bulk-size": 69053 } } ] } ```

Local results

```shell root@os-benchmarks:~# docker compose up opensearch-benchmark [+] Running 3/0 ✔ Container opensearch-node1 Running 0.0s ✔ Container permissions-setter Created 0.0s ✔ Container opensearch-benchmark Created 0.0s Attaching to opensearch-benchmark opensearch-benchmark | opensearch-benchmark | ____ _____ __ ____ __ __ opensearch-benchmark | / __ \____ ___ ____ / ___/___ ____ ___________/ /_ / __ )___ ____ _____/ /_ ____ ___ ____ ______/ /__ opensearch-benchmark | / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \ / __ / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/ opensearch-benchmark | / /_/ / /_/ / __/ / / /__/ / __/ /_/ / / / /__/ / / / / /_/ / __/ / / / /__/ / / / / / / / / /_/ / / / ,< opensearch-benchmark | \____/ .___/\___/_/ /_/____/\___/\__,_/_/ \___/_/ /_/ /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/ /_/|_| opensearch-benchmark | /_/ opensearch-benchmark | opensearch-benchmark | [INFO] [Test Execution ID]: 83bb247f-44e8-46f0-a763-73f10b6d4577 opensearch-benchmark | [INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds. opensearch-benchmark | [WARNING] merges_total_time is 1059 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading. opensearch-benchmark | [WARNING] indexing_total_time is 16708 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading. opensearch-benchmark | [WARNING] refresh_total_time is 11434 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading. opensearch-benchmark | [WARNING] flush_total_time is 38 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading. opensearch-benchmark | Running bulk-index-5-mb [100% done] Running bulk-index-10-mb [100% done] Running bulk-index-15-mb [100% done] Running bulk-index-20-mb [100% done] Running bulk-index-25-mb [100% done] Running bulk-index-30-mb [100% done] Running bulk-index-35-mb [100% done] Running bulk-index-40-mb [100% done] Running bulk-index-45-mb [100% done] opensearch-benchmark | [ERROR] rejected_execution_exception ({'error': {'root_cause': [{'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=56497001, max_coordinating_and_primary_bytes=53687091]'}], 'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=56497001, max_coordinating_and_primary_bytes=53687091]'}, 'status': 429}) Running bulk-index-50-mb [100% done] opensearch-benchmark | [ERROR] rejected_execution_exception ({'error': {'root_cause': [{'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=62166679, max_coordinating_and_primary_bytes=53687091]'}], 'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=62166679, max_coordinating_and_primary_bytes=53687091]'}, 'status': 429}) Running bulk-index-55-mb [100% done] opensearch-benchmark | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [516691048/492.7mb], which is larger than the limit of [510027366/486.3mb], real usage: [387461696/369.5mb], new bytes reserved: [129229352/123.2mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=129229352/123.2mb]', 'bytes_wanted': 516691048, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [516691048/492.7mb], which is larger than the limit of [510027366/486.3mb], real usage: [387461696/369.5mb], new bytes reserved: [129229352/123.2mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=129229352/123.2mb]', 'bytes_wanted': 516691048, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429}) Running bulk-index-60-mb [100% done] opensearch-benchmark | [ERROR] rejected_execution_exception ({'error': {'root_cause': [{'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=73480223, max_coordinating_and_primary_bytes=53687091]'}], 'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=73480223, max_coordinating_and_primary_bytes=53687091]'}, 'status': 429}) Running bulk-index-65-mb [100% done] opensearch-benchmark | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [541932828/516.8mb], which is larger than the limit of [510027366/486.3mb], real usage: [391200848/373mb], new bytes reserved: [150731980/143.7mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=150731980/143.7mb]', 'bytes_wanted': 541932828, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [541932828/516.8mb], which is larger than the limit of [510027366/486.3mb], real usage: [391200848/373mb], new bytes reserved: [150731980/143.7mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=150731980/143.7mb]', 'bytes_wanted': 541932828, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429}) Running bulk-index-70-mb [100% done] opensearch-benchmark | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [555845794/530mb], which is larger than the limit of [510027366/486.3mb], real usage: [394297376/376mb], new bytes reserved: [161548418/154mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=161548418/154mb]', 'bytes_wanted': 555845794, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [555845794/530mb], which is larger than the limit of [510027366/486.3mb], real usage: [394297376/376mb], new bytes reserved: [161548418/154mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=161548418/154mb]', 'bytes_wanted': 555845794, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429}) Running bulk-index-75-mb [100% done] opensearch-benchmark | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [572270300/545.7mb], which is larger than the limit of [510027366/486.3mb], real usage: [399970840/381.4mb], new bytes reserved: [172299460/164.3mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=172299460/164.3mb]', 'bytes_wanted': 572270300, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [572270300/545.7mb], which is larger than the limit of [510027366/486.3mb], real usage: [399970840/381.4mb], new bytes reserved: [172299460/164.3mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=172299460/164.3mb]', 'bytes_wanted': 572270300, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429}) Running bulk-index-80-mb [100% done] opensearch-benchmark | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [591658024/564.2mb], which is larger than the limit of [510027366/486.3mb], real usage: [408608040/389.6mb], new bytes reserved: [183049984/174.5mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=183049984/174.5mb]', 'bytes_wanted': 591658024, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [591658024/564.2mb], which is larger than the limit of [510027366/486.3mb], real usage: [408608040/389.6mb], new bytes reserved: [183049984/174.5mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=183049984/174.5mb]', 'bytes_wanted': 591658024, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429}) Running bulk-index-85-mb [100% done] opensearch-benchmark | [ERROR] rejected_execution_exception ({'error': {'root_cause': [{'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=101732377, max_coordinating_and_primary_bytes=53687091]'}], 'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=101732377, max_coordinating_and_primary_bytes=53687091]'}, 'status': 429}) Running bulk-index-90-mb [100% done] opensearch-benchmark | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [526842176/502.4mb], which is larger than the limit of [510027366/486.3mb], real usage: [322218352/307.2mb], new bytes reserved: [204623824/195.1mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=204623824/195.1mb]', 'bytes_wanted': 526842176, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [] would be [526842176/502.4mb], which is larger than the limit of [510027366/486.3mb], real usage: [322218352/307.2mb], new bytes reserved: [204623824/195.1mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=204623824/195.1mb]', 'bytes_wanted': 526842176, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429}) Running bulk-index-95-mb [100% done] opensearch-benchmark | [ERROR] Running bulk-index-100-mb [100% done][INFO] Executing test with workload [wazuh-alerts], test_procedure [Wazuh Alerts Ingestion Test] and provision_config_instance ['external'] with version [2.14.0]. opensearch-benchmark | opensearch-benchmark | opensearch-benchmark | ------------------------------------------------------ opensearch-benchmark | _______ __ _____ opensearch-benchmark | / ____(_)___ ____ _/ / / ___/_________ ________ opensearch-benchmark | / /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \ opensearch-benchmark | / __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/ opensearch-benchmark | /_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/ opensearch-benchmark | ------------------------------------------------------ opensearch-benchmark | opensearch-benchmark | | Metric | Task | Value | Unit | opensearch-benchmark | |---------------------------------------------------------------:|------------------:|------------:|-------:| opensearch-benchmark | | Cumulative indexing time of primary shards | | 0.512283 | min | opensearch-benchmark | | Min cumulative indexing time across primary shards | | 0 | min | opensearch-benchmark | | Median cumulative indexing time across primary shards | | 0.0343667 | min | opensearch-benchmark | | Max cumulative indexing time across primary shards | | 0.0954833 | min | opensearch-benchmark | | Cumulative indexing throttle time of primary shards | | 0 | min | opensearch-benchmark | | Min cumulative indexing throttle time across primary shards | | 0 | min | opensearch-benchmark | | Median cumulative indexing throttle time across primary shards | | 0 | min | opensearch-benchmark | | Max cumulative indexing throttle time across primary shards | | 0 | min | opensearch-benchmark | | Cumulative merge time of primary shards | | 0.01765 | min | opensearch-benchmark | | Cumulative merge count of primary shards | | 74 | | opensearch-benchmark | | Min cumulative merge time across primary shards | | 0 | min | opensearch-benchmark | | Median cumulative merge time across primary shards | | 0 | min | opensearch-benchmark | | Max cumulative merge time across primary shards | | 0.01765 | min | opensearch-benchmark | | Cumulative merge throttle time of primary shards | | 0 | min | opensearch-benchmark | | Min cumulative merge throttle time across primary shards | | 0 | min | opensearch-benchmark | | Median cumulative merge throttle time across primary shards | | 0 | min | opensearch-benchmark | | Max cumulative merge throttle time across primary shards | | 0 | min | opensearch-benchmark | | Cumulative refresh time of primary shards | | 0.223817 | min | opensearch-benchmark | | Cumulative refresh count of primary shards | | 787 | | opensearch-benchmark | | Min cumulative refresh time across primary shards | | 0 | min | opensearch-benchmark | | Median cumulative refresh time across primary shards | | 0.008775 | min | opensearch-benchmark | | Max cumulative refresh time across primary shards | | 0.103083 | min | opensearch-benchmark | | Cumulative flush time of primary shards | | 0.00116667 | min | opensearch-benchmark | | Cumulative flush count of primary shards | | 5 | | opensearch-benchmark | | Min cumulative flush time across primary shards | | 0 | min | opensearch-benchmark | | Median cumulative flush time across primary shards | | 0 | min | opensearch-benchmark | | Max cumulative flush time across primary shards | | 0.000533333 | min | opensearch-benchmark | | Total Young Gen GC time | | 0.502 | s | opensearch-benchmark | | Total Young Gen GC count | | 116 | | opensearch-benchmark | | Total Old Gen GC time | | 0 | s | opensearch-benchmark | | Total Old Gen GC count | | 0 | | opensearch-benchmark | | Store size | | 0.268493 | GB | opensearch-benchmark | | Translog size | | 0.463153 | GB | opensearch-benchmark | | Heap used for segments | | 0 | MB | opensearch-benchmark | | Heap used for doc values | | 0 | MB | opensearch-benchmark | | Heap used for terms | | 0 | MB | opensearch-benchmark | | Heap used for norms | | 0 | MB | opensearch-benchmark | | Heap used for points | | 0 | MB | opensearch-benchmark | | Heap used for stored fields | | 0 | MB | opensearch-benchmark | | Segment count | | 58 | | opensearch-benchmark | | Min Throughput | bulk-index-5-mb | 7803.9 | docs/s | opensearch-benchmark | | Mean Throughput | bulk-index-5-mb | 7803.9 | docs/s | opensearch-benchmark | | Median Throughput | bulk-index-5-mb | 7803.9 | docs/s | opensearch-benchmark | | Max Throughput | bulk-index-5-mb | 7803.9 | docs/s | opensearch-benchmark | | 100th percentile latency | bulk-index-5-mb | 429.577 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-5-mb | 429.577 | ms | opensearch-benchmark | | error rate | bulk-index-5-mb | 0 | % | opensearch-benchmark | | Min Throughput | bulk-index-10-mb | 8445.34 | docs/s | opensearch-benchmark | | Mean Throughput | bulk-index-10-mb | 8445.34 | docs/s | opensearch-benchmark | | Median Throughput | bulk-index-10-mb | 8445.34 | docs/s | opensearch-benchmark | | Max Throughput | bulk-index-10-mb | 8445.34 | docs/s | opensearch-benchmark | | 100th percentile latency | bulk-index-10-mb | 774.252 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-10-mb | 774.252 | ms | opensearch-benchmark | | error rate | bulk-index-10-mb | 0 | % | opensearch-benchmark | | Min Throughput | bulk-index-15-mb | 9598 | docs/s | opensearch-benchmark | | Mean Throughput | bulk-index-15-mb | 9598 | docs/s | opensearch-benchmark | | Median Throughput | bulk-index-15-mb | 9598 | docs/s | opensearch-benchmark | | Max Throughput | bulk-index-15-mb | 9598 | docs/s | opensearch-benchmark | | 100th percentile latency | bulk-index-15-mb | 1028.43 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-15-mb | 1028.43 | ms | opensearch-benchmark | | error rate | bulk-index-15-mb | 0 | % | opensearch-benchmark | | Min Throughput | bulk-index-20-mb | 9030.53 | docs/s | opensearch-benchmark | | Mean Throughput | bulk-index-20-mb | 9030.53 | docs/s | opensearch-benchmark | | Median Throughput | bulk-index-20-mb | 9030.53 | docs/s | opensearch-benchmark | | Max Throughput | bulk-index-20-mb | 9030.53 | docs/s | opensearch-benchmark | | 100th percentile latency | bulk-index-20-mb | 1530.53 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-20-mb | 1530.53 | ms | opensearch-benchmark | | error rate | bulk-index-20-mb | 0 | % | opensearch-benchmark | | Min Throughput | bulk-index-25-mb | 9354.05 | docs/s | opensearch-benchmark | | Mean Throughput | bulk-index-25-mb | 9354.05 | docs/s | opensearch-benchmark | | Median Throughput | bulk-index-25-mb | 9354.05 | docs/s | opensearch-benchmark | | Max Throughput | bulk-index-25-mb | 9354.05 | docs/s | opensearch-benchmark | | 100th percentile latency | bulk-index-25-mb | 1821.08 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-25-mb | 1821.08 | ms | opensearch-benchmark | | error rate | bulk-index-25-mb | 0 | % | opensearch-benchmark | | Min Throughput | bulk-index-30-mb | 10155.6 | docs/s | opensearch-benchmark | | Mean Throughput | bulk-index-30-mb | 10155.6 | docs/s | opensearch-benchmark | | Median Throughput | bulk-index-30-mb | 10155.6 | docs/s | opensearch-benchmark | | Max Throughput | bulk-index-30-mb | 10155.6 | docs/s | opensearch-benchmark | | 100th percentile latency | bulk-index-30-mb | 1992.57 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-30-mb | 1992.57 | ms | opensearch-benchmark | | error rate | bulk-index-30-mb | 0 | % | opensearch-benchmark | | Min Throughput | bulk-index-35-mb | 9319.89 | docs/s | opensearch-benchmark | | Mean Throughput | bulk-index-35-mb | 9319.89 | docs/s | opensearch-benchmark | | Median Throughput | bulk-index-35-mb | 9319.89 | docs/s | opensearch-benchmark | | Max Throughput | bulk-index-35-mb | 9319.89 | docs/s | opensearch-benchmark | | 100th percentile latency | bulk-index-35-mb | 2526.72 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-35-mb | 2526.72 | ms | opensearch-benchmark | | error rate | bulk-index-35-mb | 0 | % | opensearch-benchmark | | Min Throughput | bulk-index-40-mb | 8431.44 | docs/s | opensearch-benchmark | | Mean Throughput | bulk-index-40-mb | 8431.44 | docs/s | opensearch-benchmark | | Median Throughput | bulk-index-40-mb | 8431.44 | docs/s | opensearch-benchmark | | Max Throughput | bulk-index-40-mb | 8431.44 | docs/s | opensearch-benchmark | | 100th percentile latency | bulk-index-40-mb | 3258.53 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-40-mb | 3258.53 | ms | opensearch-benchmark | | error rate | bulk-index-40-mb | 0 | % | opensearch-benchmark | | Min Throughput | bulk-index-45-mb | 8907.38 | docs/s | opensearch-benchmark | | Mean Throughput | bulk-index-45-mb | 8907.38 | docs/s | opensearch-benchmark | | Median Throughput | bulk-index-45-mb | 8907.38 | docs/s | opensearch-benchmark | | Max Throughput | bulk-index-45-mb | 8907.38 | docs/s | opensearch-benchmark | | 100th percentile latency | bulk-index-45-mb | 3453.96 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-45-mb | 3453.96 | ms | opensearch-benchmark | | error rate | bulk-index-45-mb | 0 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-50-mb | 225.545 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-50-mb | 225.545 | ms | opensearch-benchmark | | error rate | bulk-index-50-mb | 100 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-55-mb | 210.582 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-55-mb | 210.582 | ms | opensearch-benchmark | | error rate | bulk-index-55-mb | 100 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-60-mb | 194.882 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-60-mb | 194.882 | ms | opensearch-benchmark | | error rate | bulk-index-60-mb | 100 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-65-mb | 243.262 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-65-mb | 243.262 | ms | opensearch-benchmark | | error rate | bulk-index-65-mb | 100 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-70-mb | 209.641 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-70-mb | 209.641 | ms | opensearch-benchmark | | error rate | bulk-index-70-mb | 100 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-75-mb | 262.526 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-75-mb | 262.526 | ms | opensearch-benchmark | | error rate | bulk-index-75-mb | 100 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-80-mb | 252.767 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-80-mb | 252.767 | ms | opensearch-benchmark | | error rate | bulk-index-80-mb | 100 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-85-mb | 262.211 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-85-mb | 262.211 | ms | opensearch-benchmark | | error rate | bulk-index-85-mb | 100 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-90-mb | 320.98 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-90-mb | 320.98 | ms | opensearch-benchmark | | error rate | bulk-index-90-mb | 100 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-95-mb | 300.169 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-95-mb | 300.169 | ms | opensearch-benchmark | | error rate | bulk-index-95-mb | 100 | % | opensearch-benchmark | | 100th percentile latency | bulk-index-100-mb | 165.695 | ms | opensearch-benchmark | | 100th percentile service time | bulk-index-100-mb | 165.695 | ms | opensearch-benchmark | | error rate | bulk-index-100-mb | 100 | % | opensearch-benchmark | opensearch-benchmark | opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-50-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-50-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-55-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-55-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-60-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-60-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-65-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-65-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-70-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-70-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-75-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-75-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-80-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-80-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-85-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-85-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-90-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-90-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-95-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-95-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | [WARNING] Error rate is 100.0 for operation 'bulk-index-100-mb'. Please check the logs. opensearch-benchmark | [WARNING] No throughput metrics available for [bulk-index-100-mb]. Likely cause: Error rate is 100.0%. Please check the logs. opensearch-benchmark | opensearch-benchmark | --------------------------------- opensearch-benchmark | [INFO] SUCCESS (took 136 seconds) opensearch-benchmark | --------------------------------- opensearch-benchmark exited with code 0 ```

Operations above 50MB return a 429 error code (too many requests) so I need to tweak this a little further to make sure I'm giving the cluster enough time to process the request.

So far, I've only tested this on my local dockerized environment, but I have the EC2 infrastructure ready to run the tests as soon as I've refined the workloads to include shard allocation and proper warm-up and clean up stages.

f-galland commented 2 months ago

Setup:

The benchmarks were run on top of 3 EC2 instances with 16GB RAM and a 8 core, 2200MHz AMD EPYC 7571 processor. Each node had the docker backend installed, and was controlled using a context from my local machine.

docker context ls

```shell $ docker context ls NAME DESCRIPTION DOCKER ENDPOINT ERROR benchmark ssh://root@benchmark default * Current DOCKER_HOST based configuration unix:///var/run/docker.sock node-1 ssh://root@benchmark-node1 node-2 ssh://root@benchmark-node2 node-3 ssh://root@benchmark-node3 ```

Each node had its own docker compose:

node-1.yml

```yml services: opensearch-node1: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/) image: opensearchproject/opensearch:2.14.0 container_name: opensearch-node1 hostname: opensearch-node1 environment: - NODE1_LOCAL_IP=${NODE1_LOCAL_IP} - NODE2_LOCAL_IP=${NODE2_LOCAL_IP} - NODE3_LOCAL_IP=${NODE3_LOCAL_IP} - cluster.name=opensearch-cluster # Name the cluster - network.publish_host=${NODE1_LOCAL_IP} - http.publish_host=${NODE1_LOCAL_IP} - transport.publish_host=${NODE1_LOCAL_IP} - node.name=opensearch-node1 # Name the node that will run in this container - discovery.seed_hosts=${NODE1_LOCAL_IP},${NODE2_LOCAL_IP},${NODE3_LOCAL_IP}, # Nodes to look for when discovering the cluster - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2,opensearch-node3 # Nodes eligibile to serve as cluster manager - bootstrap.memory_lock=true # Disable JVM heap memory swapping - "OPENSEARCH_JAVA_OPTS=-Xms8g -Xmx8g" # Set min and max JVM heap sizes to at least 50% of system RAM - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD} # Sets the demo admin user password when using demo configuration (for OpenSearch 2.12 and later) ulimits: memlock: soft: -1 # Set memlock to unlimited (no soft or hard limit) hard: -1 nofile: soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536 hard: 65536 volumes: - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container #healthcheck: # test: curl -sku admin:${OPENSEARCH_INITIAL_ADMIN_PASSWORD} https://opensearch-node1:9200/_cat/health | grep -q opensearch-cluster # start_period: 10s # start_interval: 3s ports: - 9200:9200 # REST API - 9300:9300 # REST API - 9600:9600 # Performance Analyzer networks: - opensearch-net # All of the containers will join the same Docker bridge network opensearch-dashboards: image: opensearchproject/opensearch-dashboards:2.14.0 # Make sure the version of opensearch-dashboards matches the version of opensearch installed on other nodes container_name: opensearch-dashboards #depends_on: # opensearch-node1: # condition: service_healthy ports: - 5601:5601 # Map host port 5601 to container port 5601 expose: - "5601" # Expose port 5601 for web access to OpenSearch Dashboards environment: - NODE1_LOCAL_IP=${NODE1_LOCAL_IP} - NODE2_LOCAL_IP=${NODE2_LOCAL_IP} - NODE3_LOCAL_IP=${NODE3_LOCAL_IP} - OPENSEARCH_HOSTS=["https://${NODE1_LOCAL_IP}:9200","https://${NODE2_LOCAL_IP}:9200","https://${NODE3_LOCAL_IP}:9200"] networks: - opensearch-net volumes: opensearch-data1: networks: opensearch-net: ```

node-2.yml

```yml services: opensearch-node2: image: opensearchproject/opensearch:2.14.0 # This should be the same image used for opensearch-node1 to avoid issues container_name: opensearch-node2 hostname: opensearch-node2 environment: - NODE1_LOCAL_IP=${NODE1_LOCAL_IP} - NODE2_LOCAL_IP=${NODE2_LOCAL_IP} - NODE3_LOCAL_IP=${NODE3_LOCAL_IP} - cluster.name=opensearch-cluster - network.publish_host=${NODE2_LOCAL_IP} - http.publish_host=${NODE2_LOCAL_IP} - transport.publish_host=${NODE2_LOCAL_IP} - node.name=opensearch-node2 - discovery.seed_hosts=${NODE1_LOCAL_IP},${NODE2_LOCAL_IP},${NODE3_LOCAL_IP}, # Nodes to look for when discovering the cluster - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2,opensearch-node3 - bootstrap.memory_lock=true - "OPENSEARCH_JAVA_OPTS=-Xms8g -Xmx8g" - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD} ports: - 9200:9200 - 9300:9300 - 9600:9600 ulimits: memlock: soft: -1 hard: -1 nofile: soft: 65536 hard: 65536 volumes: - opensearch-data2:/usr/share/opensearch/data networks: - opensearch-net volumes: opensearch-data2: networks: opensearch-net: ```

node-3.yml

```yml services: opensearch-node3: image: opensearchproject/opensearch:2.14.0 # This should be the same image used for opensearch-node1 to avoid issues container_name: opensearch-node3 hostname: opensearch-node3 environment: - NODE1_LOCAL_IP=${NODE1_LOCAL_IP} - NODE2_LOCAL_IP=${NODE2_LOCAL_IP} - NODE3_LOCAL_IP=${NODE3_LOCAL_IP} - network.publish_host=${NODE3_LOCAL_IP} - http.publish_host=${NODE3_LOCAL_IP} - transport.publish_host=${NODE3_LOCAL_IP} - cluster.name=opensearch-cluster - node.name=opensearch-node3 - discovery.seed_hosts=${NODE1_LOCAL_IP},${NODE2_LOCAL_IP},${NODE3_LOCAL_IP}, # Nodes to look for when discovering the cluster - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2,opensearch-node3 - bootstrap.memory_lock=true - "OPENSEARCH_JAVA_OPTS=-Xms8g -Xmx8g" - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD} ports: - 9200:9200 - 9300:9300 - 9600:9600 ulimits: memlock: soft: -1 hard: -1 nofile: soft: 65536 hard: 65536 volumes: - opensearch-data3:/usr/share/opensearch/data networks: - opensearch-net volumes: opensearch-data3: networks: opensearch-net: ```

The cluster itself was brought up from a script in my local machine (making use of the remote contexts) for convenience:

cluster.sh

```shell #!/bin/bash case $1 in down) for i in {1..3} do echo "Bringing node-$i down" docker --context=node-$i compose -f node-$i.yml down -v done ;; up) for i in {1..3} do echo "Bringing node-$i up" docker --context=node-$i compose -f node-$i.yml up -d done ;; logs) docker --context=node-$2 logs opensearch-node$2 ;; ps) docker --context=node-$2 ps -a ;; run) docker --context=benchmark compose -f benchmark.yml up -d ;; results) docker --context=benchmark logs opensearch-benchmark -f ;; *) echo "Unrecognized option" ;; esac exit 0 ```

Lastly, a 4th ec2 instance was used to run the actual benchmark from the following docker compose:

docker-compose.yml

```yml services: opensearch-benchmark: image: opensearchproject/opensearch-benchmark:1.6.0 hostname: opensearch-benchmark container_name: opensearch-benchmark volumes: - /root/benchmarks:/opensearch-benchmark/.benchmark environment: - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD} - BENCHMARK_NAME=${BENCHMARK_NAME} - NODE1_LOCAL_IP=${NODE1_LOCAL_IP} - NODE2_LOCAL_IP=${NODE2_LOCAL_IP} - NODE3_LOCAL_IP=${NODE3_LOCAL_IP} #command: execute-test --target-hosts https://opensearch-node1:9200 --pipeline benchmark-only --workload geonames --client-options basic_auth_user:admin,basic_auth_password:${OPENSEARCH_INITIAL_ADMIN_PASSWORD},verify_certs:false --test-mode command: execute-test --pipeline="benchmark-only" --workload-path="/opensearch-benchmark/.benchmark/${BENCHMARK_NAME}" --target-hosts="https://${NODE1_LOCAL_IP}:9200,https://${NODE2_LOCAL_IP}:9200,https://${NODE3_LOCAL_IP}:9200" --client-options="basic_auth_user:admin,basic_auth_password:${OPENSEARCH_INITIAL_ADMIN_PASSWORD},verify_certs:false" networks: - opensearch-net # All of the containers will join the same Docker bridge network permissions-setter: image: alpine:3.14 container_name: permissions-setter volumes: - /root/benchmarks:/benchmark entrypoint: /bin/sh command: > -c ' chmod -R a+rw /benchmark ' opensearch-node1: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/) image: opensearchproject/opensearch:2.14.0 container_name: opensearch-node1 hostname: opensearch-node1 environment: - cluster.name=opensearch-cluster # Name the cluster - node.name=opensearch-node1 # Name the node that will run in this container - cluster.initial_cluster_manager_nodes=opensearch-node1 # Nodes eligibile to serve as cluster manager - bootstrap.memory_lock=true # Disable JVM heap memory swapping - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD} # Sets the demo admin user password when using demo configuration (for OpenSearch 2.12 and later) ulimits: memlock: soft: -1 # Set memlock to unlimited (no soft or hard limit) hard: -1 nofile: soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536 hard: 65536 volumes: - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container #healthcheck: # test: curl -sku admin:${OPENSEARCH_INITIAL_ADMIN_PASSWORD} https://opensearch-node1:9200/_cat/health | grep -q opensearch-cluster # start_period: 10s # start_interval: 3s ports: - 9200:9200 # REST API - 9300:9300 # REST API - 9600:9600 # Performance Analyzer networks: - opensearch-net # All of the containers will join the same Docker bridge network opensearch-dashboards: image: opensearchproject/opensearch-dashboards:2.14.0 # Make sure the version of opensearch-dashboards matches the version of opensearch installed on other nodes container_name: opensearch-dashboards #depends_on: # opensearch-node1: # condition: service_healthy ports: - 5601:5601 # Map host port 5601 to container port 5601 expose: - "5601" # Expose port 5601 for web access to OpenSearch Dashboards environment: - OPENSEARCH_HOSTS=["https://opensearch-node1:9200"] networks: - opensearch-net volumes: opensearch-data1: networks: opensearch-net: ```

I initially also included an opensearch node in this machine because opensearch-benchmark allows its tests' output to be directed to an opensearch cluster for analysis/visualization, but later I opted by using the csv output instead.

Benchmark files:

In order to create benchmark files I downloaded my wazuh-alerts-* indices from an existing Wazuh installation, and consolidated them in a jsonfile. I trimmed the file to be 20MB in length and then created smaller versions of it with sizes ranging from 1-20MB.

This was done using a simple bash script:

generate_files.sh

```shell #!/bin/bash PREFIX="wazuh-alerts" SOURCE_FILE="$PREFIX-100.json" MB_TO_B=1048576 for i in {1..20} do MB_SIZE=$i SIZE=$(( $MB_SIZE * $MB_TO_B )) FILENAME="$PREFIX-$MB_SIZE.json" head -c $SIZE $SOURCE_FILE > $FILENAME sed -i '$ d' $FILENAME done ```

The rest of the configuration files for the benchmark were created using the following bash script (which admittedly is a little rough around the edges, but still works):

generate_config.sh

```shell #!/bin/bash PREFIX="wazuh-alerts" SOURCE_FILE="$PREFIX-20.json" MB_TO_B=1048576 CORPORAE=() TEST_PROCEDURES=() INDICES=() OPERATIONS=() PARALLEL_JOBS=4 SINGLE_BULK_TEST=() PARALLEL_BULK_TEST=() TASKS=() CLIENTS=2 for i in {01..20} do MB_SIZE=$i NAME="$PREFIX-$MB_SIZE" FILENAME="$NAME.json" DOCUMENT_COUNT=$(wc -l $FILENAME | cut -d' ' -f1) OPERATION_NAME="${MB_SIZE}MB-bulk" CORPORAE+=" { \"name\": \"$NAME\", \"documents\": [ { \"target-index\": \"$NAME\", \"source-file\": \"$FILENAME\", \"document-count\": $DOCUMENT_COUNT } ] }," SINGLE_BULK_TEST+=" { \"operation\": \"$OPERATION_NAME\", \"clients\": $CLIENTS }," OPERATIONS+=" { \"name\": \"$OPERATION_NAME\", \"operation-type\": \"bulk\", \"corpora\": \"$NAME\", \"bulk-size\": $DOCUMENT_COUNT }," INDICES+=" { \"name\": \"$NAME\", \"body\": \"${PREFIX}.json\" }," done SINGLE_BULK_TEST=${SINGLE_BULK_TEST%%,} TEST_PROCEDURES+=" { \"name\": \"single-bulk-index-test\", \"description\": \"Wazuh Alerts bulk index test\", \"default\": true, \"schedule\": [ ${SINGLE_BULK_TEST} ] }," for i in {01..05} do MB_SIZE=$i OPERATION_NAME="${MB_SIZE}MB-bulk" TASKS=() for j in $(seq --format="%02g" 1 ${PARALLEL_JOBS}) do TASKS+=" { \"name\": \"parallel-test-${i}-thread-${j}\", \"operation\": \"$OPERATION_NAME\", \"clients\": $CLIENTS }," done TASKS=${TASKS%%,} PARALLEL_BULK_TEST+=" { \"parallel\": { \"tasks\": [ ${TASKS} ] } }," done PARALLEL_BULK_TEST=${PARALLEL_BULK_TEST%%,} TEST_PROCEDURES+=" { \"name\": \"parallel-bulk-index-test\", \"description\": \"Test using ${PARALLEL_JOBS} parallel indexing operations\", \"schedule\": [ ${PARALLEL_BULK_TEST} ] }," CORPORAE=${CORPORAE%%,} OPERATIONS=${OPERATIONS%%,} TEST_PROCEDURES=${TEST_PROCEDURES%%,} INDICES=${INDICES%%,} OLDIFS=$IFS IFS=$'`' WORKLOAD=" {% import \"benchmark.helpers\" as benchmark with context %} { \"version\": 2, \"description\": \"Wazuh Indexer Bulk Benchmarks\", \"indices\": [ ${INDICES[@]} ], \"corpora\": [ ${CORPORAE[@]} ], \"operations\": [ {{ benchmark.collect(parts=\"operations/*.json\") }} ], \"test_procedures\": [ {{ benchmark.collect(parts=\"test_procedures/*.json\") }} ] } " mkdir -p ./operations mkdir -p ./test_procedures echo ${OPERATIONS[@]} > ./operations/default.json echo ${TEST_PROCEDURES[@]} > ./test_procedures/default.json echo ${WORKLOAD[@]} > ./workload.json IFS=$OLDIFS ```

This script generates a workload.json, an operations/default.json and a test_procedures/default.json files, necessary for the benchmark to run.

Tests

The nature of the benchmark itself can be assessed by looking at the output test_procedures/default.json file.

test_procedures/default.json

```json { "name": "single-bulk-index-test", "description": "Wazuh Alerts bulk index test", "default": true, "schedule": [ { "operation": "01MB-bulk", "clients": 2 }, { "operation": "02MB-bulk", "clients": 2 }, { "operation": "03MB-bulk", "clients": 2 }, { "operation": "04MB-bulk", "clients": 2 }, { "operation": "05MB-bulk", "clients": 2 }, { "operation": "06MB-bulk", "clients": 2 }, { "operation": "07MB-bulk", "clients": 2 }, { "operation": "08MB-bulk", "clients": 2 }, { "operation": "09MB-bulk", "clients": 2 }, { "operation": "10MB-bulk", "clients": 2 }, { "operation": "11MB-bulk", "clients": 2 }, { "operation": "12MB-bulk", "clients": 2 }, { "operation": "13MB-bulk", "clients": 2 }, { "operation": "14MB-bulk", "clients": 2 }, { "operation": "15MB-bulk", "clients": 2 }, { "operation": "16MB-bulk", "clients": 2 }, { "operation": "17MB-bulk", "clients": 2 }, { "operation": "18MB-bulk", "clients": 2 }, { "operation": "19MB-bulk", "clients": 2 }, { "operation": "20MB-bulk", "clients": 2 } ] }, { "name": "parallel-bulk-index-test", "description": "Test using 4 parallel indexing operations", "schedule": [ { "parallel": { "tasks": [ { "name": "parallel-test-01-thread-01", "operation": "01MB-bulk", "clients": 2 }, { "name": "parallel-test-01-thread-02", "operation": "01MB-bulk", "clients": 2 }, { "name": "parallel-test-01-thread-03", "operation": "01MB-bulk", "clients": 2 }, { "name": "parallel-test-01-thread-04", "operation": "01MB-bulk", "clients": 2 } ] } }, { "parallel": { "tasks": [ { "name": "parallel-test-02-thread-01", "operation": "02MB-bulk", "clients": 2 }, { "name": "parallel-test-02-thread-02", "operation": "02MB-bulk", "clients": 2 }, { "name": "parallel-test-02-thread-03", "operation": "02MB-bulk", "clients": 2 }, { "name": "parallel-test-02-thread-04", "operation": "02MB-bulk", "clients": 2 } ] } }, { "parallel": { "tasks": [ { "name": "parallel-test-03-thread-01", "operation": "03MB-bulk", "clients": 2 }, { "name": "parallel-test-03-thread-02", "operation": "03MB-bulk", "clients": 2 }, { "name": "parallel-test-03-thread-03", "operation": "03MB-bulk", "clients": 2 }, { "name": "parallel-test-03-thread-04", "operation": "03MB-bulk", "clients": 2 } ] } }, { "parallel": { "tasks": [ { "name": "parallel-test-04-thread-01", "operation": "04MB-bulk", "clients": 2 }, { "name": "parallel-test-04-thread-02", "operation": "04MB-bulk", "clients": 2 }, { "name": "parallel-test-04-thread-03", "operation": "04MB-bulk", "clients": 2 }, { "name": "parallel-test-04-thread-04", "operation": "04MB-bulk", "clients": 2 } ] } }, { "parallel": { "tasks": [ { "name": "parallel-test-05-thread-01", "operation": "05MB-bulk", "clients": 2 }, { "name": "parallel-test-05-thread-02", "operation": "05MB-bulk", "clients": 2 }, { "name": "parallel-test-05-thread-03", "operation": "05MB-bulk", "clients": 2 }, { "name": "parallel-test-05-thread-04", "operation": "05MB-bulk", "clients": 2 } ] } } ] } ```

There are two tests:

single-bulk-index-test
parallel-bulk-index-test

The first sequentially indexes data in 1 through 20MB bulks. The second one runs 4 parallel bulk indexing operations at a time, increasing the bulk size in 1MB increments.

Running the benchmark

In order to obtain a fair sample size from these tests, we considered using the iterations parameter but later found out it only really applies to read operations in general and has no effect on bulk indexing operations.

For that reason, I opted to simply launch the test repeatedly from the simplest of bash scripts:

benchmark.sh

```shell #!/bin/bash TEST="parallel-bulk-index-test" curl -sku admin:Password -XDELETE https://node-1:9200/wazuh-* curl -sku admin:Password -XPOST https://node-1:9200/_forcemerge for i in {1..100} do opensearch-benchmark execute-test --pipeline="benchmark-only" --workload-path="./benchmarks/wazuh-alerts" --target-hosts="https://node-1:9200,https://node-2:9200,https://node-3:9200" --client-options="basic_auth_user:admin,basic_auth_password:Password,verify_certs:false" --results-format csv --results-file ./${TEST}/results-$(date +%F-%T).csv --test-procedure=${TEST} curl -sku admin:Password -XDELETE https://node-1:9200/wazuh-* curl -sku admin:Password -XPOST https://node-1:9200/_forcemerge done TEST="single-bulk-index-test" for i in {1..100} do opensearch-benchmark execute-test --pipeline="benchmark-only" --workload-path="./benchmarks/wazuh-alerts" --target-hosts="https://node-1:9200,https://node2:9200,https://node3:9200" --client-options="basic_auth_user:admin,basic_auth_password:Password,verify_certs:false" --results-format csv --results-file ./${TEST}/results-$(date +%F-%T).csv --test-procedure=${TEST} curl -sku admin:Password -XDELETE https://node-1:9200/wazuh-* curl -sku admin:Password -XPOST https://node-1:9200/_forcemerge done ```

This script simply runs all the benchmarks in a loop and outputs their results to a csv file. After each pass, it deletes all the indices it created and forces a merge to clean the state of the cluster.

Results

The results are dumped and plotted to the team's drive:

https://docs.google.com/spreadsheets/d/1SDRahlnqHvNvi-_bHKSzqNpL2qVu0zc1pOXAc8eZh10/edit?usp=sharing

In the graphs above, the y axis holds the number of indexed documents per second, and the x axis the size of each bulk operation in MB for each test.

These are results are averaged from the output of running the tests 30 times, but the results for each pass don't vary a lot. So it seems that increasing the bulk size increases the throughput until we hit diminishing returns.

We chose to run this up to 20MB because it is recommended to keep bulk indexing operations below 15MB.

AlexRuiz7 commented 2 months ago

Results

We ran more benchmark tests for single and parallel bulks. The most representative data set runs an OpenSearch Benchmark workload using 1 client and 4 parallel bulk tasks, summing up to (bulk_size * threads) MB concurrently, measuring the Mean Throughput in average after 100 runs. The infrastructure uses 3 nodes of Wazuh Indexer in cluster mode, v4.8.0, using the default wazuh-alerts template: 3 primary shards and 1 replica shard.

In the charts below, we can see a clear comparison between using a single bulk request vs parallel bulks

Single bulk request Average Sum (docs_sec) vs Bulk size (MB) (1)

Parallel bulks (4) Average Sum (docs_sec) vs Bulk size (MB)

Conclusions

The parallel bulk request scenario has proven to return way higher metrics. In the table below, we can see the performance boost in ingestion metrics (ingested documents per second) parallelizing 4 bulk request vs using a single bulk request. The difference is substantial, while we can see that the performance gain tends to drop as we increase the bulk size.

On the other hand, the table shows that the trend line is strictly increasing, which demonstrates the Indexer is able to ingest more documents per second by increasing the bulk size and or increasing the parallel requests. However, we decided to stop further analysis past the 20 MB bulk size, as it's above the recommended settings by Elastic and OpenSearch. Using values higher than 15 MB is not recommended as it can make the cluster unstable. Preliminary analysis shown that we can increase this number until 50 MB. At this point, the Indexer stops responding.

Parallel / Single boost
674.77%
557.23%
410.91%

Parallel vs Single bulk requests

For best tradeoff between performance and stability, we recommend not passing the 15 MB threshold per bulk request. It's also important to note that the bulk size depends on the number of document and their size:

1,000 documents at 1 KB each is 1 MB.
1,000 documents at 100 KB each is 100 MB.

Also, the client should make sure that bulk requests are round-robined across all the data nodes, to prevent a single node from storing all the bulks in memory while processing.

References:

wazuh / wazuh-indexer