[Bug]: OSB Failing Intermittently on Elasticsearch 7.10 - OSB 1.8.0

AndreKurait commented 2 months ago

Describe the bug

We're seeing intermittent failures in our GitHub Actions workflows that run OSB 1.8.0 against Elasticsearch 7.10.

2024-09-09 16:08:17 - INFO - Running opensearch-benchmark with 'nyc_taxis' workload
2024-09-09 16:08:17 - INFO - Executing command: opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=https://capture-proxy:9200 --workload=nyc_taxis --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params=target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1 --client-options=verify_certs:false,basic_auth_user:admin,basic_auth_password:********

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: 6ee4f874-7bc4-4968-b1b7-192c1a54916a
[INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.
Error:  Cannot execute-test. Worker [0] has exited prematurely.

Getting further help:
*********************
* Check the log files in /root/.benchmark/logs for errors.
* Read the documentation at https://opensearch.org/docs.
* Ask a question on the forum at https://forum.opensearch.org/.
* Raise an issue at https://github.com/opensearch-project/OpenSearch-Benchmark/issues and include the log files in /root/.benchmark/logs.

To reproduce

Run OSB with the following commands. Occasionally the last execution fails; the logs are attached (benchmark.log).

pipenv run opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=geonames --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1"  --client-options=$client_options &&
echo "Running opensearch-benchmark w/ 'http_logs' workload..." &&
pipenv run opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=http_logs --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" --client-options=$client_options &&
echo "Running opensearch-benchmark w/ 'nested' workload..." &&
pipenv run opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=nested --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1"  --client-options=$client_options &&
echo "Running opensearch-benchmark w/ 'nyc_taxis' workload..." &&
pipenv run opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=nyc_taxis --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1"  --client-options=$client_options

Expected behavior

OSB succeeds

Host / Environment

GitHub Actions (Ubuntu), opensearch-benchmark 1.8.0

Relevant log output

2024-09-09 16:08:20,381 ActorAddr-(T|:36659)/PID:761 osbenchmark.actor INFO Worker[0] is executing tasks at index [3].
2024-09-09 16:08:20,395 -not-actor-/PID:701 osbenchmark.test_execution_orchestrator ERROR A benchmark failure has occurred
2024-09-09 16:08:20,396 -not-actor-/PID:701 osbenchmark.test_execution_orchestrator INFO Telling benchmark actor to exit.
2024-09-09 16:08:20,383 ActorAddr-(T|:36659)/PID:761 osbenchmark.client INFO Creating OpenSearch client connected to [{'host': 'capture-proxy', 'port': 9200, 'use_ssl': True}] with options [{'verify_certs': False, 'basic_auth_user': 'admin', 'basic_auth_password': '*****', 'max_connections': 1}]
2024-09-09 16:08:20,398 ActorAddr-(T|:43115)/PID:729 osbenchmark.actor INFO BuilderActor#receiveMessage unrecognized(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|:34845)])
2024-09-09 16:08:20,392 ActorAddr-(T|:34845)/PID:710 osbenchmark.actor INFO Received a benchmark failure from [ActorAddr-(T|:37323)] and will forward it now.
2024-09-09 16:08:20,391 ActorAddr-(T|:37323)/PID:730 osbenchmark.actor ERROR Worker [0] has exited prematurely. Aborting benchmark.
2024-09-09 16:08:20,383 ActorAddr-(T|:36659)/PID:761 osbenchmark.client INFO SSL support: off
2024-09-09 16:08:20,383 ActorAddr-(T|:36659)/PID:761 osbenchmark.client INFO HTTP basic authentication: on
2024-09-09 16:08:20,384 ActorAddr-(T|:36659)/PID:761 osbenchmark.client INFO HTTP compression: off
2024-09-09 16:08:20,384 ActorAddr-(T|:36659)/PID:761 osbenchmark.worker_coordinator.worker_coordinator INFO Task assertions enabled: False
2024-09-09 16:08:20,385 ActorAddr-(T|:36659)/PID:761 osbenchmark.worker_coordinator.worker_coordinator INFO Choosing [unthrottled] for [create-index].
2024-09-09 16:08:20,385 ActorAddr-(T|:36659)/PID:761 osbenchmark.worker_coordinator.worker_coordinator INFO Creating iteration-count based schedule with [None] distribution for [create-index] with [0] warmup iterations and [1] iterations.
2024-09-09 16:08:20,385 ActorAddr-(T|:36659)/PID:761 osbenchmark.worker_coordinator.worker_coordinator INFO iteration-count-based schedule will determine when the schedule for [create-index] terminates.
2024-09-09 16:08:20,397 ActorAddr-(T|:34845)/PID:710 osbenchmark.actor INFO BenchmarkActor received unknown message [ActorExitRequest] (ignoring).
2024-09-09 16:08:20,417 ActorAddr-(T|:37323)/PID:730 osbenchmark.actor INFO Main worker_coordinator received ActorExitRequest and will terminate all load generators.
2024-09-09 16:08:20,415 ActorAddr-(T|:34845)/PID:710 osbenchmark.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:43115)] (ignoring).
2024-09-09 16:08:20,418 ActorAddr-(T|:34845)/PID:710 osbenchmark.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:37323)] (ignoring).
2024-09-09 16:08:23,399 -not-actor-/PID:701 osbenchmark.benchmark INFO Attempting to shutdown internal actor system.
2024-09-09 16:08:23,400 -not-actor-/PID:709 root INFO ActorSystem Logging Shutdown
2024-09-09 16:08:23,421 -not-actor-/PID:708 root INFO ---- Actor System shutdown
2024-09-09 16:08:23,421 -not-actor-/PID:701 osbenchmark.benchmark INFO Actor system is still running. Waiting...
2024-09-09 16:08:24,421 -not-actor-/PID:701 osbenchmark.benchmark INFO Shutdown completed.
2024-09-09 16:08:24,422 -not-actor-/PID:701 osbenchmark.benchmark ERROR Cannot run subcommand [execute-test].
Traceback (most recent call last):
  File "/.venv/lib64/python3.11/site-packages/osbenchmark/benchmark.py", line 931, in dispatch_sub_command
    execute_test(cfg, args.kill_running_processes)
  File "/.venv/lib64/python3.11/site-packages/osbenchmark/benchmark.py", line 690, in execute_test
    with_actor_system(test_execution_orchestrator.run, cfg)
  File "/.venv/lib64/python3.11/site-packages/osbenchmark/benchmark.py", line 717, in with_actor_system
    runnable(cfg)
  File "/.venv/lib64/python3.11/site-packages/osbenchmark/test_execution_orchestrator.py", line 381, in run
    raise e
  File "/.venv/lib64/python3.11/site-packages/osbenchmark/test_execution_orchestrator.py", line 378, in run
    pipeline(cfg)
  File "/.venv/lib64/python3.11/site-packages/osbenchmark/test_execution_orchestrator.py", line 69, in __call__
    self.target(cfg)
  File "/.venv/lib64/python3.11/site-packages/osbenchmark/test_execution_orchestrator.py", line 314, in benchmark_only
    return execute_test(cfg, external=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib64/python3.11/site-packages/osbenchmark/test_execution_orchestrator.py", line 273, in execute_test
    raise exceptions.BenchmarkError(result.message, result.cause)
osbenchmark.exceptions.BenchmarkError: Worker [0] has exited prematurely.

IanHoang commented 1 month ago

@AndreKurait Some questions:

Is there a reason why you're using the 1 branch instead of the 7 branch of the OSB workloads?
Have you isolated the commands and run them individually? When run individually, do you still see the issue show up?

This error can occur if there is improper setup or conflicting worker processes.

AndreKurait commented 1 month ago

@IanHoang, which parameter do you mean by branch? Do you mean bulk_indexing_clients:1?

Is there a different way to specify these 4 workloads?

IanHoang commented 1 month ago

> @IanHoang, which parameter do you mean by branch, do you mean bulk_indexing_clients:1?
>
> Is there a different way to specify these 4 workloads?

When you provide the --distribution-version parameter, you are telling OSB which version of OpenSearch the test cluster runs. Based on the major version, OSB fetches the workload code associated with that major version. For example, if you are running tests against OpenSearch 2.16, OSB can detect the major version automatically, or you can specify --distribution-version=2.16.0. Either way, OSB recognizes that the major version of the test cluster is 2 and selects workloads from this branch of the official workloads: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/2
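
To make the mapping concrete, here is a rough sketch of the major-version rule described above. The function name `workload_branch` is hypothetical and this is not OSB's actual implementation; it only shows why --distribution-version=1.0.0 points at the 1 branch, while a 7.x version string would point at the 7 branch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: illustrates the "major version -> workloads branch"
# rule only. It is NOT how OSB resolves branches internally.
workload_branch() {
  local version="$1"
  # Drop everything from the first dot onward: "7.10.2" -> "7".
  printf '%s\n' "${version%%.*}"
}

workload_branch "1.0.0"    # value passed in the commands from this issue
workload_branch "7.10.2"   # a 7.x version string
workload_branch "2.16.0"   # the OpenSearch 2.16 example
```

Under this assumption, the three calls print 1, 7, and 2 respectively.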

By isolating the commands, I meant: have you run them individually, rather than chaining them with && as you do now?
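
One possible way to isolate them, as a minimal sketch: run each workload as its own step and record which ones fail, so a single failure does not hide the rest. The names `run_workload`, `run_all`, and `OSB_CMD` are hypothetical; the flags are copied from the commands in the issue, and `$endpoint` and `$client_options` are assumed to be set as in the original script.

```shell
#!/usr/bin/env bash
# Sketch: run each workload separately and collect failures, instead of
# chaining with && (which stops at the first failing command).
# run_workload, run_all and OSB_CMD are hypothetical names for this sketch.

OSB_CMD="${OSB_CMD:-pipenv run opensearch-benchmark}"

run_workload() {
  local workload="$1"
  echo "Running opensearch-benchmark w/ '${workload}' workload..."
  # $OSB_CMD is intentionally unquoted so it splits into command + args.
  $OSB_CMD execute-test \
    --distribution-version=1.0.0 \
    --target-host="$endpoint" \
    --workload="$workload" \
    --pipeline=benchmark-only \
    --test-mode \
    --kill-running-processes \
    --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" \
    --client-options="$client_options"
}

run_all() {
  local failed="" workload
  for workload in geonames http_logs nested nyc_taxis; do
    # Keep going after a failure so every workload is exercised.
    if ! run_workload "$workload"; then
      failed="$failed $workload"
    fi
  done
  if [ -n "$failed" ]; then
    echo "Failed workloads:$failed"
    return 1
  fi
  echo "All workloads passed"
}
```

Calling `run_all` at the end of the script (or running `run_workload nyc_taxis` on its own) would show whether the failure follows one workload or only appears after the others have run.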

IanHoang commented 1 week ago

@AndreKurait Is this issue still occurring? Closing this issue for now. Feel free to reopen if needed.
