[ERROR] Cannot execute_test. Error in worker_coordinator ('old') (Azul GC / Zing JDK)

rudziankou commented 2 years ago

Hi folks, I ran the benchmark against an existing OpenSearch 2.2.1 cluster and got the following error:

2022-10-25 17:24:05,778 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO Telling worker_coordinator to start benchmark.
2022-10-25 17:24:05,779 ActorAddr-(T|:53454)/PID:51383 osbenchmark.worker_coordinator.worker_coordinator INFO Benchmark is about to start.
2022-10-25 17:24:05,780 ActorAddr-(T|:53454)/PID:51383 osbenchmark.worker_coordinator.worker_coordinator INFO Attaching cluster-level telemetry devices.
2022-10-25 17:24:06,801 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO Received a benchmark failure from [ActorAddr-(T|:53454)] and will forward it now.
2022-10-25 17:24:06,699 ActorAddr-(T|:53454)/PID:51383 osbenchmark.telemetry INFO JvmStatsSummary on benchmark start
2022-10-25 17:24:06,783 ActorAddr-(T|:53454)/PID:51383 osbenchmark.actor ERROR Error in worker_coordinator
Traceback (most recent call last):
78e55e46-f20f-455e-8969-2d1685e64167
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/actor.py", line 92, in guard
    return f(self, msg, sender)
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 265, in receiveMsg_StartBenchmark
    self.coordinator.start_benchmark()
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 657, in start_benchmark
    self.telemetry.on_benchmark_start()
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/telemetry.py", line 85, in on_benchmark_start
    device.on_benchmark_start()
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/telemetry.py", line 1381, in on_benchmark_start
    self.jvm_stats_per_node = self.jvm_stats()
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/telemetry.py", line 1440, in jvm_stats
    old_gen_collection_time = gc["old"]["collection_time_in_millis"]
KeyError: 'old'

2022-10-25 17:24:06,797 ActorAddr-(T|:53454)/PID:51383 osbenchmark.actor INFO A workload preparator has exited.
2022-10-25 17:24:06,805 -not-actor-/PID:51345 osbenchmark.test_execution_orchestrator ERROR A benchmark failure has occurred
2022-10-25 17:24:06,806 -not-actor-/PID:51345 osbenchmark.test_execution_orchestrator INFO Telling benchmark actor to exit.
2022-10-25 17:24:06,807 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO BenchmarkActor received unknown message [ActorExitRequest] (ignoring).
2022-10-25 17:24:06,808 ActorAddr-(T|:53454)/PID:51383 osbenchmark.actor INFO Main worker_coordinator received ActorExitRequest and will terminate all load generators.
2022-10-25 17:24:06,810 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:53454)] (ignoring).
2022-10-25 17:24:06,809 ActorAddr-(T|:53451)/PID:51382 osbenchmark.actor INFO BuilderActor#receiveMessage unrecognized(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|:53432)])
2022-10-25 17:24:06,810 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:53451)] (ignoring).
2022-10-25 17:24:09,812 -not-actor-/PID:51345 osbenchmark.Benchmark INFO Attempting to shutdown internal actor system.
2022-10-25 17:24:09,819 -not-actor-/PID:51362 root INFO ActorSystem Logging Shutdown
2022-10-25 17:24:09,843 -not-actor-/PID:51361 root INFO ---- Actor System shutdown
2022-10-25 17:24:09,846 -not-actor-/PID:51345 osbenchmark.benchmark INFO Actor system is still running. Waiting...
2022-10-25 17:24:10,853 -not-actor-/PID:51345 osbenchmark.benchmark INFO Shutdown completed.
2022-10-25 17:24:10,854 -not-actor-/PID:51345 osbenchmark.benchmark ERROR Cannot run subcommand [execute_test].
Traceback (most recent call last):
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/benchmark.py", line 893, in dispatch_sub_command
    execute_test(cfg, args.kill_running_processes)
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/benchmark.py", line 661, in execute_test
    with_actor_system(test_execution_orchestrator.run, cfg)
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/benchmark.py", line 688, in with_actor_system
    runnable(cfg)
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 379, in run
    raise e
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 376, in run
    pipeline(cfg)
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 69, in __call__
    self.target(cfg)
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 314, in benchmark_only
    return execute_test(cfg, external=True)
  File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 273, in execute_test
    raise exceptions.BenchmarkError(result.message, result.cause)
osbenchmark.exceptions.BenchmarkError: Error in worker_coordinator ('old')

Here is the command I ran:

opensearch-benchmark execute_test --workload nyc_taxis --pipeline benchmark-only --target-hosts "host01:8900" --client-options "verify_certs:false,use_ssl:true,basic_auth_user:admin,basic_auth_password:admin"

It looks like the issue is in the code linked below, which expects collector names that do not match what _nodes/stats returns on this cluster: https://github.com/opensearch-project/opensearch-benchmark/blob/main/osbenchmark/telemetry.py#L1440-L1443

_nodes/stats output:

"jvm": {
  "timestamp": 1666719053364,
  "uptime_in_millis": 1712255684,
  "mem": {
    "heap_used_in_bytes": 13321109504,
    "heap_used_percent": 79,
    "heap_committed_in_bytes": 16682844160,
    "heap_max_in_bytes": 16682844160,
    "non_heap_used_in_bytes": 3457548288,
    "non_heap_committed_in_bytes": 3457548288,
    "pools": {}
  },
  "threads": {
    "count": 67,
    "peak_count": 70
  },
  "gc": {
    "collectors": {
      "GPGC New": {
        "collection_count": 85,
        "collection_time_in_millis": 6904
      },
      "GPGC Old": {
        "collection_count": 85,
        "collection_time_in_millis": 52220
      }
    }
  },
  "buffer_pools": {
    "mapped": {
      "count": 0,
      "used_in_bytes": 0,
      "total_capacity_in_bytes": 0
    },
    "direct": {
      "count": 20,
      "used_in_bytes": 8463983,
      "total_capacity_in_bytes": 8463982
    }
  },
  "classes": {
    "current_loaded_count": 20123,
    "total_loaded_count": 22803,
    "total_unloaded_count": 2680
  }
}
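For anyone who wants to reproduce this without a Zing cluster, here is a minimal sketch. It copies only the "gc" section from the output above and mimics the lookup shown in the traceback. The function name old_gen_time_strict is made up for illustration and is not part of osbenchmark.

```python
# Trimmed copy of the "gc" section from the _nodes/stats output above.
jvm_stats = {
    "gc": {
        "collectors": {
            "GPGC New": {"collection_count": 85, "collection_time_in_millis": 6904},
            "GPGC Old": {"collection_count": 85, "collection_time_in_millis": 52220},
        }
    }
}


def old_gen_time_strict(jvm):
    """Hypothetical stand-in for the failing lookup: assumes a collector named 'old'."""
    gc = jvm["gc"]["collectors"]
    return gc["old"]["collection_time_in_millis"]


missing_key = None
try:
    old_gen_time_strict(jvm_stats)
except KeyError as err:
    # The node reports "GPGC Old", so the hard-coded "old" key is absent.
    missing_key = err.args[0]

print("missing collector key:", missing_key)
```

The lookup fails on the key name alone; the old-generation numbers are present in the payload, just under "GPGC Old".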

@IanHoang @travisbenedict guys, could you please check?

rudziankou commented 1 year ago

I deployed a single-node cluster locally and compared the /_nodes/stats/jvm query outputs.

Existing cluster:

"gc": {
  "collectors": {
    "GPGC New": {
      "collection_count": 6163,
      "collection_time_in_millis": 1739134
    },
    "GPGC Old": {
      "collection_count": 1095,
      "collection_time_in_millis": 1544272
    }
  }
}

New single-node local cluster:

"gc": {
  "collectors": {
    "young": {
      "collection_count": 9,
      "collection_time_in_millis": 203
    },
    "old": {
      "collection_count": 0,
      "collection_time_in_millis": 0
    }
  }
}

The existing cluster is running on Zing JDK. The local cluster is running on OpenJDK. Zing reports its GC collectors under different names ("GPGC New" and "GPGC Old" instead of "young" and "old"), which is why Benchmark fails against the existing cluster.
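One possible way to tolerate both naming schemes is sketched below. This is not the fix adopted in OSB. It assumes that a collector whose name ends in "New" or "young" is the young generation and one ending in "old" is the old generation, which holds for the two outputs above but is only a guess for other JVMs. The function names generation_of and gc_times_by_generation are hypothetical.

```python
def generation_of(collector_name):
    """Guess the generation from the last word of a collector name.

    Returns "young", "old", or None when the name gives no hint.
    """
    words = collector_name.strip().lower().split()
    last_word = words[-1] if words else ""
    if last_word in ("young", "new"):
        return "young"
    if last_word == "old":
        return "old"
    return None


def gc_times_by_generation(collectors):
    """Sum collection_time_in_millis per generation, skipping unknown collectors."""
    totals = {"young": 0, "old": 0}
    for name, stats in collectors.items():
        generation = generation_of(name)
        if generation is not None:
            totals[generation] += stats.get("collection_time_in_millis", 0)
    return totals


zing = {
    "GPGC New": {"collection_count": 6163, "collection_time_in_millis": 1739134},
    "GPGC Old": {"collection_count": 1095, "collection_time_in_millis": 1544272},
}
openjdk = {
    "young": {"collection_count": 9, "collection_time_in_millis": 203},
    "old": {"collection_count": 0, "collection_time_in_millis": 0},
}

print(gc_times_by_generation(zing))
print(gc_times_by_generation(openjdk))
```

Collectors that match neither pattern are silently skipped here; a real implementation would probably want to log them instead.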

IanHoang commented 1 year ago

Thanks for bringing this to our attention. Just to clarify, you're running an OpenSearch 2.2.1 cluster on Zing JDK and you're receiving the following error?

old_gen_collection_time = gc["old"]["collection_time_in_millis"]

KeyError: 'old'

However, when you run against another local cluster on OpenSearch 2.2.1 with OpenJDK, you do not experience any issues? We have another issue open (#242) that reports the same failure, because OSB currently does not support Shenandoah GC, which does not have concepts of "old, new, or permanent GC".

IanHoang commented 1 year ago

I'm not familiar with Azul Zing JDK, but at a quick glance it looks like it provides an alternative (pauseless) GC compared to OpenJDK's G1GC. This confirms that the issue is similar to #242 and should be regarded as an enhancement rather than a bug, because OSB currently supports only GCs that have concepts of old / young generations, such as G1GC and CMS.
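If this is handled as an enhancement, one generation-agnostic approach would be to record whatever collectors the node reports instead of assuming "young" and "old". The sketch below is only an illustration of that idea. It assumes the telemetry device takes one snapshot at benchmark start and another at the end, and the "end" numbers are invented for the example. The names snapshot and gc_deltas are hypothetical, not OSB functions.

```python
def snapshot(collectors):
    """Capture (count, time_ms) per collector, keyed by the name the node reports."""
    return {
        name: (
            stats.get("collection_count", 0),
            stats.get("collection_time_in_millis", 0),
        )
        for name, stats in collectors.items()
    }


def gc_deltas(start, end):
    """Per-collector difference between two snapshots.

    A collector missing from the start snapshot is treated as starting at zero.
    """
    deltas = {}
    for name, (end_count, end_time) in end.items():
        start_count, start_time = start.get(name, (0, 0))
        deltas[name] = {
            "count": end_count - start_count,
            "time_ms": end_time - start_time,
        }
    return deltas


# Start values are taken from the _nodes/stats output in this thread.
start = snapshot({
    "GPGC New": {"collection_count": 85, "collection_time_in_millis": 6904},
    "GPGC Old": {"collection_count": 85, "collection_time_in_millis": 52220},
})
# End values are invented for illustration only.
end = snapshot({
    "GPGC New": {"collection_count": 90, "collection_time_in_millis": 7100},
    "GPGC Old": {"collection_count": 86, "collection_time_in_millis": 52900},
})

print(gc_deltas(start, end))
```

Because nothing here depends on collector names, the same code would also cover collectors without generations, at the cost of metrics whose names differ from one JVM to another.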

Reference: opensearch-project/opensearch-benchmark, issue #206