Describe the bug
This bug may be a duplicate of #318. I am reporting it separately because I want to add all the details I've collected, and it is easier to do that here than in a comment.
I am running the vectorsearch workload with the lucene engine. Note that I did not set target_index_force_merge_timeout.
I have two different deployments stuck on opensearch-benchmark as follows:
It has been in this state for several hours (likely 12+ hours by now).
I don't believe this is an issue with the cluster. First, the cluster indices are healthy (green). Also, I can see that the forcemerge task is gone from _tasks, and _cat/segments shows that the segment count shrank from 522 down to a handful. A rolling restart of the cluster also does not seem to affect the opensearch-benchmark run above. (A rough sketch of these checks is shown below.)
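For reference, the checks above boil down to something like the following sketch using the opensearch-py client; the endpoint and credentials are placeholders for my environment, not something the workload defines:

# Rough sketch of the health / forcemerge / segment checks described above.
# Endpoint and credentials are placeholders for my environment.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"],
                    http_auth=("admin", "admin"),
                    verify_certs=False)

# 1) Cluster health should be green.
print(client.cluster.health()["status"])

# 2) Any force-merge task still running? (empty once the merge is done)
print(client.tasks.list(actions="indices:admin/forcemerge*", detailed=True))

# 3) Segment count for the target index (it dropped from 522 to a handful).
print(client.cat.segments(index="target_index", v=True))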
API Results
Here is the current state of the cluster. Originally there was a forcemerge task, as well as hundreds of segments linked to target_index. It seems to me that forcemerge has finished and the segment count has shrunk to a few tens.
Full task list:
Troubleshooting
First, I tried tcpdump and strace.
tcpdump showed some low-throughput traffic, composed solely of small, 66-byte ACK packets; I'd say these are keep-alive signals.
strace shows:
And the output pretty much hangs on that line.
To troubleshoot further, I enabled ddeb packages on my benchmark server and installed python3-dbg, python3-dev, and gdb.
Then, attaching GDB to any of the opensearch-benchmark processes, I can see:
...
Reading symbols from /home/ubuntu/.local/lib/python3.10/site-packages/ijson/backends/_yajl2.cpython-310-x86_64-linux-gnu.so...
Reading symbols from /home/ubuntu/.local/lib/python3.10/site-packages/ijson/backends/../../ijson.libs/libyajl-d141338e.so.2.1.0...
(No debugging symbols found in /home/ubuntu/.local/lib/python3.10/site-packages/ijson/backends/../../ijson.libs/libyajl-d141338e.so.2.1.0)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007c811231b63d in __GI___select (nfds=nfds@entry=6, readfds=readfds@entry=0x7ffe9ddebfd0, writefds=writefds@entry=0x7ffe9ddec050, exceptfds=exceptfds
@entry=0x7ffe9ddec0d0, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/select.c:69
69 ../sysdeps/unix/sysv/linux/select.c: No such file or directory.
...
(gdb) py-bt
Traceback (most recent call first):
<built-in method select of module object at remote 0x7c8111ae2d40>
File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/system/transport/TCPTransport.py", line 1151, in _runWithExpiry
rrecv, rsend, rerr = select.select(wrecv, wsend,
File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/system/transport/wakeupTransportBase.py", line 80, in _run_subtransport
rval = self._runWithExpiry(incomingHandler)
File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/system/transport/wakeupTransportBase.py", line 71, in run
rval = self._run_subtransport(incomingHandler, max_runtime)
File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/system/systemBase.py", line 139, in _run_transport
r = self.transport.run(TransmitOnly if txonly else incomingHandler,
File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/system/systemBase.py", line 264, in ask
response = self._run_transport(remTime.remaining())
File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/actors.py", line 738, in ask
return self._systemBase.ask(actorAddr, msg, timeout)
File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/test_execution_orchestrator.py", line 265, in execute_test
result = actor_system.ask(benchmark_actor, Setup(cfg, sources, distribution, external, docker))
File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/test_execution_orchestrator.py", line 314, in benchmark_only
return execute_test(cfg, external=True)
File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/test_execution_orchestrator.py", line 69, in __call__
self.target(cfg)
File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/test_execution_orchestrator.py", line 378, in run
pipeline(cfg)
File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/benchmark.py", line 711, in with_actor_system
runnable(cfg)
File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/benchmark.py", line 684, in execute_test
with_actor_system(test_execution_orchestrator.run, cfg)
File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/benchmark.py", line 924, in dispatch_sub_command
execute_test(cfg, args.kill_running_processes)
File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/benchmark.py", line 1004, in main
success = dispatch_sub_command(arg_parser, args, cfg)
File "/home/ubuntu/.local/bin/opensearch-benchmark", line 8, in <module>
sys.exit(main())
(gdb)
All the processes involved in the benchmark are stuck in the same step.
Some Early Conclusions
It seems the main issue is the hang in the Thespian logic: it is waiting on a select() call that never returns.
What I'd propose is the following: we should always apply a timeout to API calls, even if the user does not define one. If the user did not explicitly set a timeout, we use a default value (e.g. 1 hour); once that timeout is hit, we rerun the same actor.ask (logging a warning that a Thespian timeout happened). That will force the benchmark to refresh its connection with OpenSearch. A rough sketch of the idea is shown below.
I could not get to the bottom of why the cluster did not provide an answer here, or, if it did, why that answer did not "bubble up" to the benchmark.
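As a very rough sketch of that idea (the helper name ask_with_default_timeout is made up, not existing osbenchmark code, and it assumes Thespian's ActorSystem.ask() returns None when its timeout expires):

import logging

DEFAULT_ASK_TIMEOUT = 3600  # seconds; only used when the caller sets no timeout

def ask_with_default_timeout(actor_system, actor, msg, timeout=None, max_retries=3):
    # Thespian's ask() returns None once the timeout expires, so treat None
    # as "no answer yet", log a warning, and re-send the same message.
    effective = timeout or DEFAULT_ASK_TIMEOUT
    for attempt in range(1, max_retries + 1):
        result = actor_system.ask(actor, msg, effective)
        if result is not None:
            return result
        logging.warning("No reply from actor within %ss (attempt %d/%d); re-sending.",
                        effective, attempt, max_retries)
    raise TimeoutError("actor did not answer after %d attempts" % max_retries)

In execute_test() this would wrap the bare actor_system.ask(benchmark_actor, Setup(...)) call visible in the traceback above.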
To reproduce
Deploy a 3-node OpenSearch cluster on AWS, on 3x i4i.2xlarge instances with 500G gp3 volumes. In my case, I am using https://juju.is/ as my automation; I will defer to this post on how to set Juju up. The deployment is achieved with:
Then, on a separate Ubuntu 22.04 machine (i4i.2xlarge, 250G disk), I run the benchmark, installed directly with pip.
I also made two manual changes:
1) On each OpenSearch unit, increase the JVM memory footprint to -Xms32g -Xmx32g.
2) On the benchmark host, in ~/.benchmark/benchmarks/workloads/default/vectorsearch/indices/lucene-index.json, I inserted:
Expected behavior
Not to hang: the benchmark should gracefully detect that the cluster has finished the forcemerge and end its activities. A rough sketch of what that detection could look like is shown below.
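For example, only as a sketch (the helper name and polling interval are illustrative, and an opensearch-py client is assumed), the benchmark could poll the tasks API until the force-merge action disappears instead of blocking indefinitely on a single request:

import time

def wait_for_force_merge(client, poll_interval=30, max_wait=12 * 3600):
    # Poll the tasks API until no force-merge task remains, rather than
    # waiting forever on one blocking call.
    deadline = time.time() + max_wait
    while time.time() < deadline:
        tasks = client.tasks.list(actions="indices:admin/forcemerge*", detailed=True)
        if not any(node.get("tasks") for node in tasks.get("nodes", {}).values()):
            return  # merge finished; the benchmark can wrap up
        time.sleep(poll_interval)
    raise TimeoutError("force merge still running after max_wait")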
Screenshots
--
Host / Environment
Ubuntu 22.04 for both the benchmark and OpenSearch machines
Python 3.10
OpenSearch 2.14.0
opensearch-benchmark 1.6.0
AWS i4i.2xlarge hosts running on top of gp3 volumes for their data
Additional context
No response
Relevant log output