
[Bug]: Vectorsearch (Cohere 10M) hangs on forcemerge #550

Open phvalguima opened 5 months ago

phvalguima commented 5 months ago

Describe the bug

This bug may be a duplicate of #318. I am reporting it separately because I want to add all the details I've collected, and that will be easier to do here than in a comment.

I am running the vectorsearch workload with the lucene engine. Note that I did not set target_index_force_merge_timeout.

I have two different deployments where opensearch-benchmark is stuck as follows:

Running delete-target-index                                                    [100% done]
Running create-target-index                                                    [100% done]
Running custom-vector-bulk                                                     [100% done]
Running refresh-target-index                                                   [100% done]
Running refresh-target-index-before-force-merge                                [100% done]
Running force-merge-segments                                                   [  0% done]

It has been in this state for several hours (likely 12+ hours now).

I don't believe this is an issue with the cluster. First, the cluster indices are healthy (green state). Also, the forcemerge task is gone from _tasks, and _cat/segments shows that the segment count shrank from 522 down to a handful (checks sketched below).
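
For reference, the kind of checks this is based on (a sketch; the endpoint and credentials are abbreviated the same way as in the outputs below):

# Look for any remaining forcemerge tasks (nothing is listed at this point)
curl ... '/_tasks?actions=*forcemerge*&detailed=true&pretty'
# Count the segments backing target_index
curl ... '/_cat/segments/target_index' | wc -l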

Also, performing a rolling restart of the cluster does not seem to affect the opensearch-benchmark run shown above.

API Results

Here is the current state of the cluster. Originally there was a forcemerge task, as well as hundreds of segments belonging to target_index. It seems to me that the forcemerge finished its work and the segments have now shrunk to a few tens.

ubuntu@ip-192-168-235-151:~$ curl  ... '/_cat/health'
1718359051 09:57:31 opensearch-en48 green 3 3 true 35 12 0 0 0 0 - 100.0%

ubuntu@ip-192-168-235-151:~$ curl  ... '/_cat/indices'
green open target_index              KQrdrXGURYeiMJWZ3JqUKg 6 2 10000000 0 509.6gb 169.8gb
green open .opensearch-observability pSyKQxCSTe6DKa3IhxLKCA 1 2        0 0    624b    208b
green open .plugins-ml-config        poLCogGjSMer9mY1VDp-5Q 1 2        1 0  11.8kb   3.9kb
green open .opendistro_security      jroNQO51T0SLQVeyrE4vQg 1 2       10 2 163.3kb  50.8kb
green open .charm_node_lock          nLdXwf6LR9KAlAn_3DKp4A 1 2        0 0    684b    228b
green open .tasks                    hXoUZSA7STmY8Fi4CJWeoA 1 1        1 0  14.4kb   7.2kb
ubuntu@ip-192-168-235-151:~$ curl  ... '/_cat/segments'
target_index         0 r 192.168.235.57  _120 1368 1665540 0 28.2gb 0 true true 9.10.0 false
target_index         0 p 192.168.235.152 _11z 1367 1665540 0 28.2gb 0 true true 9.10.0 false
target_index         0 r 192.168.235.16  _129 1377 1665540 0 28.2gb 0 true true 9.10.0 false
target_index         1 r 192.168.235.57  _11i 1350 1667998 0 28.3gb 0 true true 9.10.0 false
target_index         1 p 192.168.235.152 _13d 1417 1667998 0 28.3gb 0 true true 9.10.0 false
target_index         1 r 192.168.235.16  _12h 1385 1667998 0 28.3gb 0 true true 9.10.0 false
target_index         2 r 192.168.235.57  _10k 1316 1666071 0 28.3gb 0 true true 9.10.0 false
target_index         2 p 192.168.235.152 _120 1368 1666071 0 28.3gb 0 true true 9.10.0 false
target_index         2 r 192.168.235.16  _130 1404 1666071 0 28.3gb 0 true true 9.10.0 false
target_index         3 r 192.168.235.57  _10j 1315 1667528 0 28.3gb 0 true true 9.10.0 false
target_index         3 p 192.168.235.152 _12j 1387 1667528 0 28.3gb 0 true true 9.10.0 false
target_index         3 r 192.168.235.16  _11r 1359 1667528 0 28.3gb 0 true true 9.10.0 false
target_index         4 r 192.168.235.57  _11g 1348 1665771 0 28.2gb 0 true true 9.10.0 false
target_index         4 p 192.168.235.152 _12o 1392 1665771 0 28.2gb 0 true true 9.10.0 false
target_index         4 r 192.168.235.16  _12d 1381 1665771 0 28.2gb 0 true true 9.10.0 false
target_index         5 r 192.168.235.57  _114 1336 1667092 0 28.3gb 0 true true 9.10.0 false
target_index         5 p 192.168.235.152 _119 1341 1667092 0 28.3gb 0 true true 9.10.0 false
target_index         5 r 192.168.235.16  _13i 1422 1667092 0 28.3gb 0 true true 9.10.0 false
.plugins-ml-config   0 r 192.168.235.57  _0      0       1 0  3.6kb 0 true true 9.10.0 true
.plugins-ml-config   0 p 192.168.235.152 _0      0       1 0  3.6kb 0 true true 9.10.0 true
.plugins-ml-config   0 r 192.168.235.16  _0      0       1 0  3.6kb 0 true true 9.10.0 true
.opendistro_security 0 r 192.168.235.57  _0      0      10 2 50.5kb 0 true true 9.10.0 true
.opendistro_security 0 p 192.168.235.152 _0      0      10 2 50.5kb 0 true true 9.10.0 true
.opendistro_security 0 r 192.168.235.16  _1      1       1 0 25.1kb 0 true true 9.10.0 true
.opendistro_security 0 r 192.168.235.16  _b     11       8 2 30.4kb 0 true true 9.10.0 false
.opendistro_security 0 r 192.168.235.16  _c     12       1 0  5.4kb 0 true true 9.10.0 true
.tasks               0 r 192.168.235.57  _0      0       1 0  6.8kb 0 true true 9.10.0 true
.tasks               0 p 192.168.235.152 _0      0       1 0  6.8kb 0 true true 9.10.0 true

Full task list:

curl ... '/_tasks?pretty'
{
  "nodes" : {
    "R40sOviiQN6UZwMjtJXuzA" : {
      "name" : "opensearch-14",
      "transport_address" : "192.168.235.57:9300",
      "host" : "192.168.235.57",
      "ip" : "192.168.235.57:9300",
      "roles" : [
        "cluster_manager",
        "coordinating_only",
        "data",
        "ingest",
        "ml"
      ],
      "attributes" : {
        "shard_indexing_pressure_enabled" : "true"
      },
      "tasks" : {
        "R40sOviiQN6UZwMjtJXuzA:27793" : {
          "node" : "R40sOviiQN6UZwMjtJXuzA",
          "id" : 27793,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis" : 1718359386715,
          "running_time_in_nanos" : 325418,
          "cancellable" : false,
          "cancelled" : false,
          "parent_task_id" : "ndyAV8BoStSX_0XN7cLhuQ:44989",
          "headers" : { }
        }
      }
    },
    "ndyAV8BoStSX_0XN7cLhuQ" : {
      "name" : "opensearch-13",
      "transport_address" : "192.168.235.16:9300",
      "host" : "192.168.235.16",
      "ip" : "192.168.235.16:9300",
      "roles" : [
        "cluster_manager",
        "coordinating_only",
        "data",
        "ingest",
        "ml"
      ],
      "attributes" : {
        "shard_indexing_pressure_enabled" : "true"
      },
      "tasks" : {
        "ndyAV8BoStSX_0XN7cLhuQ:44989" : {
          "node" : "ndyAV8BoStSX_0XN7cLhuQ",
          "id" : 44989,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists",
          "start_time_in_millis" : 1718359386714,
          "running_time_in_nanos" : 432250,
          "cancellable" : false,
          "cancelled" : false,
          "headers" : { }
        },
        "ndyAV8BoStSX_0XN7cLhuQ:44990" : {
          "node" : "ndyAV8BoStSX_0XN7cLhuQ",
          "id" : 44990,
          "type" : "direct",
          "action" : "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis" : 1718359386715,
          "running_time_in_nanos" : 81986,
          "cancellable" : false,
          "cancelled" : false,
          "parent_task_id" : "ndyAV8BoStSX_0XN7cLhuQ:44989",
          "headers" : { }
        }
      }
    },
    "7GAd1nfEQ6yKgZAMwAY_Rg" : {
      "name" : "opensearch-12",
      "transport_address" : "192.168.235.152:9300",
      "host" : "192.168.235.152",
      "ip" : "192.168.235.152:9300",
      "roles" : [
        "cluster_manager",
        "coordinating_only",
        "data",
        "ingest",
        "ml"
      ],
      "attributes" : {
        "shard_indexing_pressure_enabled" : "true"
      },
      "tasks" : {
        "7GAd1nfEQ6yKgZAMwAY_Rg:34546" : {
          "node" : "7GAd1nfEQ6yKgZAMwAY_Rg",
          "id" : 34546,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis" : 1718359386715,
          "running_time_in_nanos" : 317005,
          "cancellable" : false,
          "cancelled" : false,
          "parent_task_id" : "ndyAV8BoStSX_0XN7cLhuQ:44989",
          "headers" : { }
        }
      }
    }
  }
}

Troubleshooting

First, I tried tcpdump and strace.

tcpdump showed some low-throughput traffic, composed solely of small, 66-byte ACK packets; I'd say these are keep-alive signals.
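
A sketch of the capture (the interface and host filter here are assumptions, not the exact command I ran):

# Hypothetical capture: watch traffic between the benchmark host and the cluster's REST port
sudo tcpdump -i any -nn 'host <cluster-ip> and tcp port 9200'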

Strace shows:

$ sudo strace -y -f -p 31846
strace: Process 31846 attached
pselect6(6, [5<socket:[102622]>], [], [5<socket:[102622]>], NULL, NULL

And the output pretty much hangs on that line.


To troubleshoot further, I enabled ddeb packages on my benchmark server and installed python3-dbg, python3-dev, and gdb (a sketch of the steps is below).
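
Roughly the following steps, shown here as a sketch (the ddebs repository setup follows the standard Ubuntu instructions; the PID is a placeholder):

# Enable the Ubuntu debug-symbol (ddebs) archive
sudo apt install ubuntu-dbgsym-keyring
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | \
    sudo tee /etc/apt/sources.list.d/ddebs.list
sudo apt update
sudo apt install python3-dbg python3-dev gdb

# Attach to one of the opensearch-benchmark processes and get the Python backtrace
sudo gdb -p <pid>
(gdb) py-bt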

Then, attaching GDB to any of the opensearch-benchmark processes, I can see:

...

Reading symbols from /home/ubuntu/.local/lib/python3.10/site-packages/ijson/backends/_yajl2.cpython-310-x86_64-linux-gnu.so...
Reading symbols from /home/ubuntu/.local/lib/python3.10/site-packages/ijson/backends/../../ijson.libs/libyajl-d141338e.so.2.1.0...
(No debugging symbols found in /home/ubuntu/.local/lib/python3.10/site-packages/ijson/backends/../../ijson.libs/libyajl-d141338e.so.2.1.0)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007c811231b63d in __GI___select (nfds=nfds@entry=6, readfds=readfds@entry=0x7ffe9ddebfd0, writefds=writefds@entry=0x7ffe9ddec050, exceptfds=exceptfds@entry=0x7ffe9ddec0d0, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/select.c:69
69      ../sysdeps/unix/sysv/linux/select.c: No such file or directory.

...

(gdb) py-bt
Traceback (most recent call first):
  <built-in method select of module object at remote 0x7c8111ae2d40>
  File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/system/transport/TCPTransport.py", line 1151, in _runWithExpiry
    rrecv, rsend, rerr = select.select(wrecv, wsend,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/system/transport/wakeupTransportBase.py", line 80, in _run_subtransport
    rval = self._runWithExpiry(incomingHandler)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/system/transport/wakeupTransportBase.py", line 71, in run
    rval = self._run_subtransport(incomingHandler, max_runtime)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/system/systemBase.py", line 139, in _run_transport
    r = self.transport.run(TransmitOnly if txonly else incomingHandler,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/system/systemBase.py", line 264, in ask
    response = self._run_transport(remTime.remaining())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/thespian/actors.py", line 738, in ask
    return self._systemBase.ask(actorAddr, msg, timeout)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/test_execution_orchestrator.py", line 265, in execute_test
    result = actor_system.ask(benchmark_actor, Setup(cfg, sources, distribution, external, docker))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/test_execution_orchestrator.py", line 314, in benchmark_only
    return execute_test(cfg, external=True)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/test_execution_orchestrator.py", line 69, in __call__
    self.target(cfg)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/test_execution_orchestrator.py", line 378, in run
    pipeline(cfg)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/benchmark.py", line 711, in with_actor_system
    runnable(cfg)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/benchmark.py", line 684, in execute_test
    with_actor_system(test_execution_orchestrator.run, cfg)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/benchmark.py", line 924, in dispatch_sub_command
    execute_test(cfg, args.kill_running_processes)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/osbenchmark/benchmark.py", line 1004, in main
    success = dispatch_sub_command(arg_parser, args, cfg)
  File "/home/ubuntu/.local/bin/opensearch-benchmark", line 8, in <module>
    sys.exit(main())
(gdb) 

All the processes involved in the benchmark are stuck in the same step.

Some Early Conclusions

It seems the main issue is the hang in the thespian logic: it waits on a select() call that never returns.

What I'd propose is the following: we should always apply a timeout to these calls, even if the user does not define one. If the user did not explicitly set a timeout, we use a default value (e.g. 1 hour); once we hit that timeout, we rerun the same actor ask, logging a warning that a thespian timeout has happened. That would force the benchmark to refresh its connection with OpenSearch (see the sketch below).
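
A minimal sketch of that idea, assuming the thespian ask() API seen in the traceback above (the default value, helper name, and retry loop are hypothetical, not existing OSB code):

import logging

DEFAULT_ASK_TIMEOUT_SECONDS = 3600  # hypothetical default (1 hour)

def ask_with_timeout(actor_system, actor, msg, timeout=None):
    """Sketch: always pass a timeout to actor_system.ask() and retry with a
    warning instead of blocking forever in thespian's select() loop."""
    timeout = timeout or DEFAULT_ASK_TIMEOUT_SECONDS
    while True:
        # thespian's ActorSystem.ask() returns None when the timeout expires
        response = actor_system.ask(actor, msg, timeout)
        if response is not None:
            return response
        logging.warning("thespian ask() timed out after %ss; re-sending %s",
                        timeout, type(msg).__name__)

In execute_test() (see the traceback above), something like this would wrap the bare actor_system.ask(benchmark_actor, Setup(...)) call.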

I could not get to the bottom of why the cluster never provided a response here, or, if it did, why that response did not "bubble up" to the benchmark.

To reproduce

Deploy a 3-node OpenSearch cluster on AWS, on 3x i4i.2xlarge instances with 500G gp3 storage.

In my case, I am using https://juju.is/ as my automation. I will defer to this post for how to set up Juju. The deployment is achieved with:

juju add-model test
juju add-space internal-space <your subnets>
juju create-storage-pool opensearch ebs volume-type=gp3

juju deploy opensearch --channel=2/edge -n3 \
    --constraints="instance-type=i4i.2xlarge root-disk=100G spaces=internal-space" \
    --bind internal-space --base=ubuntu@22.04 \
    --storage opensearch-data=opensearch,500G
juju deploy self-signed-certificates -n1 --constraints="instance-type=t2.medium root-disk=100G spaces=internal-space" --bind internal-space --base=ubuntu@22.04
juju relate self-signed-certificates opensearch
juju expose opensearch  # opens the port 9200 for external traffic

Then, on a separate Ubuntu 22.04 i4i.2xlarge machine with a 250G disk, I run the benchmark, installed directly with pip.
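
The install and invocation were roughly as follows (a sketch; the target host and client options are placeholders, and the params file is the lucene-index.json shown next):

pip install opensearch-benchmark

# Run the vectorsearch workload against the existing cluster (benchmark-only pipeline)
opensearch-benchmark execute-test \
    --workload=vectorsearch \
    --workload-params=./lucene-index.json \
    --pipeline=benchmark-only \
    --target-hosts=<cluster-host>:9200 \
    --client-options=<tls/auth options>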

$ cat lucene-index.json 
{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/lucene-index.json",
    "target_index_primary_shards": 6,
    "target_index_replica_shards": 2,
    "target_index_dimension": 768,
    "target_index_space_type": "innerproduct",

    "target_index_bulk_size": 100,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_index_data_set_corpus": "cohere-10m",
    "target_index_bulk_indexing_clients": 16,
    "search_clients": 16,

    "target_index_max_num_segments": 1,
    "hnsw_ef_search": 256,
    "hnsw_ef_construction": 256,

    "query_k": 100,
    "query_body": {
         "docvalue_fields" : ["_id"],
         "stored_fields" : "_none_"
    },

    "query_data_set_format": "hdf5",
    "query_data_set_corpus": "cohere-10m",
    "query_count": 10000
  }

I also made two manual changes: 1) on each opensearch unit, I increased the JVM heap to -Xms32g -Xmx32g; 2) in the benchmark machine's ~/.benchmark/benchmarks/workloads/default/vectorsearch/indices/lucene-index.json, I inserted:

        ,"translog.flush_threshold_size" : "8192MB"
        ,"refresh_interval": "30s"
        ,"merge.scheduler.max_merge_count": "16"

Expected behavior

Not to hang: the benchmark should gracefully detect that the cluster finished the forcemerge and end its activities.

Screenshots

--

Host / Environment

Ubuntu 22.04 for both benchmark and opensearch machines
Python 3.10
OpenSearch 2.14.0
opensearch-benchmark 1.6.0

AWS i4i.2xlarge hosts, with gp3 volumes for their data

Additional context

No response

Relevant log output

Benchmark logs can be found on: https://pastebin.ubuntu.com/p/cWcDcSvTm7/
Server logs do not show any errors during the entire process.
IanHoang commented 2 months ago

@IanHoang @gkamat Looking into this first: https://github.com/opensearch-project/opensearch-benchmark/issues/318 and will see if it's correlated