[BUG] OSB sometimes hangs on long-running requests

gkamat commented 1 year ago

Describe the bug: With long-running requests such as force merges, OSB sometimes hangs even though the operation has already completed on the cluster.

To Reproduce: This is intermittent, but running OSB with the http_logs workload occasionally hangs on the force-merge-1-seg request. The operation eventually completes in OSB, but long after the cluster has finished the task. This can be verified by re-running the request manually while OSB is still waiting for it to complete.

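|                                                         Metric |                                                   Task |       Value |    Unit |
|---------------------------------------------------------------:|-------------------------------------------------------:|------------:|--------:|
|                                  100th percentile service time |                          asc_sort_with_after_timestamp |      25.232 |      ms |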
|                                                     error rate |                          asc_sort_with_after_timestamp |           0 |       % |
|                                       100th percentile latency |                                      force-merge-1-seg | 7.20048e+06 |      ms |
|                                  100th percentile service time |                                      force-merge-1-seg | 7.20048e+06 |      ms |
|                                                     error rate |                                      force-merge-1-seg |         100 |       % |
|                                                 Min Throughput |                         wait-until-merges-1-seg-finish |        3.11 |   ops/s |
|                                                Mean Throughput |                         wait-until-merges-1-seg-finish |        3.11 |   ops/s |
|                                              Median Throughput |                         wait-until-merges-1-seg-finish |        3.11 |   ops/s |
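
One way to confirm from outside OSB that the cluster has finished merging is to read the index stats and look at the count of running merges. The sketch below is a minimal illustration, not what OSB does internally: the endpoint `http://localhost:9200` and the helper names are assumptions, and it relies only on the `_all.total.merges.current` field of the index stats response.

```python
import json
import urllib.request


def merges_in_progress(stats):
    """Return the number of merges currently running, taken from an
    index-stats response body that has already been parsed into a dict."""
    return stats["_all"]["total"]["merges"]["current"]


def fetch_merge_stats(base_url="http://localhost:9200", timeout_s=10.0):
    """GET <base_url>/_stats/merge and return the parsed JSON body.

    base_url is a placeholder; point it at the cluster under test.
    """
    url = f"{base_url.rstrip('/')}/_stats/merge"
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Requires a reachable cluster at the assumed endpoint.
    running = merges_in_progress(fetch_merge_stats())
    print(f"merges currently running: {running}")
```

If this reports zero running merges while OSB is still blocked on force-merge-1-seg, the wait is on the client side of the request rather than in the merge itself.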

Expected behavior: No "hanging" or "stuck" behavior, and accurate reporting of the outcome of the request.

More Context: OSB 1.0 against OpenSearch 2.3+

gkamat commented 1 year ago

After allowing some time for the merge operation to complete, a manual request via curl returns immediately:

{"_shards":{"total":120,"successful":120,"failed":0}}

but OSB remains stuck at:

2023-06-05 23:03:17,281 ActorAddr-(T|:46511)/PID:213463 osbenchmark.worker_coordinator.worker_coordinator INFO Task assertions enabled: False
2023-06-05 23:03:17,281 ActorAddr-(T|:46511)/PID:213463 osbenchmark.worker_coordinator.worker_coordinator INFO Choosing [unthrottled] for [force-merge-1-seg].
2023-06-05 23:03:17,282 ActorAddr-(T|:46511)/PID:213463 osbenchmark.worker_coordinator.worker_coordinator INFO Creating iteration-count based schedule with [None] distribution for [force-merge-1-seg] with [0] warmup iterations and [1] iterations.
2023-06-05 23:03:17,282 ActorAddr-(T|:46511)/PID:213463 osbenchmark.worker_coordinator.worker_coordinator INFO iteration-count-based schedule will determine when the schedule for [force-merge-1-seg] terminates.

gkamat commented 1 year ago

The issue appears to be on the OpenSearch side: a long-running force-merge issued via urllib3 outside of OSB sometimes never returns from the call. For now, a local change to the http_logs workload might suffice as a workaround.
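
For anyone trying to reproduce this outside of OSB, a rough sketch of a standalone force-merge call with an explicit client-side timeout follows. It uses Python's standard library rather than urllib3 to stay dependency-free, so it is not the exact client path described above. The endpoint, the index name `logs-181998`, the timeout value, and the function names are all placeholders chosen for illustration.

```python
import json
import socket
import urllib.error
import urllib.parse
import urllib.request


def force_merge_url(base_url, index, max_num_segments=1):
    """Build the _forcemerge URL for a single index or index pattern."""
    query = urllib.parse.urlencode({"max_num_segments": max_num_segments})
    index_part = urllib.parse.quote(index, safe=",*")
    return f"{base_url.rstrip('/')}/{index_part}/_forcemerge?{query}"


def force_merge(base_url, index, max_num_segments=1, timeout_s=60.0):
    """POST a force-merge request.

    Returns the parsed JSON body, or None if the client gave up after
    timeout_s seconds without receiving a response.
    """
    request = urllib.request.Request(
        force_merge_url(base_url, index, max_num_segments), method="POST"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return json.loads(response.read())
    except (socket.timeout, TimeoutError):
        # Timed out while waiting for or reading the response.
        return None
    except urllib.error.URLError as exc:
        # Timeouts while connecting or sending arrive wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return None
        raise


if __name__ == "__main__":
    # Assumed endpoint and index; adjust for the cluster under test.
    result = force_merge("http://localhost:9200", "logs-181998", timeout_s=7200.0)
    print("client timed out" if result is None else result)
```

A `None` result here only means the client stopped waiting. The merge may still be running, or may already have finished, on the cluster.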

bbarani commented 1 year ago

@gkamat @IanHoang Do we need to open an issue in OpenSearch core to investigate this behavior? CC: @dblock @nknize

opensearch-project / opensearch-benchmark

[BUG] OSB sometimes hangs on long-running requests #323