ColinSullivan1 opened 2 weeks ago
FYI, I tried setting the internal property internal.leave.group.on.close = false, which didn't seem to make a difference on the system I was benchmarking.
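For context, I passed the property through the Kafka driver's consumer config. This is a sketch of the driver YAML; the surrounding fields are illustrative and may not match my actual file:

```yaml
name: Kafka
driverClass: io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkDriver

commonConfig: |
  bootstrap.servers=localhost:9092

consumerConfig: |
  auto.offset.reset=earliest
  # Internal, unsupported consumer property: when false, consumers skip the
  # LeaveGroup request on close, which can reduce rebalance churn when many
  # consumers shut down at once.
  internal.leave.group.on.close=false
```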
Adding some instrumentation showed that the Kafka driver's close API was taking up to 800 milliseconds to complete, so waiting for all consumers to close would have taken quite a long time.
One more data point: as an experiment, I commented out consumer.close() in the Kafka driver, and the test ran to completion.
I'm wondering if gathering stats and writing the output file before closing the consumers would work, followed by invoking the consumer.close() APIs in an executor to parallelize work that occurs during the close.
Another option might be to use the close API that accepts a duration, but with 20k consumers I'm not sure if that'd help enough on its own.
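To make the executor idea concrete, here is a hedged sketch of parallelizing the closes. The class and method names are my own for illustration, not the driver's actual code, and I'm using plain AutoCloseable rather than the real consumer type:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelClose {
    // Close all consumers concurrently instead of sequentially. With closes
    // taking up to ~800 ms each, 20k sequential closes would take hours;
    // a fixed pool bounds the total by (consumers / threads) * close time.
    static void closeAll(List<? extends AutoCloseable> consumers, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        for (AutoCloseable c : consumers) {
            futures.add(pool.submit(() -> {
                try {
                    // For a real KafkaConsumer, close(Duration) would also
                    // bound each individual close call.
                    c.close();
                } catch (Exception e) {
                    // Log and keep going; one slow or failed close should
                    // not block the rest of the shutdown.
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get(); // wait for every close to finish
        }
        pool.shutdown();
    }
}
```

This could also be combined with the timed close: each task calls close(Duration) so no single consumer can stall its pool thread indefinitely.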
wdyt?
I notice the worker.stopAll() API is called in WorkloadGenerator.run() before the method exits (and before the results file is generated). This prevents the results from being generated, as worker.stopAll() takes a very long time to complete and the benchmark times out. Note that worker.stopAll() is also called later on during the workload shutdown.
Removing this line allows me to generate the results file. The benchmark still times out when the subsequent worker.stopAll() call is made, but at least I can get results.
Is worker.stopAll() necessary in WorkloadGenerator.run()?
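To make the suggested reordering concrete, here is a minimal sketch; the method and type names are illustrative, not the actual WorkloadGenerator code:

```java
import java.util.function.Supplier;

public class ShutdownOrder {
    interface Worker {
        void stopAll() throws Exception;
    }

    // Gather stats and produce the results *before* the expensive
    // stopAll() call, so a slow or timed-out shutdown can no longer
    // prevent the results file from being written.
    static String runAndReport(Worker worker, Supplier<String> collectResults) throws Exception {
        String results = collectResults.get(); // results are safe from here on
        worker.stopAll(); // may still be slow, but the report already exists
        return results;
    }
}
```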
Hello OpenMessaging maintainers - thank you for your work on this project!
I'm running some very large benchmarks (20k consumers) spread out over 8 very large machines to simulate a large scale test. As the test nears completion (I suspect when results are aggregated) there are numerous consumer errors. These result in the WorkloadGenerator timing out.
Even with smaller tests, I see timeouts due to consumer errors at the end of the run.
For example, I can see that the aggregate high-level stats are OK: there are no errors, no backlog, and steady throughput at the rate I've specified. However, the test times out and the consumer logs show errors. Note that with high-throughput tests and fewer consumers I do not see the issue.
Example Test Setup
Driver:
Workload:
Consumer Errors
Some of the consumer errors include:
I've attached the output and logs of a test that exhibits these symptoms: benchmark-output.txt, sanitized-benchmark-worker.log
Would you have any suggestions for running the benchmarks with extremely large numbers of consumers? Happy to provide more information if you need it.
Edit: Perhaps all of the rebalancing is blocking some requests while the consumers are shutting down/closing?
Thanks!