Closed chrisbeard closed 4 years ago
Encountered similar timeouts.
Kafka: v2.4 Cruise Control: 2.4.9 Tens of thousands of partitions
Bumping that timeout up to 60s seems to work most of the time for my cluster. It would seem reasonable that the number of actual in-flight reassignments directly affects pull time from the controller?
@chrisbeard thanks for submitting the issue and @Keydrain thanks for sharing a use-case. There is a pending todo item to make this timeout configurable.
In addition to that, its default value should be bumped up to a more reasonable value and it should involve a retry logic in case of a timeout.
Configs
Kafka:
v2.5
Cruise-control tag:2.5.0
The cluster has thousands of partitions.Issue
During a rebalance where there were ~50 inter-broker reassignments in progress, we are seeing a few things succeed but after ten seconds the executor runs into this exception. This has happened every time I've tried a rebalance.
Workaround
It looks like raising this 10s timeout to 60s helped my test rebalances complete. I'm not sure why 10s was insufficient for my rebalance though, given that the number of reassignments in-flight was fairly small.