kafkaex / kafka_ex

Kafka client library for Elixir
MIT License

Don't always send update metadata requests to the same broker #395

Closed signine closed 4 years ago

signine commented 4 years ago

We have been trying to use KafkaEx with an app that generates high produce rates. At first we used a single KafkaEx worker (the default one), but a few moments after the app started, produce requests began to time out. One worker could not keep up with the rate of incoming produce requests, so its mailbox started to fill up.

Next we tried a pool of workers, one per topic and partition, as Brod does. The app was now stable, but we noticed something odd on the brokers: one of them always had significantly higher system load and network traffic (bytes out) than the others. After investigating, we found that the extra load came from the periodic metadata update requests made by all the workers.

For requests such as fetching metadata and api_versions, KafkaEx iterates through every broker it knows about until it gets a successful response. It normally tries the brokers in the same order every time, and the first one usually succeeds, so the first broker in the list receives a disproportionate share of the load.
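
To illustrate the effect described above, here is a minimal sketch of a "first successful response" loop over a fixed broker list. It is not KafkaEx's actual implementation: the module `BrokerFallbackSketch`, the function `first_success/2`, and the broker names are hypothetical, and the request is simulated by a function that always succeeds.

```elixir
defmodule BrokerFallbackSketch do
  # Hypothetical helper (not KafkaEx code): try each broker in list order
  # and return the first one that answers successfully.
  def first_success(brokers, request_fun) do
    Enum.find_value(brokers, :no_response, fn broker ->
      case request_fun.(broker) do
        {:ok, response} -> {broker, response}
        _error -> nil
      end
    end)
  end
end

# Assumed setup: three healthy brokers, each request succeeds.
brokers = [:broker_a, :broker_b, :broker_c]
always_ok = fn broker -> {:ok, {:metadata_from, broker}} end

# Simulate 100 periodic metadata refreshes and record which broker served each.
hits =
  for _ <- 1..100 do
    {broker, _response} = BrokerFallbackSketch.first_success(brokers, always_ok)
    broker
  end

# With a fixed order and healthy brokers, every request lands on the first broker.
IO.inspect(Enum.frequencies(hits))
```

As long as the first broker is healthy, the remaining brokers are never contacted, which matches the skew in load we observed.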

In this PR we randomize the broker list before sending each of these requests, so that the load of metadata update requests is spread evenly across all brokers.
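
The idea can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: the module `ShuffledBrokerSketch`, the function `first_success/2`, and the broker names are hypothetical, and the request is simulated. The only change from a fixed-order loop is the call to `Enum.shuffle/1` before iterating.

```elixir
defmodule ShuffledBrokerSketch do
  # Hypothetical helper (not KafkaEx code): shuffle the broker list, then
  # return the first broker that answers successfully.
  def first_success(brokers, request_fun) do
    brokers
    |> Enum.shuffle()
    |> Enum.find_value(:no_response, fn broker ->
      case request_fun.(broker) do
        {:ok, response} -> {broker, response}
        _error -> nil
      end
    end)
  end
end

# Assumed setup: three healthy brokers, each request succeeds.
brokers = [:broker_a, :broker_b, :broker_c]
always_ok = fn broker -> {:ok, {:metadata_from, broker}} end

# Simulate 3000 metadata refreshes and count how many each broker served.
counts =
  1..3000
  |> Enum.map(fn _ ->
    {broker, _response} = ShuffledBrokerSketch.first_success(brokers, always_ok)
    broker
  end)
  |> Enum.frequencies()

# Counts vary from run to run but should be roughly equal across brokers.
IO.inspect(counts)
```

Shuffling on every call keeps the fallback behaviour (every known broker is still tried until one succeeds) while removing the fixed preference for the first entry in the list.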

Testing: All tests passed locally. I manually tested the behaviour by logging the broker list in first_broker_response().

jbruggem commented 4 years ago

I restarted the failing test and everything passes. The change looks sound, but I don't know this part of the code well enough to be sure it's the logical thing to do :). It's a tiny change, so I have no doubt another maintainer will pick it up very quickly!