linkedin / kafka-monitor

Xinfra Monitor monitors the availability of Kafka clusters by producing synthetic workloads using end-to-end pipelines to obtain derived vital statistics - E2E latency, service produce/consume availability, offsets commit availability & latency, message loss rate and more.
https://engineering.linkedin.com/blog/2016/05/open-sourcing-kafka-monitor
Apache License 2.0

UpdatePartitionState to avoid restarting Producer #375

Open suyashtava opened 2 years ago

suyashtava commented 2 years ago

On every new partition, we currently kill the whole producer service and restart it. Instead, we can update the partition state to include the new partitions, which avoids the unnecessary restart.

Issue: https://github.com/linkedin/kafka-monitor/issues/376
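Roughly, the idea is something like the following (a minimal sketch with made-up class and method names, not the actual Xinfra Monitor code): when the monitor topic gains partitions, only the new partitions get produce tasks scheduled, and the producer plus the existing tasks are left alone.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PartitionStateUpdater {
  private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
  private final Set<Integer> scheduledPartitions = ConcurrentHashMap.newKeySet();

  /** Called when the monitor topic's partition count is observed to have grown. */
  public void onPartitionCountChange(int newPartitionCount) {
    for (int partition = 0; partition < newPartitionCount; partition++) {
      // Only schedule produce tasks for partitions we have not seen before;
      // existing partitions keep their tasks, so the producer is never restarted.
      if (scheduledPartitions.add(partition)) {
        final int p = partition;
        scheduler.scheduleAtFixedRate(() -> produceTo(p), 0, 100, TimeUnit.MILLISECONDS);
      }
    }
  }

  private void produceTo(int partition) {
    // Send one synthetic record to this partition using the shared producer (omitted here).
  }
}
```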

suyashtava commented 2 years ago

@linkedin, @mhratson, @andrewchoi5, @Lincong for review

suyashtava commented 2 years ago

@efeg for review pls.

suyashtava commented 2 years ago

@CCisGG Is there anyone you could add for review, please?

CCisGG commented 2 years ago

I don't have much context on this repo. @mitchhh22 could you or your team help to review this?

mhratson commented 2 years ago

I can take a look next week. /Maryan

mhratson commented 2 years ago

@suyashtava thanks for the contribution! Before accepting the PR I'd like to understand more about the problem being solved by this.

Could you please describe why restarting is an issue? While I can assume that restarting may be slow, I'd like to know the other arguments as well, if available.

Thanks

suyashtava commented 2 years ago

> @suyashtava thanks for the contribution! Before accepting the PR I'd like to understand more about the problem being solved by this.
>
> Could you please describe why restarting is an issue? While I can assume that restarting may be slow, I'd like to know the other arguments as well, if available.
>
> Thanks

@mhratson

Background: We were using KMF for broker health detection. Since all partitions share the same producer, even one slow broker in the cluster slows down the other partitions behind it in the same producer's queue, which made it difficult to pinpoint the unhealthy broker.

To address this, we introduced a 1:1 mapping of partition to producer, so the remaining partitions can still be produced to and are not blocked by a producer that happens to be slow.

Challenge: On every shutdown we now need to close multiple producers, which we did. But whenever a new partition was added, all the producers were restarted again, which caused significant slowness.

Proposal: We can discuss the 1:1 producer-to-partition mapping separately, but restarting all threads on each new partition is an overhead even in the current KMF, when we could simply add the new partition to the scheduler (see the sketch below).
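To make the 1:1 mapping concrete, something along these lines (hypothetical class and field names using a plain KafkaProducer, not the existing KMF ProduceService): each partition gets its own producer, so a slow broker only backs up that partition's sends, and a newly discovered partition only adds one map entry instead of restarting everything.

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PerPartitionProducers {
  // producerProps is assumed to carry bootstrap servers and key/value serializers.
  private final Properties producerProps;
  private final String topic;
  private final Map<Integer, KafkaProducer<String, String>> producers = new ConcurrentHashMap<>();

  public PerPartitionProducers(Properties producerProps, String topic) {
    this.producerProps = producerProps;
    this.topic = topic;
  }

  /** Each partition has its own producer, so a slow broker only stalls that partition's sends. */
  public void send(int partition, String key, String value) {
    KafkaProducer<String, String> producer =
        producers.computeIfAbsent(partition, p -> new KafkaProducer<>(producerProps));
    producer.send(new ProducerRecord<>(topic, partition, key, value));
  }

  /** Shutdown closes every per-partition producer; adding a partition only adds a map entry. */
  public void close() {
    producers.values().forEach(KafkaProducer::close);
  }
}
```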

suyashtava commented 2 years ago

If the 1:1 mapping of producer and partition sounds good, I can raise another PR for that after this one. @mitchhh22 @mhratson @andrewchoi5 @Lincong @efeg

suyashtava commented 2 years ago

@mhratson did you by any chance get a moment to check this? Thanks in advance.

mhratson commented 2 years ago

> To address this, we introduced a 1:1 mapping of partition to producer, so the remaining partitions can still be produced to and are not blocked by a producer that happens to be slow.

That's not the case for this kafka-monitor, is it?

suyashtava commented 1 year ago

> To address this, we introduced a 1:1 mapping of partition to producer, so the remaining partitions can still be produced to and are not blocked by a producer that happens to be slow.
>
> That's not the case for this kafka-monitor, is it?

@mhratson Apologies, I had to step away for personal reasons. Reopening this thread. You are correct, this is not the case for this KMF.

This opens up two things: IMHO we should decouple the producer for each partition here as well, so that it is easy to detect which broker is slow.

Even if we decide against the above, at the very least we should not restart the whole producer when a new partition appears; the new partition should be attached to the same producer. That way we increase KMF availability.

suyashtava commented 1 year ago

Added PR #394 for issue #395 (multiple producers, one per partition).