mailgun / kafka-pixy

gRPC/REST proxy for Kafka
Apache License 2.0

Clean up ZooKeeper group data on last member leave #155

Closed horkhe closed 5 years ago

horkhe commented 5 years ago

Consumer group data was never removed from ZooKeeper before, which could cause issues with the disposable groups that we use to implement broadcast events. With this PR, when the last member leaves a group, the group's data structure in ZooKeeper is deleted. The logic is careful to survive a data race that can occur when the last member leaves but a new one joins the group at the same time.

Besides:

thrawn01 commented 5 years ago

I was unable to get kafka-pixy to clean up the /consumers in any situation.

$ zkCli --servers localhost -c lsr /consumers
scout
scout/ids
scout/owners
scout/owners/catchall_legacy
scout/owners/catchall_unicast
scout/owners/scout_recount
scout_5938c8f68bb347db88e531db25b8b678
scout_5938c8f68bb347db88e531db25b8b678/ids
scout_5938c8f68bb347db88e531db25b8b678/owners
scout_5938c8f68bb347db88e531db25b8b678/owners/catchall_broadcast

  -g, --group      consumer group we are in (Default=kafka-pixy-cli, Env=KAFKA_PIXY_GROUP)
  -b, --buffer     how many events to buffer before consumed (Default=0, Env=KAFKA_PIXY_BUFFER)

thrawn at Derricks-MacBook-Pro in ~
$ kafka-pixy-cli consume catchall_legacy -g scout
^[[A^C

thrawn at Derricks-MacBook-Pro in ~
$ kafka-pixy-cli consume catchall_legacy -g scout -e localhost:19081

Here is the log from when I started a single consumer and then stopped the consumer. https://gist.github.com/thrawn01/be1388b742efb47c33fe5838682dae74

I didn't have time to dig into why it wasn't cleaning up. If you want, I should have more time tomorrow.

horkhe commented 5 years ago

@thrawn01 It looks like in your case the subscriber stopped before partitioncsm released the partition. I wonder how the respective test managed to pass. I will look into it tomorrow.

2018-10-16 14:07:19.705718 -05 error </_.0/cons.0/scout_5938c8f68bb347db88e531db25b8b678.0/member.0> "Failed to delete empty group" error="while deleting /consumers/scout_5938c8f68bb347db88e531db25b8b678/owners/catchall_broadcast: zk: node has children" kafka.group="scout_5938c8f68bb347db88e531db25b8b678"
2018-10-16 14:07:19.705753 -05 info </_.0/cons.0/scout_5938c8f68bb347db88e531db25b8b678.0/member.0> Stopped kafka.group="scout_5938c8f68bb347db88e531db25b8b678"
...
2018-10-16 14:07:19.708201 -05 info </_.0/cons.0/scout_5938c8f68bb347db88e531db25b8b678.0/catchall_broadcast.p0.0> "Partition released: via=/_.0/cons.0/scout_5938c8f68bb347db88e531db25b8b678.0/member.0, retries=0, took=2.02916ms" kafka.group="scout_5938c8f68bb347db88e531db25b8b678" kafka.partition=0 kafka.topic="catchall_broadcast"
2018-10-16 14:07:19.708226 -05 info </_.0/cons.0/scout_5938c8f68bb347db88e531db25b8b678.0/catchall_broadcast.p0.0> Stopped kafka.group="scout_5938c8f68bb347db88e531db25b8b678" kafka.partition=0 kafka.topic="catchall_broadcast"
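
The log above shows the group deletion attempt running about 2.5 ms before the partition was released. One common way to enforce the opposite order is to wait for every partition consumer to finish before touching the group nodes. The sketch below shows that pattern with `sync.WaitGroup`; it is an assumption for illustration only, not necessarily how this PR addressed the problem, and `member`, `startPartition`, and `stop` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// member is a hypothetical group member that owns several partition consumers.
type member struct {
	wg     sync.WaitGroup
	mu     sync.Mutex
	events []string
}

// record appends an event to the member's log under a lock.
func (m *member) record(e string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.events = append(m.events, e)
}

// startPartition launches a partition consumer that releases its partition
// once stopCh is closed.
func (m *member) startPartition(name string, stopCh <-chan struct{}) {
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		<-stopCh
		m.record("released " + name)
	}()
}

// stop signals all partition consumers, waits until every one of them has
// released its partition, and only then records the group cleanup.
func (m *member) stop(stopCh chan struct{}) []string {
	close(stopCh)
	m.wg.Wait()
	m.record("group deleted")
	m.mu.Lock()
	defer m.mu.Unlock()
	return append([]string(nil), m.events...)
}

func main() {
	m := &member{}
	stopCh := make(chan struct{})
	m.startPartition("catchall_broadcast.p0", stopCh)
	m.startPartition("catchall_broadcast.p1", stopCh)
	for _, e := range m.stop(stopCh) {
		fmt.Println(e)
	}
}
```

The two release events may appear in either order, but the cleanup event always comes last because it is recorded only after `wg.Wait()` returns.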

horkhe commented 5 years ago

@thrawn01 fixed.