mailgun / kafka-pixy

gRPC/REST proxy for Kafka
Apache License 2.0

strange consume behavior with freshly created consumer group #190

Closed jhi closed 3 years ago

jhi commented 3 years ago

The attached file is a result of git archive from my demo application demonstrating the issue.

It's a little bigger than I anticipated since it contains all the up-to-date vendor dependencies, sorry about that.

The demo code itself is short, in cmd/pixy-issue/main.go. It could be even shorter but I tried to be as tidy and explicit as possible.

The bug in short: if a consumer group has just been created, it seems that one "warm-up" ConsumeNAck() call is necessary before the first Produce(); otherwise any ConsumeNAck() calls that follow will fail due to the polling timeout. Yeah, sounds crazy, I know.

But it really seems that the dummy ConsumeNAck (which naturally itself fails due to timeout, since there is nothing yet produced) is needed to do ... something ... maybe it is needed for registering the consumer group properly (I am just waving my arms here).

Run make test to build and run the tests. Kafka is assumed to be running at localhost:9092 and the proxy at localhost:19091; command line flags are available.

The contained README.md gives more details.

pixy-issue.tar.gz

jhi commented 3 years ago

Note that while the demo program tries the Consume() after the Produce() only once, and fails due to the polling timeout, the original code I extracted this from did try consuming in a loop, so there were plenty of retries, up until the 10 minutes timeout of go test (originally this was an attempt at a unit test). So retrying is not the answer.

jhi commented 3 years ago

I also just now tested having a single six-minute sleep after the produce. Nope, doesn't help, the consume that follows still times out.

jhi commented 3 years ago

More experimentation: I decided to check that the Produce() side is not broken and modified the code to test with direct Kafka (sarama) producer code instead of pixy Produce(). Found no problem there; producing via pixy was working fine (I also verified the result with kafkacat).

I ended up doing the full matrix:

    producer=P topic=C group=C    OK
    producer=P topic=G group=C    FAIL
    producer=P topic=C group=G    FAIL
    producer=P topic=G group=G    FAIL
    producer=K topic=C group=C    OK
    producer=K topic=G group=C    FAIL
    producer=K topic=C group=G    FAIL
    producer=K topic=G group=G    FAIL

Legend: producer P(ixy) or K(afka), topic C(onstant) or G(enerated), group C(onstant) or G(enerated).

Conclusion: if either the topic name or the consumer group name is freshly generated, the code fails. Only if both the topic and the group already exist does the code succeed.
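For reference, the matrix and the observed rule can be written down as a small table generator. This is only a sketch of the bookkeeping: the names Case, Matrix, and ObservedOK are made up for illustration and are not taken from the attached demo code.

```go
package main

import "fmt"

// Case is one row of the matrix above. Field names are illustrative only.
type Case struct {
	Producer string // "P" (pixy) or "K" (direct Kafka)
	Topic    string // "C" (constant) or "G" (generated)
	Group    string // "C" (constant) or "G" (generated)
}

// Matrix enumerates all eight combinations in the same order as the
// listing above: producer varies slowest, topic varies fastest.
func Matrix() []Case {
	var cases []Case
	for _, p := range []string{"P", "K"} {
		for _, g := range []string{"C", "G"} {
			for _, t := range []string{"C", "G"} {
				cases = append(cases, Case{Producer: p, Topic: t, Group: g})
			}
		}
	}
	return cases
}

// ObservedOK encodes the observed outcome, not an explanation of it:
// a run succeeded only when both the topic and the group already existed.
func ObservedOK(c Case) bool {
	return c.Topic == "C" && c.Group == "C"
}

// String renders a case in the same layout as the matrix above.
func (c Case) String() string {
	result := "FAIL"
	if ObservedOK(c) {
		result = "OK"
	}
	return fmt.Sprintf("producer=%s topic=%s group=%s    %s", c.Producer, c.Topic, c.Group, result)
}
```

Printing each element of Matrix() reproduces the eight rows; the producer column has no influence on the result.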

Success meaning that the testing sequence successfully first fetches the initial offset, and then consumes+acks the messages.

Failure meaning that the (initial) ConsNAck() never seems to work, always returning the long poll timeout, even though there definitely is content available in the topic. So it's more complex than my original conclusion that it's only about the group being freshly generated.

(And producing found to be faultless.)

horkhe commented 3 years ago

It is not crazy and it is not a bug; it is expected behaviour. It is described here: https://github.com/mailgun/kafka-pixy/blob/master/quick-start-curl.md.
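The linked quick start is the authority on the actual behaviour. The sketch below only illustrates one assumption that would be consistent with everything observed in this thread: the first time a group consumes from a given topic, its position starts at the end of the topic, so messages produced before that first consume are never delivered to it. FakeProxy is a hypothetical in-memory stand-in written for this illustration; it is not the kafka-pixy client and does not talk to Kafka.

```go
package main

import "errors"

// ErrLongPollTimeout stands in for the proxy's long polling timeout.
// The name is hypothetical.
var ErrLongPollTimeout = errors.New("long polling timeout")

// FakeProxy is an in-memory model, NOT the kafka-pixy client API.
type FakeProxy struct {
	topics  map[string][][]byte // messages per topic
	offsets map[[2]string]int   // next offset per {group, topic} pair
}

// NewFakeProxy returns an empty model with no topics and no groups.
func NewFakeProxy() *FakeProxy {
	return &FakeProxy{
		topics:  make(map[string][][]byte),
		offsets: make(map[[2]string]int),
	}
}

// Produce appends a message to the topic, creating the topic if needed.
func (f *FakeProxy) Produce(topic string, msg []byte) {
	f.topics[topic] = append(f.topics[topic], msg)
}

// ConsumeNAck returns the next message for the group and acknowledges it,
// or ErrLongPollTimeout when nothing is available at the group's position.
func (f *FakeProxy) ConsumeNAck(topic, group string) ([]byte, error) {
	key := [2]string{group, topic}
	off, seen := f.offsets[key]
	if !seen {
		// ASSUMPTION: a {group, topic} pair seen for the first time is
		// positioned at the END of the topic, skipping older messages.
		off = len(f.topics[topic])
		f.offsets[key] = off
	}
	if off >= len(f.topics[topic]) {
		return nil, ErrLongPollTimeout
	}
	f.offsets[key] = off + 1
	return f.topics[topic][off], nil
}
```

Under this model, producing first and consuming afterwards times out no matter how often the consume is retried, while a warm-up consume (which itself times out) followed by a produce makes the next consume succeed. Because the position is tracked per group and topic pair, a fresh topic with an existing group behaves the same way as a fresh group, which matches the FAIL rows of the matrix.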