reactor / reactor-kafka

Reactive Kafka Driver with Reactor
http://projectreactor.io

Consumption stopped with slow consumer #386

Open abialas opened 8 months ago

abialas commented 8 months ago

I have a simple but slow consumer which consumes one record at a time:

  private Flux<Void> consumeRecords(Flux<ReceiverRecord<String, Value>> records) {
    return records
        .concatMap(receiverRecord -> handleReceivedRecord(receiverRecord))
        .concatMap(this::ackOffset)
        .doOnEach(logOnError(logErrorReceiveRecord()))
        .retryWhen(RETRY_FAILED_CONSUMER_SPEC);
  }

The processing time of handleReceivedRecord is less than 500 ms per record. I understand that this consumer is slow and needs to be fixed (because of concurrency). However, in my test I produce only about 3000 records in 1 minute to the topic this consumer reads from. Initially it consumes fine, but after some time the consumer stops consuming. There is no error log or anything similar.
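For scale, here is a rough worst-case estimate. It assumes every record takes the full 500 ms (the issue only gives this as an upper bound) and that a single consumer instance processes records strictly one after another, as the concatMap pipeline above does:

```latex
\begin{align*}
\text{production rate} &= \frac{3000\ \text{records}}{60\ \text{s}} = 50\ \text{records/s} \\
\text{worst-case consumption rate} &= \frac{1\ \text{record}}{0.5\ \text{s}} = 2\ \text{records/s} \\
\text{worst-case drain time} &= 3000 \times 0.5\ \text{s} = 1500\ \text{s} = 25\ \text{min}
\end{align*}
```

Under these assumptions the producer outpaces the consumer by a wide margin, so sustained back pressure would be expected during the test.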

In the logs I see such messages:

Rebalance during back pressure, re-pausing new assignments
Rebalancing; waiting for 104 records in pipeline

and I have to restart the consumer instance to fix this. It is also worth mentioning that when I disable scaling up of consumers, it works fine.

Expected Behavior

Consuming records from topic should not stop.

Actual Behavior

Consuming records from topic is stuck and restart is required.

Your Environment

3vl commented 3 months ago

I am seeing similar behavior in 1.3.23. It seems to happen intermittently. I think this happens after a rebalance?

It looks like the partitions have been paused and never resumed. I added an endpoint that lets me see the paused partitions. When consumption stops, I can see that the partitions are paused. If I use another endpoint to force them to resume, consumption starts again.

@abialas, can you reproduce this consistently, or is it an intermittent problem like the one I am seeing?

3vl commented 3 months ago

I don't see how the partitions paused on this line are ever resumed: https://github.com/reactor/reactor-kafka/blob/2ae3abbc7a876008585eef4972d4fd4af30e2263/src/main/java/reactor/kafka/receiver/internals/ConsumerEventLoop.java#L248

The only place I see a resume is for the partitions in pausedByUs, and the partitions paused on that line aren't added to that collection.
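The suspected gap can be illustrated with a toy model. This is a minimal sketch of the hypothesis above, not the reactor-kafka code: only the name pausedByUs comes from the source, while the class, the `paused` set, the method names, and the partition names are hypothetical stand-ins. It assumes that partitions re-paused after a rebalance are not recorded in pausedByUs, and that resume only touches what is recorded there.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only; it does not use or reproduce the Kafka client or reactor-kafka.
class PausedByUsSketch {
    // Stand-in for the consumer's actual pause state (hypothetical).
    final Set<String> paused = new HashSet<>();
    // Stand-in for the event loop's record of partitions it paused itself.
    final Set<String> pausedByUs = new HashSet<>();

    // Back pressure: pause the assigned partitions and remember that we did.
    void pauseForBackPressure(Collection<String> assigned) {
        paused.addAll(assigned);
        pausedByUs.addAll(assigned);
    }

    // Assumed behaviour under the hypothesis: new assignments are paused
    // after a rebalance, but NOT added to pausedByUs.
    void rePauseAfterRebalance(Collection<String> newlyAssigned) {
        paused.addAll(newlyAssigned);
    }

    // Back pressure relieved: resume only what is recorded in pausedByUs.
    void resumeAfterBackPressure() {
        paused.removeAll(pausedByUs);
        pausedByUs.clear();
    }

    public static void main(String[] args) {
        PausedByUsSketch loop = new PausedByUsSketch();
        loop.pauseForBackPressure(List.of("topic-0"));
        loop.rePauseAfterRebalance(List.of("topic-1"));
        loop.resumeAfterBackPressure();
        // topic-1 was never recorded, so it is never resumed.
        System.out.println("still paused: " + loop.paused); // prints "still paused: [topic-1]"
    }
}
```

If the real event loop behaves like this model, a partition assigned during back pressure would stay paused until something resumes it explicitly, which would match the observation that a forced resume restarts consumption.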