twmb / franz-go

franz-go contains a feature complete, pure Go library for interacting with Kafka from 0.8.0 through 3.7+. Producing, consuming, transacting, administrating, etc.
BSD 3-Clause "New" or "Revised" License

record failed after being retried too many times #675

Closed droger88 closed 6 months ago

droger88 commented 7 months ago

This is more of a question than a bug report.

We are currently implementing an API that produces messages to a Kafka cluster. One of the operations uses a synchronous producer. The client receives an error code if anything goes wrong, and is responsible for retrying failed messages.

With that in mind, here is the producer configuration:

opts = append(opts,
        kgo.RecordRetries(0),               // disable retries for the sync producer
        kgo.UnknownTopicRetries(0),         // disable retries for unknown topics
        kgo.MaxBufferedBytes(1000000),
        kgo.RequiredAcks(kgo.AllISRAcks()),
        kgo.DisableIdempotentWrite(),
    )

However, what we have noticed is that if one request fails, all subsequent requests fail too. For example:

1. ProduceSync(realTopic) -> receive success response
2. ProduceSync(fakeTopic) -> receive failed response (no topic or partition)
3. ProduceSync(realTopic) -> receive failed response (record failed after being retried too many times)

I have tried setting different values for RecordRetries, and I'm not sure whether this is caused by the documented behavior: "If a record fails due to retries, all records buffered in the same partition are failed as well." I'm also curious how to implement the behavior we are looking for with franz-go. Should I abort all buffered records after each produce operation?
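A minimal sketch of the abort-after-failure idea, using the real `kgo.ProduceSync` and `kgo.Client.AbortBufferedRecords` APIs. The broker address, topic name, and overall flow are assumptions for illustration; this needs a running Kafka broker to execute.

```go
package main

import (
	"context"
	"fmt"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // assumed broker address
		kgo.RecordRetries(0),
		kgo.UnknownTopicRetries(0),
		kgo.RequiredAcks(kgo.AllISRAcks()),
		kgo.DisableIdempotentWrite(),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	ctx := context.Background()
	res := cl.ProduceSync(ctx, &kgo.Record{Topic: "fakeTopic", Value: []byte("v")})
	if err := res.FirstErr(); err != nil {
		fmt.Println("produce failed:", err)
		// Drop anything still buffered so a failed request does not
		// poison later produces to the same partition.
		if abortErr := cl.AbortBufferedRecords(ctx); abortErr != nil {
			fmt.Println("abort failed:", abortErr)
		}
	}
}
```

Whether aborting after every failure is appropriate depends on whether you can tolerate dropping records that were still in flight at that moment.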

twmb commented 7 months ago

You shouldn't see (2) causing (3) to fail, because they're producing to separate topics. Are you seeing that?

You should see that if one record fails, all other records on the same partition fail as well. Internally, records are enqueued in batches, and records are produced with sequence numbers even across batches; when a record is enqueued, a sequence number is attached. If we fail a middle batch, the later batches are out of order and we need to fail them too. It would be possible to repair the sequence numbers, but when people produce messages, ordering is commonly important, and I don't want to assume it isn't — usually when ordering matters, it matters critically.

Perhaps one way of doing this is to change the internal behavior to return an error for all records in a batch that actually failed, and then fail all other batches with a wrapping error, something like LaterBatchError{innerError}.

I'm not sure how you can handle the current behavior with the APIs and behavior that exists today. What error are you encountering, and why are you disabling retries entirely?

twmb commented 6 months ago

Closing as stale, but we can reopen if details are added.