twmb / franz-go

franz-go contains a feature-complete, pure Go library for interacting with Kafka from 0.8.0 through 3.7+: producing, consuming, transacting, administrating, etc.

Produce Latency Spikes Due To Race Condition When Brokers Are Scaled Down #746

Closed: aratz-lasa closed this issue 2 months ago

aratz-lasa commented 4 months ago

When a Kafka broker leaves the cluster, the franz-go client refreshes its metadata and detects the broker's absence. This closes the connections to that broker and errors any inflight produce requests. Sometimes the error surfaced to the produce request is an EOF or a TCP connection-closed error, which is considered retryable, and the request retries. However, there is a race condition where the error is errUnknownBroker (non-retryable), which requires metadata to be updated first. When this happens, the requests are not retried and the sink is not drained until another metadata request is issued (once MetadataMinAge elapses). Consequently, produce latency spikes to either MetadataMinAge or the produce request timeout, whichever elapses first.
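For context, here is a minimal sketch of the client options this paragraph refers to; the broker address, topic, durations, and error handling below are placeholders for illustration, not values from this report:

```go
package main

import (
	"context"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// The two knobs below bound the latency spike described above: a produce
	// request that dies with the non-retryable errUnknownBroker sits until
	// the next metadata refresh (MetadataMinAge) or until the produce
	// request itself times out, whichever elapses first.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),         // placeholder seed broker
		kgo.MetadataMinAge(5*time.Second),         // minimum interval between metadata refreshes
		kgo.ProduceRequestTimeout(10*time.Second), // per-request produce timeout
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	// Fire-and-forget produce; the callback receives any terminal error,
	// which is where the delayed failures discussed here surface.
	cl.Produce(context.Background(), &kgo.Record{
		Topic: "example-topic",
		Value: []byte("hello"),
	}, func(_ *kgo.Record, err error) {
		if err != nil {
			println("produce failed:", err.Error())
		}
	})

	cl.Flush(context.Background())
}
```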

If the request detects that the broker is unknown but we do know another broker we could send the request to instead, we could retry immediately 🤔 I wonder what you think the best solution for this is. I have a few ideas and I'm happy to send a PR, but I wanted to know your opinion first.

twmb commented 4 months ago

errUnknownBroker is returned when a request is issued to a broker number, but that number is not known because it was not returned from metadata. If you are seeing it, it means the client literally just received a metadata response that did not include that broker.

In the linked line, errUnknownBroker triggers the client to update metadata again at the soonest opportunity -- but it does not spin loop. There's no guarantee (in fact, the odds are low) that broker leadership has actually transferred away from the broker that left the cluster. Per the metadata response, leadership has not actually transferred yet -- that's why the response said "the leader is broker 5" even though broker 5 was not included in the response. Issuing to a random broker likely won't help the situation. What I've seen more often is that NOT_LEADER_FOR_PARTITION is returned for a longer period of time than a broker being completely missing from the metadata response.

So, overall, I see the pain but I don't think that (effectively) "retry spin looping randomly across the cluster until an official leader comes back" will solve the issue, and it may create other issues.
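To make the two error classes above concrete, here is a toy, self-contained sketch; it is not franz-go's actual code, and errUnknownBroker is redeclared locally purely for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errUnknownBroker is redeclared here only to illustrate the discussion; in
// the client it signals that the leader's broker ID was absent from the most
// recent metadata response.
var errUnknownBroker = errors.New("unknown broker")

// classify shows how the two error classes discussed above differ.
func classify(err error) string {
	switch {
	case errors.Is(err, io.EOF):
		// Connection-level errors are retryable: the request is re-sent
		// once the connection is re-established.
		return "retry immediately"
	case errors.Is(err, errUnknownBroker):
		// The request has to wait for the next metadata update (bounded
		// by MetadataMinAge), which is the latency spike reported here.
		return "wait for metadata refresh, then retry"
	default:
		return "fail the request"
	}
}

func main() {
	fmt.Println(classify(io.EOF))
	fmt.Println(classify(errUnknownBroker))
	fmt.Println(classify(errors.New("some other error")))
}
```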

Closing for now for issue tidiness but lmk what you think / if we should reopen.

aratz-lasa commented 4 months ago

@twmb I think I did not explain myself correctly. The metadata response doesn't include the old leader, but it does include the new one. However, the produce request is not re-sent to the new leader until an eventual metadata refresh. I think it's a race condition where inflight requests sometimes get into this state.

twmb commented 4 months ago

I understand now; I have a suspicion as to what this could be.

twmb commented 3 months ago

I believe #761 should improve this behavior.

aratz-lasa commented 3 months ago

Thank you! Within the next 24 hours I will test whether this PR resolves the issue 👍

aratz-lasa commented 3 months ago

Sorry for the delay in getting back to you. I encountered some blockers while testing, but I’ve since completed the tests and haven’t experienced the issue anymore. It looks like the solution works!

I have another question: Some customers have reported similar behavior with consumers, where sudden leadership changes cause spikes of 10s (the old value for MetadataMinAge). Is there an equivalent behavior for consumers, by any chance?

twmb commented 2 months ago

Sorry for the delay here -- let me know if there's some importance to merging and cutting a release. None of the current open issues are major so I've been dragging my feet.

There is no equivalent code on the consumer side. The backoff path works as follows:

Slow metadata updates that wait for the minimum refresh interval go through .triggerUpdateMetadata(...), whereas immediate forced updates that bypass the minimum wait go through .triggerUpdateMetadataNow(...).

The slow waits are:

producer.go
871:    cl.triggerUpdateMetadata(false, "reload trigger due to produce topic still not known")

metadata.go
262:                cl.triggerUpdateMetadata(true, fmt.Sprintf("re-updating metadata due to err: %s", err))
264:                cl.triggerUpdateMetadata(true, retryWhy.reason("re-updating due to inner errors"))

consumer.go
1708:       wait = s.c.cl.triggerUpdateMetadata(false, why) // avoid trigger if within refresh interval

source.go
763:        s.cl.triggerUpdateMetadata(false, fmt.Sprintf("opportunistic load during source backoff: %v", why)) // as good a time as any
902:                s.cl.triggerUpdateMetadata(false, why)

sink.go
202:    s.cl.triggerUpdateMetadata(false, "opportunistic load during sink backoff") // as good a time as any
332:                s.cl.triggerUpdateMetadata(false, "attempting to refresh broker list due to failed AddPartitionsToTxn requests")
1007:       s.cl.triggerUpdateMetadata(true, why)

Producer and sink are not consuming code paths. Consumer and source trigger the update in a deliberately chosen slow path as well (the errors there are entirely unknown topic or unknown partition).

The metadata code path could be a culprit, but only if forcefully issued metadata requests fail 3x, at which point the code internally moves to slow retries.

If you're able to capture debug logs around unexpected 10s delays while consuming, I could look into that -- but I think neither of us knows how to trigger it at the moment.
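For what it's worth, a minimal sketch of how such debug logs can be captured (the seed broker and output destination are placeholders): attach kgo's basic logger at debug level when building the client.

```go
package main

import (
	"os"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Attach a debug-level logger so metadata refreshes, retries, and
	// broker connection churn are visible around the latency spikes.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder seed broker
		kgo.WithLogger(kgo.BasicLogger(os.Stderr, kgo.LogLevelDebug, nil)),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	// ... consume / produce as usual; the debug output on stderr can then
	// be attached to an issue like this one.
}
```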

twmb commented 2 months ago

FWIW implementing KIP-951 may help the consumer side of the equation.