twmb / franz-go

franz-go contains a feature complete, pure Go library for interacting with Kafka from 0.8.0 through 3.7+. Producing, consuming, transacting, administrating, etc.
BSD 3-Clause "New" or "Revised" License

Produce batch time metrics #704

Closed by owenhaynes 3 months ago

owenhaynes commented 5 months ago

I would like to be able to see how long it took to produce a batch, so the timing can be passed to monitoring systems to work out whether the produce timeout is too low or Kafka is being overloaded.

We have experienced issues where batches get stuck in a retry loop because the batch is timing out, and there is no way to diagnose this.

twmb commented 4 months ago

What span of time are you attempting to capture, and with what information? I was thinking to add this to ProduceBatchMetrics, but the ProduceBatchWritten hook is only called on batches that are successfully produced.

Do you want to capture from the moment a record enters Produce through ... when? If you want to capture failures, do you mean for the batch duration to be reported on every failure and retry, or only on the final failure, at which point the promise is called?

owenhaynes commented 4 months ago

Yeah I had a look at produce batch metrics and thought it was the wrong place.

I am more interested in the Kafka request time than in when a message gets put on the produce queue, as the request is where the ProduceRequestTimeout value is used and what causes a retry to happen. So it would be good for this time to be recorded for each retry.

I am not interested in capturing failures at the moment, but maybe it's worth tracking these somehow in general, as RecordRetries can be left unbounded and the promise may never be called, so produce errors cannot be tracked. You could just end up with producing being stuck and no way to investigate.

I like the hook system, as it's useful for switching different tools in and out, but it may make it harder to add error case handling.

twmb commented 4 months ago

So, do you want essentially the time that a record was in the client? If so, couldn't you set r.Context before producing with a key indicating "producing now", and then check time.Since that key once the promise is called?

owenhaynes commented 4 months ago

I have not forgotten about this. I have been looking at whether the data we get from tracing is good enough for this.

owenhaynes commented 3 months ago

Yeah, trace span metrics for producing are giving good results at the moment, so closing.