Closed owenhaynes closed 3 months ago
What span of time are you attempting to capture, and with what information? I was thinking to add this to ProduceBatchMetrics, but the ProduceBatchWritten hook is only called on batches that are successfully produced.
Do you want to capture from the moment a record enters Produce through ... when? If you want to capture failures, do you mean for the batch duration to be called on every failure and retry, or only on the final failure, at which point the promise is called?
Yeah I had a look at produce batch metrics and thought it was the wrong place.
I'm more interested in the Kafka request time than in when a message gets put on the produce queue, as this is when the ProduceRequestTimeout value is used, and it is what causes a retry to happen. So it would be good for the time taken to be recorded for each retry.
I am not interested in capturing failures at the moment, but maybe it's worth tracking these somehow generally, as RecordRetries can be left unbounded and the promise may never be called, which leaves no way to track produce errors. So you could just end up with producing being stuck and no way to investigate.
I like the hook system as it's useful for switching different tools in and out, but it may make it harder to add error case handling.
So, do you want essentially the time that a record was in the client? If so, couldn't you set r.Context before producing with a key indicating "producing now", and then check time.Since that key once the promise is called?
Not forgotten about this; I've been looking at whether the data we get from tracing is good enough for this.
Yeah, trace span metrics for producing are giving good results atm, so closing.
To be able to see how long it took to produce a batch, so this can be passed to monitoring systems to work out whether the produce timeout is too low or Kafka is being overloaded.
Experienced issues where batches get stuck in a retry loop because the batch is timing out, with no way to diagnose this.