Open theopigott opened 2 years ago
Hi @theopigott, thanks for bringing this up. It should be easy enough to simply ignore KafkaBackOffException
s - we do have a SeekUtils.isBackOffException
method.
I'm wondering though if it'd be interesting to have separate metrics of the KafkaBackOffException
- that would enable us to have metrics such as the number of back offs that happened, or maybe even have more granular information such as the partition that was backed off and the time the message should be processed, or maybe the back off time for each occurrence.
What do you think? Thanks.
Thanks @tomazfernandes - yes that makes sense for an easy fix.
It could indeed also be interesting to record metrics based on the KafkaBackOffException
too. I think that the number of backoffs and the back off time/sleep duration might be of interest for tuning the retry topic configuration - they would give some indication of how idle the listener is due to delayed topics. I wouldn't generate metrics of the absolute time that the message should be processed as that's difficult to normalize and aggregate.
Makes sense @theopigott, thanks. Is that something you'd be interested in contributing a PR to? If that's ok with @garyrussell of course.
Given that it's a behavior change; it would have to be optional (e.g. ContainerProperties
new property), the new behavior can be the default but we need to be able to switch back if users want to.
Adding highly variable tags to meters is not recommended (e.g. actual back off time); some metrics back-ends don't handle such variability well.
I think that the back off time would be the value of a metric rather than a tag. Then the 'count' would be the number of back-offs that happened, and the 'sum' (for example) would be the total time sleeping due to back-offs.
But would it make sense to first skip treating KafkaBackoffException
as failures (with an option to switch back if needed, as you rightly suggest), and then potentially add further metrics about back-offs later in a separate change/PR?
Makes sense; thanks.
Expected Behavior
As per the docs on 'Monitoring Listener Performance', there are
Micrometer
timers calledspring.kafka.listener
which are tagged with aresult
(success
orfailure
) andexception
. I would expect the metrics generated with thefailure
tag to capture true failures (e.g. anIOException
from some resource that is used to process records). Any back-off exceptions, which are expected to occur for topics with a delay configured, should be treated separately, e.g. with a different tag value forresult
orexception
.Current Behavior
A
failure
timer is recorded whenever aRuntimeException
occurs while processing a record. When dealing with retry topics, this includes aKafkaBackoffException
which may be thrown insideinvokeOnMessage
(or the batch equivalent) when the listener determines that the timestamp of the latest record is not ready to be processed yet. Theexception
is always recorded asListenerExecutionFailedException
so there is no way to differentiate back-off exceptions from other exceptions.Context
I would like to analyze the listener metrics to gain insight into failures (how often they happen, the performance impact, etc.), but I'm interested in application logic failures (e.g. database is unavailable) rather than expected framework level failures (back-off exceptions). I was surprised to see my metrics indicating many failures despite the application logs showing that all records were successfully processed until I realized that the failures must actually be due to these
KafkaBackoffException
s.I could implement my own timers/metrics inside my
KafkaListener
, but I would prefer to be able to use the existing timers that are provided by the framework.