spring-projects / spring-kafka

Provides Familiar Spring Abstractions for Apache Kafka
https://projects.spring.io/spring-kafka
Apache License 2.0
2.17k stars 1.55k forks source link

Differentiate back-off exceptions from 'real' application errors in Listener Micrometer timer metrics for retry topics #2237

Open theopigott opened 2 years ago

theopigott commented 2 years ago

Expected Behavior

As per the docs on 'Monitoring Listener Performance', there are Micrometer timers called spring.kafka.listener which are tagged with a result (success or failure) and exception. I would expect the metrics generated with the failure tag to capture true failures (e.g. an IOException from some resource that is used to process records). Any back-off exceptions, which are expected to occur for topics with a delay configured, should be treated separately, e.g. with a different tag value for result or exception.

Current Behavior

A failure timer is recorded whenever a RuntimeException occurs while processing a record. When dealing with retry topics, this includes a KafkaBackoffException which may be thrown inside invokeOnMessage (or the batch equivalent) when the listener determines that the timestamp of the latest record is not ready to be processed yet. The exception is always recorded as ListenerExecutionFailedException so there is no way to differentiate back-off exceptions from other exceptions.

Context

I would like to analyze the listener metrics to gain insight into failures (how often they happen, the performance impact, etc.), but I'm interested in application logic failures (e.g. database is unavailable) rather than expected framework level failures (back-off exceptions). I was surprised to see my metrics indicating many failures despite the application logs showing that all records were successfully processed until I realized that the failures must actually be due to these KafkaBackoffExceptions.

I could implement my own timers/metrics inside my KafkaListener, but I would prefer to be able to use the existing timers that are provided by the framework.

tomazfernandes commented 2 years ago

Hi @theopigott, thanks for bringing this up. It should be easy enough to simply ignore KafkaBackOffExceptions - we do have a SeekUtils.isBackOffException method.

I'm wondering though if it'd be interesting to have separate metrics of the KafkaBackOffException - that would enable us to have metrics such as the number of back offs that happened, or maybe even have more granular information such as the partition that was backed off and the time the message should be processed, or maybe the back off time for each occurrence.

What do you think? Thanks.

theopigott commented 2 years ago

Thanks @tomazfernandes - yes that makes sense for an easy fix.

It could indeed also be interesting to record metrics based on the KafkaBackOffException too. I think that the number of backoffs and the back off time/sleep duration might be of interest for tuning the retry topic configuration - they would give some indication of how idle the listener is due to delayed topics. I wouldn't generate metrics of the absolute time that the message should be processed as that's difficult to normalize and aggregate.

tomazfernandes commented 2 years ago

Makes sense @theopigott, thanks. Is that something you'd be interested in contributing a PR to? If that's ok with @garyrussell of course.

garyrussell commented 2 years ago

Given that it's a behavior change; it would have to be optional (e.g. ContainerProperties new property), the new behavior can be the default but we need to be able to switch back if users want to.

Adding highly variable tags to meters is not recommended (e.g. actual back off time); some metrics back-ends don't handle such variability well.

theopigott commented 2 years ago

I think that the back off time would be the value of a metric rather than a tag. Then the 'count' would be the number of back-offs that happened, and the 'sum' (for example) would be the total time sleeping due to back-offs.

But would it make sense to first skip treating KafkaBackoffException as failures (with an option to switch back if needed, as you rightly suggest), and then potentially add further metrics about back-offs later in a separate change/PR?

garyrussell commented 2 years ago

Makes sense; thanks.