Currently, the felice library retries forever whenever an error is returned, on the assumption that errors are temporary.
However, many errors are not temporary. For example, if a message cannot be decoded, that error is permanent.
Also, for some messages, even when the error is "temporary" (for example because a database is down), it doesn't make sense to retry forever, because the message is inherently time-bounded in some sense.
Currently code that wants to avoid retries tends to log an error and return nil. This is awkward and repetitive.
One possible solution would be to add logic to the felice consumer package that would recognize specific kinds of error and avoid retrying in those cases, but log instead. This would add more complexity to the felice package, and it also adds a tighter binding between felice and the system logger. Currently, felice can produce log messages, but it uses a different kind of logger to that used by most Galaxy-based services (the standard log package rather than logrus).
Instead of adding complexity to the felice package, I propose that a better solution is to be to remove complexity from it by removing the retry logic entirely, leaving it up to the caller (kafka-go) to implement its own retry mechanisms. If a handler returns an error, the message is simply discarded. This enables us to remove logging entirely from felice (we can define a callback function to be invoked if an error is returned from a handler), and it also enables us to remove the metrics dependency too, which is another thing that's really a concern of the galaxy-level code rather than felice.
Then we can implement specialized handler "middleware" wrappers inside kafka-go to implement the desired retry logic, including support for time-limited retries and logging using the standard galaxy logging package.
Doing this will require a backwardly incompatible change, as we wish to be able to tear down Kafka consumers gracefully and if a handler is responsible for retrying, it could block indefinitely with no way to tell it to stop. So the proposed solution is to change the felice handler to accept a context.Context value.
Although this is an API-breaking change, we can avoid breaking code in universe, because that code exclusively uses kafka-go as an API, so we can maintain API compatibility of kafka-go instead. In kafka-go, we can define a handler type, LegacyHandler, being the same as the current felice.Handler type, and add a new method, HandleContext, to use the new felice handler type.
Currently, the felice library retries forever whenever an error is returned, on the assumption that errors are temporary.
However, many errors are not temporary. For example, if a message cannot be decoded, that error is permanent.
Also, for some messages, even when the error is "temporary" (for example because a database is down), it doesn't make sense to retry forever, because the message is inherently time-bounded in some sense.
Currently code that wants to avoid retries tends to log an error and return nil. This is awkward and repetitive.
One possible solution would be to add logic to the felice consumer package that would recognize specific kinds of error and avoid retrying in those cases, but log instead. This would add more complexity to the felice package, and it also adds a tighter binding between felice and the system logger. Currently, felice can produce log messages, but it uses a different kind of logger to that used by most Galaxy-based services (the standard log package rather than logrus).
Instead of adding complexity to the felice package, I propose that a better solution is to be to remove complexity from it by removing the retry logic entirely, leaving it up to the caller (kafka-go) to implement its own retry mechanisms. If a handler returns an error, the message is simply discarded. This enables us to remove logging entirely from felice (we can define a callback function to be invoked if an error is returned from a handler), and it also enables us to remove the metrics dependency too, which is another thing that's really a concern of the galaxy-level code rather than felice.
Then we can implement specialized handler "middleware" wrappers inside kafka-go to implement the desired retry logic, including support for time-limited retries and logging using the standard galaxy logging package.
Doing this will require a backwardly incompatible change, as we wish to be able to tear down Kafka consumers gracefully and if a handler is responsible for retrying, it could block indefinitely with no way to tell it to stop. So the proposed solution is to change the felice handler to accept a context.Context value.
Although this is an API-breaking change, we can avoid breaking code in universe, because that code exclusively uses kafka-go as an API, so we can maintain API compatibility of kafka-go instead. In kafka-go, we can define a handler type, LegacyHandler, being the same as the current felice.Handler type, and add a new method, HandleContext, to use the new felice handler type.
The whole proposed API is here.