spring-cloud / spring-cloud-stream-binder-kafka

Spring Cloud Stream binders for Apache Kafka and Kafka Streams
Apache License 2.0
331 stars 301 forks source link

Healthcheck is down when consumers are not running #1138

Closed snussbaumer closed 2 years ago

snussbaumer commented 3 years ago

This is a recent enhancement described in https://github.com/spring-cloud/spring-cloud-stream-binder-kafka/issues/1059, more precisely this line of code

I understand the rationale behind it, but this wreaks a bit of havoc in our case : having the listeners not running is a valid case for us.

For example some of our microservices are configured with autoStartup:false and we decide when to start consuming messages based on other factors (the availability of external services needed to process the messages for example)

An another example : we may want to stop the consumption manually (via the POST /actuator/bindings endpoint) for a maintenance operation for example.

Our services are running on Kubernetes, and the liveness probe is now restarting the services in those nominal cases.

A configuration property indicating whether to check if the listener container is running may be a nice way to handle all cases.

sobychacko commented 2 years ago

@snussbaumer We can consider adding a new binder level property that allows the applications to not consider the listener container's status while doing a health check. My thinking is that, if that property is enabled, then the health check will go ahead irrespective of the call to isRunning() on the container. This way, we can satisfy both cases - i.e, by default, the status will be down if the container is not running, but if the property is set, then the status will be up unless there are some other issues with the consumer. The new property could be named something like ignoreContainerStatusForHealthCheck which is false by default. Suggestions for a better property are welcomed.

cc/ @tdanylchuk Since you introduced this change originally, do you have any feedback on this issue?

tdanylchuk commented 2 years ago

Introducing new property will indeed work out for both cases, at the same time I wonder if we can distinguish between failure of consuming messages and actual manual stop of the container, which is expected.

I would probably suggest exposing some sort of isHealthy method on the container to track actual failures. Since in the above case: what if after enabling consumption -> listener will fail? In this case, there would be no feedback to outer systems that the application is not performing well.

snussbaumer commented 2 years ago

Yes the property would work for me thanks !

But now that @tdanylchuk mentions an isHealthy method, I even wonder if isRunning() really returns false if the listener is in an unhealthy state (even after enabling consumption).

So as suggested by Taras, exposing an isHealthy() method instead of relying on isRunning() is really the better way to go

sobychacko commented 2 years ago

Thanks, @tdanylchuk and @snussbaumer for the feedback. @garyrussell What do you think about this proposal of adding a new isHealthy() method on the container?

garyrussell commented 2 years ago

I don't know what criteria the container could use - under what conditions would it return false?

The only case I can think of is if it is configured with a ContainerStoppingErrorHandler called when some specific exception is thrown (i.e. the container self-stops rather than being stopped externally (or never started).

Right now, we have no logic to know whether a stop is due to an error or external.

Perhaps you are proposing setHealthy() as well? In which case the application can be in complete control of the container state.

snussbaumer commented 2 years ago

maybe the the criteria could just be checking whether the container should be running (based on start/stop/pause/resume) and whether it is really running (AbstractMessageListenerContainer.running) ?

garyrussell commented 2 years ago

Oh; you mean simply that isHealthy() returns true if isRunning() || isStopped().

Yes, we can do that.

Please open an issue against spring-kafka.

snussbaumer commented 2 years ago

Well if it is that simple maybe doing the change in the healthcheck indicator would be sufficient ...

garyrussell commented 2 years ago

There currently is no isStopped() - that was hypothetical.

garyrussell commented 2 years ago

https://github.com/spring-projects/spring-kafka/issues/1970

garyrussell commented 2 years ago

https://github.com/spring-projects/spring-kafka/pull/1971

snussbaumer commented 2 years ago

Thank you guys for such a quick and relevant response !