tulios / kafkajs

A modern Apache Kafka client for Node.js
https://kafka.js.org
MIT License
3.7k stars 522 forks

[Question] How to check if consumer is healthy? #1010

Open chiararelandini opened 3 years ago

chiararelandini commented 3 years ago

I would like to check the healthiness of a kafkajs consumer running on kubernetes. I've read about several approaches but I am not sure they cover all situations for which the consumer might stop consuming messages. The approach I'm considering is to monitor the Instrumentation events STOP, DISCONNECT, CRASH to understand if the state is not healthy. Is there any case in which the consumer is not working properly which is not included in the cases generating one of the previous events? Or should I rather monitor the HEARTBEAT interval to check that the consumer is alive? Would it cover all possible cases?

Moreover, how is it possible to test unhealthy behavior locally? For example, when would a CRASH event be emitted? And how could I test this situation?

tulios commented 3 years ago

Hi @chiararelandini, as you already discovered, this is not a straightforward task 😄. You can approach this in different ways, and they are all valid and perhaps complementary:

a) You can ignore client monitoring completely and focus on the server by tracking the topic's high-water mark and the last committed offset of the consumer. The difference between the two (lag) will increase if you have issues or lack processing power. The problem with this approach is that it might take some time for you to detect issues, and in some scenarios it is hard to determine the amount of lag that is healthy.

b) You can track the heartbeats of the client. This indicates that the client is alive. The time to detect any issues will be based on your session timeout, and it is in your control. This approach's problem is that a heartbeat doesn't mean that the consumer is working as expected. It only means that it is alive. Your use case might dictate your session timeout, so in some cases, the session timeout might be quite big and delay your alerts.

c) You can add a business metric to your consumer and measure that. The problem with this approach is complexity. In most cases, it is not that simple to measure your system's healthiness from within your message handler, or from a single instance of your service.

So what is the recommendation? Mix and match. Use a little bit of all the alternatives above, adjusting to your reality.

You can measure lag from the client or server depending on what's easier and set alarms if the lag grows more than a certain threshold in a defined period. You should also track the heartbeats so you can quickly detect issues with a particular instance or with the code you just deployed. Finally, if you can, add a business metric on top of it. For example, in my previous company, we had a metric based on the total time it took a payment transaction to happen. Since our system was at the end of the flow, we tracked the portion of the time on our side and alerted if it breached our SLAs.
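The heartbeat tracking in option (b) can be sketched as below. The liveness rule itself is a pure function; the wiring uses the real kafkajs `HEARTBEAT` instrumentation event, but the `trackHeartbeats` helper and its name are only illustrative, and `sessionTimeoutMs` should match your consumer's `sessionTimeout` config:

```javascript
// Liveness rule for option (b): the consumer counts as alive if a heartbeat
// was observed within the session timeout; after that, the broker evicts the
// group member anyway, so there is no point in a looser threshold.
function isAlive(nowMs, lastHeartbeatMs, sessionTimeoutMs) {
  return nowMs - lastHeartbeatMs < sessionTimeoutMs
}

// Illustrative wiring to a kafkajs consumer's HEARTBEAT instrumentation
// event (requires a real consumer and a reachable broker). kafkajs event
// listeners receive an envelope { id, type, timestamp, payload }.
function trackHeartbeats(consumer, sessionTimeoutMs) {
  let lastHeartbeat = Date.now()
  consumer.on(consumer.events.HEARTBEAT, ({ timestamp }) => {
    lastHeartbeat = timestamp
  })
  // Returns a check function a health route could call.
  return () => isAlive(Date.now(), lastHeartbeat, sessionTimeoutMs)
}
```

A health endpoint could then call the returned check function and answer 200 or 503 accordingly.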

chiararelandini commented 3 years ago

Thank you @tulios for your reply. However, I am not really interested in monitoring the service, but in knowing exactly when the consumer is not working properly, in order to let it be restarted by Kubernetes. I have a health check route where I need to retrieve the health state of the consumer. I was planning to do this by means of an internal state updated whenever a bad event is emitted by the consumer. Do you think it would be accurate enough?

t-d-d commented 3 years ago

Presuming you are referring to the kubernetes liveness probe (readiness probe is another matter). What we do:

- just simply return a 200 on the liveness route. This in itself indicates the node.js event loop is running without too much lag
- configure the kafka.js consumer to NOT automatically restart, so that after retries the process will exit with an error. Thus the pod will be restarted.
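A minimal sketch of the "do not automatically restart" setup, assuming a KafkaJS version that supports `restartOnFailure` in the retry config and the `restart` flag on the CRASH payload (broker address and group id are placeholders):

```javascript
const { Kafka } = require('kafkajs')

// restartOnFailure is consulted by the consumer after its retries are
// exhausted; returning false leaves the consumer stopped instead of
// restarting it internally.
const kafka = new Kafka({
  brokers: ['localhost:9092'], // placeholder
  retry: {
    restartOnFailure: async () => false,
  },
})

const consumer = kafka.consumer({ groupId: 'my-group' }) // placeholder

// CRASH payloads carry { error, groupId, restart }. When restart is false,
// the consumer will not recover on its own, so exit non-zero and let
// Kubernetes restart the pod.
consumer.on(consumer.events.CRASH, ({ payload }) => {
  if (!payload.restart) {
    process.exit(1)
  }
})
```

With this config, the liveness route can stay a plain 200: a fatal crash kills the process, and the pod restart does the rest.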

chiararelandini commented 3 years ago

Thanks @t-d-d for your reply. Still, I would really prefer to focus on the events emitted by the consumer, because I need to know whenever the consumer, even if alive, is not working as expected. Is there a way to check if they cover all possible situations?

Nevon commented 3 years ago

The problem is that there is no way you can deduce that just from events, because "working as expected" is a business metric, not a technical one. Yes, you can probably use excessive CRASH events to infer that something is wrong, but it doesn't in any way mean that it's the consumer that's broken or that replacing the container will fix anything. It could just mean that the Kafka brokers are down, for example, in which case replacing the container won't do anything.

silversoul93 commented 3 years ago

> The problem is that there is no way you can deduce that just from events, because "working as expected" is a business metric, not a technical one. Yes, you can probably use excessive CRASH events to infer that something is wrong, but it doesn't in any way mean that it's the consumer that's broken or that replacing the container will fix anything. It could just mean that the Kafka brokers are down, for example, in which case replacing the container won't do anything.

I don't think she was talking about business metrics. Message handling is certainly a business concern and, like you said, needs specific checks.

I use events like CRASH, DISCONNECT and STOP to know if the consumer is healthy, i.e. whether the consumer is correctly listening to Kafka and the connection between Kafka and the consumer is still up and running; for message handling I use other, specific health checks.

I think that @chiararelandini means "Can/should we use these events to check if the consumer (or the producer, with other events) is correctly communicating with Kafka?".

Personally, I think these events exist exactly for this purpose, are very useful, and every Kafka library should provide them 😄

tulios commented 3 years ago

> I think that @chiararelandini means "Can/should we use these events to check if the consumer (or the producer, with other events) is correctly communicating with Kafka?".

Then yes, you can use the events to determine if the consumer is there and is talking to Kafka.
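An event-driven health flag along these lines could be sketched as follows; the helper names are made up, but the event names are real kafkajs instrumentation events:

```javascript
// Internal health flag driven purely by instrumentation events: CONNECT
// flips it healthy, CRASH/DISCONNECT/STOP flip it unhealthy. This only
// answers "is the consumer talking to Kafka", not "is it doing useful work".
function createHealthState() {
  let healthy = false
  return {
    up: () => { healthy = true },
    down: () => { healthy = false },
    isHealthy: () => healthy,
  }
}

// Illustrative wiring against a kafkajs consumer.
function watchConsumer(consumer, state) {
  const { CONNECT, CRASH, DISCONNECT, STOP } = consumer.events
  consumer.on(CONNECT, state.up)
  for (const event of [CRASH, DISCONNECT, STOP]) {
    consumer.on(event, state.down)
  }
}
```

An HTTP health route would then just return 200 or 503 based on `state.isHealthy()`.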

RahulJyala7 commented 3 years ago

Can I use the admin API, e.g. describeGroups, to check whether the consumer is still alive? Or do I have to depend on the instrumentation events STOP, DISCONNECT, CRASH?

silversoul93 commented 3 years ago

> Can I use the admin API, e.g. describeGroups, to check whether the consumer is still alive? Or do I have to depend on the instrumentation events STOP, DISCONNECT, CRASH?

describeGroups does not prove that the consumer is alive and working; it only proves that Kafka is reachable. We were using describeGroups and, fortunately in a non-production environment, we discovered that the consumer was broken while the health check was still positive.

I suggest you use the events instead.

simoncpu commented 2 years ago

Hi @chiararelandini, this question is a year old, but how did you end up implementing your liveness probe? We're currently using httpGet, but I want to remove the HTTP server since it's unnecessary. I'm thinking of just using a ps command to probe for liveness and using cat for readiness.

jthiesse commented 1 year ago

> Presuming you are referring to the kubernetes liveness probe (readiness probe is another matter.) What we do:
> - just simply return a 200 on the liveness route. This in itself indicates the node.js event loop is running without too much lag
> - configure kafka.js consumer to NOT automatically restart, so that after retries the process will exit with an error. Thus the pod will be restarted.

@t-d-d About the "readiness probe is another matter" part: what has been your approach to checking "readiness" without starting to process messages (other dependencies may not be ready yet)?

All ideas I have come up with thus far feel too clever and over-engineered for something that should be straightforward.

t-d-d commented 1 year ago

@jthiesse I don't quite get what you are asking. Are you really interested in k8s readiness? Or service start-up?

The k8s readiness probe indicates that your process is able to handle client requests: requests will only be directed to the service instance if it is 'ready'. For some of our backend stream processors (which only read from and write to Kafka) we don't define one (although we have started to, based on whether the consumer is rebalancing or not. But this is purely a convenient way to observe whether it's processing data or not.)

For the issue of service start-up and dealing with a dependent service not being there, we crash out and let k8s restart the pod with back-off: just throw an error.
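The rebalancing-based readiness idea mentioned above could be sketched with the REBALANCING and GROUP_JOIN instrumentation events (both real kafkajs events; the helper names are illustrative):

```javascript
// Readiness flag: not ready while the group is rebalancing, ready once
// the consumer has joined the group and been assigned partitions.
function createReadiness() {
  let ready = false
  return {
    markRebalancing: () => { ready = false },
    markJoined: () => { ready = true },
    isReady: () => ready,
  }
}

// Illustrative wiring: REBALANCING fires when a rebalance starts,
// GROUP_JOIN when the consumer has (re)joined the group.
function watchRebalances(consumer, readiness) {
  consumer.on(consumer.events.REBALANCING, readiness.markRebalancing)
  consumer.on(consumer.events.GROUP_JOIN, readiness.markJoined)
}
```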

jthiesse commented 12 months ago

@t-d-d Thanks for the reply.

I am looking for a way to validate the connection to the topic before letting the service start up (readiness check).

I am resolving a promise when listening to the "connect" event from KafkaJS, though getting more than one "connect" event makes this approach feel odd (odd in the sense that I end up resolving an already-resolved promise).
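One way to keep the intent explicit is a resolve-once wrapper. Note that calling a Promise's resolve function more than once is actually a harmless no-op in JavaScript, so the wrapper mostly documents intent; `once` and `ready` are illustrative names, and CONNECT is the real kafkajs instrumentation event (emitted again on reconnects):

```javascript
// Wrap a function so repeated calls only invoke it once.
function once(fn) {
  let called = false
  return (...args) => {
    if (!called) {
      called = true
      fn(...args)
    }
  }
}

// Readiness promise for a kafkajs consumer: resolves on the first CONNECT
// event and explicitly ignores later ones.
function ready(consumer) {
  return new Promise((resolve) => {
    consumer.on(consumer.events.CONNECT, once(resolve))
  })
}
```

A startup sequence could then `await ready(consumer)` before opening the HTTP server.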

silversoul93 commented 12 months ago

@jthiesse We developed this open source library to check the kafka status: https://github.com/mia-platform/kafka-healthchecker

It's based on kafkajs, because we love it and use it a lot. We use it in services on k8s for readiness and liveness probes.

fullammo commented 1 month ago

In our startup sequence, we connect to Kafka with the consumer before we let our application start an HTTP server, so the /liveness health check call doesn't check anything related to Kafka.

For readiness, we haven't found the "perfect" approach yet.

Checking the full business logic my microservice performs with Kafka would be really expensive and cumbersome to do in a health check every 5-10 seconds, and a readiness restart wouldn't help much in those situations anyway.

I would also like the readiness check to verify that the consumer is still able to use the credentials it connected to Kafka with. An endpoint that returns some metadata about the consumer, or any of its resources, using those credentials would be sufficient from Kafka's side. Sadly, consumer.describeGroup gives back somewhat broader information about the whole consumer group than a single consumer instance should arguably be able to query. This way a restart could help if connectivity issues arise; and if my credentials cannot even query simple metadata from Kafka, the error should be severe enough that the microservice health should be "red".

I'm not sure about the kafkajs events though, because CRASH can contain recoverable errors as well, so you have to be careful using it. We are exploring the possibility of using these in the health check as well.

We are actively working on this; if we figure something out, I will let you know :P
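The caution above about recoverable CRASH events can be made concrete: assuming a KafkaJS version whose CRASH payload carries the `restart` flag, only crashes that KafkaJS will not restart from by itself should flip the health flag (`watchCrashes` is an illustrative helper):

```javascript
// CRASH payloads include { error, groupId, restart }. When restart is true,
// KafkaJS considers the error recoverable and restarts the consumer itself,
// so only restart === false should count as fatal.
function isFatalCrash(payload) {
  return payload.restart === false
}

// Illustrative wiring: mark the instance unhealthy only on fatal crashes.
function watchCrashes(consumer, markUnhealthy) {
  consumer.on(consumer.events.CRASH, ({ payload }) => {
    if (isFatalCrash(payload)) markUnhealthy()
  })
}
```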

nivr4 commented 4 weeks ago

hey @fullammo, did you succeed in finding a good solution?

fullammo commented 3 weeks ago

Hey @nivr4 !

The consumer.describeGroup function seems to be a good solution, because that call requires no more permissions than a consumer needs anyway: consumers have to be able to coordinate with their peers in case they are elected group coordinator. With this on-demand function call, you can make sure that the authenticated session of your consumer is still operable.

We chose this solution for now. We didn't mix the business use-case checks (listening and reacting to CRASH and other events) into the health check endpoint. For now. If we do, I will come back with an update here as well :D
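The on-demand probe described above could be sketched like this, wrapping the real `consumer.describeGroup()` call (`probeConsumer` is an illustrative name; a readiness route could return 200 or 503 based on its result):

```javascript
// Credential/connectivity probe: consumer.describeGroup() makes a real
// request to the group coordinator using the consumer's own authenticated
// session, so a successful call shows both connection and credentials work.
async function probeConsumer(consumer) {
  try {
    await consumer.describeGroup()
    return true
  } catch (err) {
    // Any failure (network, auth, broker down) counts as not ready.
    return false
  }
}
```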

r3code commented 1 week ago

Here you can find some input from PagerDuty and Cloudflare