linkedin / Burrow

Kafka Consumer Lag Checking

Question: How to observe consumer lag, in a centralized way, in Confluent Cloud #642

robertpaker888 opened this issue 4 years ago

robertpaker888 commented 4 years ago

Hi,

We want to centralize the way we handle consumer lag. Currently we have a number of services deployed in an AWS ECS Fargate cluster. Each service is configured to receive a statistics payload, which we inspect for consumer lag and then publish as a custom metric into CloudWatch, which allows us to act accordingly. The downside to this approach is that we have to update any new services with this code. It's also reliant on the driver providing this feature, and not all our services are written in the same language.
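For reference, the publish step in each service boils down to a single CloudWatch call, something like this (the namespace, dimension, and value here are illustrative, not our actual setup):

aws cloudwatch put-metric-data \
  --namespace "Kafka" \
  --metric-name "ConsumerLag" \
  --dimensions "ConsumerGroup=orders-service" \
  --value 42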

Burrow would appear to be the tool for the job. Although, from my reading, it seems it doesn't work with Confluent Cloud. Is this still the case?

If Burrow cannot work with Confluent Cloud, we see two other options, described below. Your thoughts would be greatly appreciated:

Option 1.

Create a new ECS service in our cluster that runs a script, which would in turn run the "kafka-consumer-groups" console admin tool on an interval and then publish the CloudWatch metric. In your experience, does this seem like a sensible approach that could work?
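For clarity, the call the script would wrap is roughly the following (the bootstrap server and group name are placeholders, and client.properties would hold the Confluent Cloud credentials):

kafka-consumer-groups.sh \
  --bootstrap-server pkc-xxxxx.us-east-1.aws.confluent.cloud:9092 \
  --command-config client.properties \
  --describe --group my-consumer-group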

Option 2.

The other option I've seen that may work with Confluent Cloud is to set up an alert on consumer lag in Control Center, then have a Lambda in AWS issue a REST API request on a schedule to get the alert history, and push the metric from there.

vlasov01 commented 3 years ago

Hi,

I'm in a similar situation. I think option 2 will not work, as Confluent Cloud doesn't keep consumer lag alert history, per https://github.com/vdesabou/kafka-docker-playground/tree/master/ccloud/ccloud-demo#alerts .

Have you looked at the other options available at https://docs.confluent.io/cloud/current/monitoring/monitor-lag.html# ? I'm wondering what your decision was.
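From that page, the Metrics API looks like the most promising option to me: it exposes a consumer_lag_offsets metric you can query per consumer group. A rough sketch of such a query (the cluster id, credentials, and interval are placeholders, and the payload shape is my best reading of the docs):

curl -s -u "$API_KEY:$API_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "aggregations": [{"metric": "io.confluent.kafka.server/consumer_lag_offsets"}],
    "filter": {"field": "resource.kafka.id", "op": "EQ", "value": "lkc-xxxxx"},
    "group_by": ["metric.consumer_group_id"],
    "granularity": "PT1M",
    "intervals": ["2021-01-01T00:00:00Z/PT1H"]
  }' \
  https://api.telemetry.confluent.cloud/v2/metrics/cloud/query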

Regards

robty123 commented 3 years ago

We went with option 1; it's working well and doing what we need. We have a Fargate service running in AWS which continuously runs a bash script. When the Fargate service starts up, the bash script reads in a text file containing the consumer group names we want to monitor. It then issues a kafka-consumer-groups command, which returns the lag. We parse the results of the kafka-consumer-groups command to get the lag value. Next, we publish a custom metric into AWS CloudWatch. Finally, we use the custom metric to auto-scale the relevant services.
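In outline, the script looks something like this (a sketch rather than our exact code; the file paths, metric namespace, sleep interval, and the assumption that LAG is the sixth column of the describe output are all specific to our setup):

#!/usr/bin/env bash
# One consumer group name per line in this file
GROUPS_FILE=/etc/lag-monitor/consumer-groups.txt

while true; do
  while read -r group; do
    # Describe the group and sum the LAG column (field 6) across partitions,
    # skipping the header and any non-numeric entries
    lag=$(kafka-consumer-groups.sh \
            --bootstrap-server "$BOOTSTRAP_SERVER" \
            --command-config /etc/lag-monitor/client.properties \
            --describe --group "$group" \
          | awk '$6 ~ /^[0-9]+$/ { sum += $6 } END { print sum + 0 }')

    # Publish the total lag as a custom CloudWatch metric
    aws cloudwatch put-metric-data \
      --namespace "Kafka/ConsumerLag" \
      --metric-name "Lag" \
      --dimensions "ConsumerGroup=$group" \
      --value "$lag"
  done < "$GROUPS_FILE"
  sleep 60
done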

Note our Docker image used to run the service is packaged with the Kafka command-line tools, so we have access to the kafka-consumer-groups command. See below:

# Download and extract the Kafka distribution so bin/kafka-consumer-groups.sh is available
RUN curl https://downloads.apache.org/kafka/2.5.0/kafka_2.12-2.5.0.tgz --output KafkaCommandLineTools.tgz \
    && tar -xzf KafkaCommandLineTools.tgz

The link below shows how to configure the command-line tools to work with the Confluent cluster: https://www.confluent.io/blog/using-apache-kafka-command-line-tools-confluent-cloud/
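For completeness, the resulting client.properties is along these lines (placeholders, not real credentials):

bootstrap.servers=pkc-xxxxx.us-east-1.aws.confluent.cloud:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<API_KEY>" password="<API_SECRET>";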