strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

How to monitor or get notified with failed task in kctr? #3593

Closed saranyaeu2987 closed 3 years ago

saranyaeu2987 commented 4 years ago

I have multiple KafkaConnector resources running:

$ kubectl get kctr
NAME                                   AGE
heb-emd-ck-coll-cntl-connector         47h
heb-emd-cmptr-typ-connector            47h
heb-emd-company-connector              47h
heb-emd-dist-connector                 47h
heb-emd-ecomm-mkt-area-connector       47h
heb-emd-financial-div-connector        47h
heb-emd-frnt-end-connector             47h
heb-emd-heb-dt-connector               47h
heb-emd-lin-of-bus-connector           47h
heb-emd-location-connector             47h
heb-emd-location-type-connector        47h
heb-emd-mkt-rgn-connector              47h
heb-emd-retail-loc-connector           47h
heb-emd-retl-loc-frmt-connector        47h
heb-emd-retl-loc-segm-connector        47h
heb-emd-retl-loc-stat-typ-connector    47h
heb-emd-retl-loc-typ-connector         47h
heb-emd-retl-unld-connector            47h
heb-emd-srs-sys-connector              47h
heb-emd-str-opstmt-hrb-typ-connector   47h
heb-emd-str-opstmt-srt-cd-connector    47h
heb-emd-str-opstmt-typ-connector       47h
heb-pharma-prepay-detail-connector     40h
heb-s3-selumalai                       45h

Some of the tasks in these kctr have failed with a null pointer exception:

image

My questions are

  1. If a task fails, will the other running tasks in the Kafka connector continue to read and push data to the destination? (I am seeing data loss when a task fails.)
  2. How can I watch/monitor for such failed tasks, as they are causing data loss?

scholzj commented 4 years ago

I think there are some Prometheus metrics with connector and task startup failures. I guess these can be used?

I wonder if a connector with failed tasks should also have some condition noting it and not be just ready. WDYT @tombentley?

tombentley commented 4 years ago

@scholzj it might be slightly easier to detect failed tasks if there were a single condition to look out for, rather than having to iterate over a list. But I guess it would also need to fit nicely with https://github.com/strimzi/proposals/blob/master/007-restarting-kafka-connect-connectors-and-tasks.md, so that it's clear whether a connector has failed but is being restarted (so no immediate action is necessary) vs. failed and either auto restart is disabled or the max number of restarts has been exceeded (i.e. do something).
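To illustrate what "iterating a list" means in practice, here is a minimal Python sketch. It assumes the status document has the shape returned by the Kafka Connect REST API for `GET /connectors/{name}/status`: a `connector` object with a `state`, and a `tasks` array whose entries carry `id` and `state`. The helper names and the sample data are made up for illustration and are not part of Strimzi.

```python
def failed_tasks(status):
    """Return the ids of all tasks reported as FAILED in a status document."""
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]


def needs_attention(status):
    """Collapse the connector state and the task list into one yes/no signal."""
    connector_failed = status.get("connector", {}).get("state") == "FAILED"
    return connector_failed or bool(failed_tasks(status))


# Hypothetical status document, for illustration only.
sample = {
    "name": "heb-emd-dist-connector",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
        {"id": 1, "state": "FAILED", "worker_id": "10.0.0.1:8083",
         "trace": "java.lang.NullPointerException"},
    ],
}

print(failed_tasks(sample))     # [1]
print(needs_attention(sample))  # True
```

Note that the connector itself can report RUNNING while one of its tasks is FAILED, which is why the task list has to be checked separately.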

saranyaeu2987 commented 4 years ago

@tombentley @scholzj Why are the failed tasks not being restarted or removed? I am seeing data loss with failed tasks.

I also see failures due to "org.apache.kafka.connect.errors.ConnectException: Task already exists". Whats

image

scholzj commented 4 years ago

Restarting failed tasks is currently something you have to do manually. There is a proposal to automate this in the future, but it is not implemented yet.

I do not know what the error means - I have never seen it. I would not expect this to cause any message loss. The tasks are not running, but why would that cause any messages to be lost?

saranyaeu2987 commented 4 years ago

@scholzj

  1. What command should I use to start the task manually?
  2. The message counts in the topic and in the destination do not match. I thought it was because of the failed task. Can there be any other reasons? Any guidance on finding the reason for the data loss?

scholzj commented 4 years ago

What command should I use to start the task manually?

You would need to use the REST API for it: http://kafka.apache.org/documentation/#connect_rest ... you can for example exec into the Connect pod and talk to localhost:8083.
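As a rough sketch of that REST call in Python: the Kafka Connect REST API documents `POST /connectors/{name}/tasks/{taskId}/restart` for restarting a single task. The function names below are made up for illustration, and the example assumes the script runs somewhere that can reach the Connect REST API (for instance through a port-forward to port 8083).

```python
import urllib.request
from urllib.parse import quote


def restart_task_request(base_url, connector, task_id):
    """Build (but do not send) the POST request that restarts one task."""
    url = "{}/connectors/{}/tasks/{}/restart".format(
        base_url.rstrip("/"), quote(connector, safe=""), int(task_id)
    )
    return urllib.request.Request(url, method="POST")


def restart_task(base_url, connector, task_id, timeout=10):
    """Send the restart request and return the HTTP status code."""
    request = restart_task_request(base_url, connector, task_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Only builds the request here; call restart_task(...) to actually send it.
req = restart_task_request("http://localhost:8083", "heb-emd-dist-connector", 0)
print(req.get_method(), req.full_url)
```

The connector name and task id would come from the connector's status, where each task is listed with its `id` and `state`.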

The message counts in the topic and in the destination do not match. I thought it was because of the failed task. Can there be any other reasons? Any guidance on finding the reason for the data loss?

My expectation would be that if the task is not running, the messages might be delayed or not forwarded yet. If they are actually lost, then either the retention on the topic is too short and needs to be increased, or the connector has a bug.

saranyaeu2987 commented 4 years ago

@scholzj

  1. I see the message in the topic, but not in the destination. So would it be an issue with the connector?

Different question

  1. Can I have 2 different Kafka Connect clusters, each running its own kctr? (something like below)

kubectl get kc
NAME                   DESIRED REPLICAS
emd-kafka-cluster      1
pharma-kafka-cluster   3

scholzj commented 4 years ago

I see the message in the topic, but not in the destination. So would it be an issue with the connector?

I do not even know what connector you are talking about. But it may be that it just isn't sending the messages right now and will do so once it is running again. That is not the same as losing them. You would need to check the offsets to see whether they were already consumed or not.
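A small Python sketch of that offset check, under a few assumptions: for a sink connector, Kafka Connect commits consumed offsets in a consumer group that is named `connect-<connector name>` by default, and the committed and log-end offsets per partition have already been collected by some other means (for example with the consumer group tooling that ships with Kafka). The function name and the numbers are made up for illustration.

```python
def partition_lag(committed, end_offsets):
    """Return, per partition, how many records have not been consumed yet.

    Assumption: a partition without a committed offset is treated as
    "nothing consumed" (offset 0), which overstates the lag if old
    records were already removed by retention.
    """
    lag = {}
    for partition, end in end_offsets.items():
        consumed = committed.get(partition, 0)
        lag[partition] = max(end - consumed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
committed = {0: 120, 1: 95}
end_offsets = {0: 120, 1: 140, 2: 10}

lag = partition_lag(committed, end_offsets)
print(lag)                # {0: 0, 1: 45, 2: 10}
print(sum(lag.values()))  # 55
```

A non-zero lag means the records are still in the topic and have not been consumed yet, so they are delayed rather than lost. If the lag is zero and records are still missing in the destination, the problem is more likely on the connector side.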

Can I have 2 different Kafka Connect clusters, each running its own kctr? (something like below)

Not sure I follow ... you can of course have two Connect clusters, each connected to a different Kafka cluster. But you cannot have one Connect cluster which is at the same time connected to multiple Kafka clusters and has different connectors for each of them. (Unless you have a connector which has its own client and connects to Kafka on both sides of course, which is how Mirror Maker 2 works.)

saranyaeu2987 commented 4 years ago

Not sure I follow ... you can of course have two Connect clusters, each connected to a different Kafka cluster. But you cannot have one Connect cluster which is at the same time connected to multiple Kafka clusters and has different connectors for each of them. (Unless you have a connector which has its own client and connects to Kafka on both sides of course, which is how Mirror Maker 2 works.)

I mean 2 Connect clusters connecting to the same Kafka cluster, but listening to different topics and pushing data to different destinations. Does Strimzi allow it?

scholzj commented 4 years ago

Oh yes, you can have as many Connect clusters connecting to the same Kafka cluster as you want. But you have to keep this in mind: https://strimzi.io/docs/operators/latest/full/using.html#con-kafka-connect-multiple-instances-deployment-configuration-kafka-connect

Each Connect cluster needs to have its own topics and its own group. So each needs to have different values for these options:

    group.id: connect-cluster
    offset.storage.topic: connect-cluster-offsets
    config.storage.topic: connect-cluster-configs
    status.storage.topic: connect-cluster-status
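One way to keep these four values distinct is to derive them from the name of each Connect cluster. The Python sketch below does that for the two clusters mentioned earlier in the thread; the naming scheme and the helper names are just an assumption for illustration, and any scheme works as long as no value is shared between clusters.

```python
def connect_config(cluster_name):
    """Derive the four per-cluster settings from a Connect cluster name."""
    return {
        "group.id": cluster_name,
        "offset.storage.topic": cluster_name + "-offsets",
        "config.storage.topic": cluster_name + "-configs",
        "status.storage.topic": cluster_name + "-status",
    }


def shared_values(a, b):
    """Return the option names whose values collide between two configs."""
    return {key for key in a if a[key] == b.get(key)}


emd = connect_config("emd-kafka-cluster")
pharma = connect_config("pharma-kafka-cluster")

print(emd["offset.storage.topic"])  # emd-kafka-cluster-offsets
print(shared_values(emd, pharma))   # set()
```

An empty result from `shared_values` means the two Connect clusters do not share a group or any of the internal topics.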

scholzj commented 4 years ago

@saranyaeu2987 Anything more we can help with here?