strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0
4.88k stars 1.31k forks source link

[Blocked while deploying KafkaMirrorMaker2 - Issue encountered on the Kafka Connect side] ... #3207

Closed vperi1730 closed 4 years ago

vperi1730 commented 4 years ago

Hi Team,

We have just deployed the KafkaMirrorMaker2 based on the documentation from Strimzi, however, there is some error that is blocking us. Hence looking for help exactly on the issue of how to resolve this.

apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaMirrorMaker2
metadata:
  name: my-mirror-maker2
spec:
  version: 2.4.0 (1)
  replicas: 3 (2)
  connectCluster: "my-cluster-target" (3)
  clusters: (4)
  - alias: "my-cluster-source" (5)
    authentication: (6)
      certificateAndKey:
        certificate: source.crt
        key: source.key
        secretName: my-user-source
      type: tls
    bootstrapServers: my-cluster-source-kafka-bootstrap:9092 (7)
    tls: (8)
      trustedCertificates:
      - certificate: ca.crt
        secretName: my-cluster-source-cluster-ca-cert
  - alias: "my-cluster-target" (9)
    authentication: (10)
      certificateAndKey:
        certificate: target.crt
        key: target.key
        secretName: my-user-target
      type: tls
    bootstrapServers: my-cluster-target-kafka-bootstrap:9092 (11)
    config: (12)
      config.storage.replication.factor: 1
      offset.storage.replication.factor: 1
      status.storage.replication.factor: 1
    tls: (13)
      trustedCertificates:
      - certificate: ca.crt
        secretName: my-cluster-target-cluster-ca-cert
  mirrors: (14)
  - sourceCluster: "my-cluster-source" (15)
    targetCluster: "my-cluster-target" (16)
    sourceConnector: (17)
      config:
        replication.factor: 1 (18)
        offset-syncs.topic.replication.factor: 1 (19)
        sync.topic.acls.enabled: "false" (20)
    heartbeatConnector: (21)
      config:
        heartbeats.topic.replication.factor: 1 (22)
    checkpointConnector: (23)
      config:
        checkpoints.topic.replication.factor: 1 (24)
    topicsPattern: ".*" (25)
    groupsPattern: "group1|group2|group3" (26)

Couple of things to understand from the above CR.

1) What is the difference between (3) and (9), are the names for connectCluster and alias are always identical, Please elaborate. 2) Are (5) and (9) the real cluster names or they can be any alias which represents the bootstrap servers respectively

@@@@@@@@@@@@@@@@@@@Error we have noticed@@@@@@@@@@@@@@@@@

2020-06-17 16:25:29,187 INFO Kafka version: 2.4.0 (org.apache.kafka.common.utils.AppInfoParser) [DistributedHerder-connect-1] 2020-06-17 16:25:29,187 INFO Kafka commitId: 77a89fcf8d7fa018 (org.apache.kafka.common.utils.AppInfoParser) [DistributedHerder-connect-1] 2020-06-17 16:25:29,187 INFO Kafka startTimeMs: 1592411129187 (org.apache.kafka.common.utils.AppInfoParser) [DistributedHerder-connect-1] 2020-06-17 16:25:29,233 INFO [Consumer clientId=consumer-mirrormaker2-cluster-1, groupId=mirrormaker2-cluster] Cluster ID: UYTj0E0PQOmG9PAb6axowg (org.apache.kafka.clients.Metadata) [DistributedHerder-connect-1] 2020-06-17 16:25:29,237 ERROR [Worker clientId=connect-1, groupId=mirrormaker2-cluster] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [DistributedHerder-connect-1] org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [mirrormaker2-cluster-offsets] 2020-06-17 16:25:29,257 INFO Stopped http_8083@2f3c6ac4{HTTP/1.1,[http/1.1]}{0.0.0.0:8083} (org.eclipse.jetty.server.AbstractConnector) [Thread-1] 2020-06-17 16:25:29,260 INFO node0 Stopped scavenging (org.eclipse.jetty.server.session) [Thread-1] 2020-06-17 16:25:29,709 INFO Started o.e.j.s.ServletContextHandler@1280851e{/,null,AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler) [main] 2020-06-17 16:25:29,710 INFO REST resources initialized; server is started and ready to handle requests (org.apache.kafka.connect.runtime.rest.RestServer) [main] 2020-06-17 16:25:29,710 INFO Kafka Connect started (org.apache.kafka.connect.runtime.Connect) [main] 2020-06-17 16:25:29,710 INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect) [Thread-10] 2020-06-17 16:25:29,710 INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer) [Thread-10] 2020-06-17 16:25:29,710 INFO REST server stopped (org.apache.kafka.connect.runtime.rest.RestServer) [Thread-10] 2020-06-17 16:25:29,710 INFO [Worker clientId=connect-1, groupId=mirrormaker2-cluster] Herder stopping (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [Thread-10] 2020-06-17 16:25:34,712 INFO [Worker clientId=connect-1, groupId=mirrormaker2-cluster] Herder stopped (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [Thread-10] 2020-06-17 16:25:34,712 INFO Kafka Connect stopped (org.apache.kafka.connect.runtime.Connect) [Thread-10]

Need help to move on.

scholzj commented 4 years ago

Well, this error suggests that the users you are using are missing some ACLs:

2020-06-17 16:25:29,237 ERROR [Worker clientId=connect-1, groupId=mirrormaker2-cluster] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [DistributedHerder-connect-1]
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [mirrormaker2-cluster-offsets]

The broker logs should tell you exactly which ACL is causing this. But you need to give the users the rights needed for mirroring + the connect rights on the target cluster.

vperi1730 commented 4 years ago

OK, after looking into the logs, I have realized that this topic is not something which was manually created, it is internally referred and to which the authorization is failing.

Does it mean we have to give access to this topic manually using ACL script or how is it?

scholzj commented 4 years ago

Dependes on how you manage the users and the ACLs. Basically you have two options ... give it the rights it needs for all the topics etc. either using the User Operator or manually if you do not use the user operator or just set these users as super users.

vperi1730 commented 4 years ago

OK, I have enabled all topic access using the wildcard '*' which eliminated the error for the topic authorization. However, now I see the following on the group which MM2 is using internally.

2020-06-18 10:08:52,075 INFO [Worker clientId=connect-1, groupId=mirrormaker2-cluster] Cluster ID: UYTj0E0PQOmG9PAb6axowg (org.apache.kafka.clients.Metadata) [DistributedHerder-connect-1] 2020-06-18 10:08:52,076 ERROR [Worker clientId=connect-1, groupId=mirrormaker2-cluster] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [DistributedHerder-connect-1] org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: mirrormaker2-cluster

Any idea how can we rectify this?

scholzj commented 4 years ago

This looks again like ACL - just for a consumer group and not a topic. So you will need to add it as well. TBH, I'm not sure if it uses this group for everything or not.

vperi1730 commented 4 years ago

I agree on this, based on it I have updated my KafkaUser CR and added the following parameters which I believe says this user can access to any groups. Please comment on this and confirm if this should be the approach.

operation: Read
      resource:
        name: '*'(This should access any group without exception, isn't??????)
        patternType: literal
        type: group
scholzj commented 4 years ago

TBH, we just pass the * to Kafka and out of my head I'm not sure if that works or not - I don't think I ever tried it. So I'm afraid you would need to try it your self.

vperi1730 commented 4 years ago

OK, Could you please clarify on the 2 questions.

What is the difference between (3) and (9) in the above CR, I have noticed connectCluster and alias name for the target cluster are same, 1) do they really have to be the same ? 2) for connectCluster do we need to have KafkaConnect CR created and then the name of it is referred herein the Kafka CR?

Need clarification.

vperi1730 commented 4 years ago

Along with the above ask, Do we have anyone who can support me on Kafka MirrorMaker2 issue if you can add someone from strimzi, Is that possible??

I am somehow stuck at one place where I am not able to find ways to come out.

scholzj commented 4 years ago

The clusters array lists set of cluster the Mirror Maker will be connecting to. One of that has to be configured as the connect cluster - that is where the underlying Connect cluster connects. Since the MM2 is using source connectors, this is only the only cluster which can be used as target. So the connectCluster field contains the alias of one of the clusters from the aclusters array.

The only error you had so far are authroization errors. So you just need to track them down in the broker logs and make sure the users have the sufficient rights. Or as mentioned before, you can set the users as super users or disable authorization completely. But nobody can do that for you because we do not know your clusters and their configuration.

vperi1730 commented 4 years ago

I agree on this, Sure, I am trying it, hope will be able to resolve it. Thanks for the guidance, it helps :)

vperi1730 commented 4 years ago

Can you please confirm if the below CR looks good. A couple of points to catch is

apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaMirrorMaker2
metadata:
  name: brcm-mm2
spec:
  version: 2.4.0 
  replicas: 1
  connectCluster: "mm-backup-cluster"
  clusters:
  - alias: "mm-src-cluster"
    authentication:
      certificateAndKey:
        certificate: user.crt
        key: user.key
        secretName: mm-consumer-user
      type: tls
    bootstrapServers: mm-src-cluster-kafka-bootstrap.kafka-mirror-src.svc:9093
    tls:
      trustedCertificates:
      - certificate: ca.crt
        secretName: mm-src-cluster-cluster-ca-cert
  - alias: "mm-backup-cluster"
    authentication:
      certificateAndKey:
        certificate: user.crt
        key: user.key
        secretName: mm-producer-user
      type: tls
    bootstrapServers: mm-backup-cluster-kafka-bootstrap.kafka-mirror.svc:9093
    config:
      group.id: connect-cluster
      config.storage.replication.factor: 1
      offset.storage.replication.factor: 1
      status.storage.replication.factor: 1
      offset.storage.topic: connect-cluster-offsets
      config.storage.topic: connect-cluster-configs
      status.storage.topic: connect-cluster-status

    tls:
      trustedCertificates:
      - certificate: ca.crt
        secretName: mm-backup-cluster-cluster-ca-cert
  mirrors:
  - sourceCluster: "mm-src-cluster"
    targetCluster: "mm-backup-cluster"
    sourceConnector:
      config:
        replication.factor: 1
        offset-syncs.topic.replication.factor: 1
        sync.topic.acls.enabled: "false"
        group.id: connect-cluster
        offset.storage.topic: connect-cluster-offsets
        config.storage.topic: connect-cluster-configs
        status.storage.topic: connect-cluster-status

    heartbeatConnector: 
      config:
        heartbeats.topic.replication.factor: 1 
    checkpointConnector: 
      config:
        checkpoints.topic.replication.factor: 1 
    topicsPattern: ".*"
    groupsPattern: ".*" 

1) group. id: connect-cluster.

For the same group I have given ACL in the kafkauser, other than this I am not doing anything.

- resource:
          type: group
          name: connect-cluster
          patternType: literal
        operation: Read
        host: "*"

Even after the user is given the group permission, groupauthorizationexception continues.

Any other ways i can check...

scholzj commented 4 years ago

It looks good to me. The ACL errors are normally logged by the brokers - so I think you should check the logs of the corresponding brokers to see what exactly is the client trying to do and match that against the ACLs you have given to the users.

vperi1730 commented 4 years ago

From the broker logs, I have noticed there is Principal = User:CN=mm02-prod-user is Denied Operation = Describe from host = 10.124.35.246 on resource = Topic:LITERAL:connect-cluster-offsets (kafka.authorizer.logger) [data-plane-kafka-request-handler-0].

However, my CR for the Kafka user has the support for this operation. Please correct me if I am going wrong somewhere.

apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaUser
metadata:
  name: mm02-prod-user
  labels:
    strimzi.io/cluster: mm-backup-cluster
spec:
  authentication:
    type: tls
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: '*'
          patternType: literal
        operation: Write
        host: "*"
      - resource:
          type: topic
          name: '*'
          patternType: literal
        operation: Read
        host: "*"
      - resource:
          type: topic
          name: '*'
          patternType: literal
        operation: Create
        host: "*"
      - resource:
          type: topic
          name: '*'
          patternType: literal
        operation: Describe
        host: "*"
      - resource:
          type: topic
          name: connect-cluster-offsets
          patternType: literal
        operation: Create
        host: "*"
      - resource:
          type: topic
          name: connect-cluster-offsets
          patternType: literal
        operation: Describe
        host: "*"
      - resource:
          type: group
          name: connect-cluster
          patternType: literal
        operation: Describe
        host: "*"

I remember using Operation: All for this scenario, same error is observed even then.

scholzj commented 4 years ago

Is the user and the ACLs created on the right broker? Is the error message from one of the mm-backup-cluster brokers?

vperi1730 commented 4 years ago

OK, the issue has been resolved now. MM2 is up and running, 1 another point which I have noticed is, topic replication happened successfully but the user replication didn't.

I want to know do we have any other property in the CR to be enabled apart from

sync.topic.acls.enabled: "true"

When I execute

kubectl get kafkauser -n kafka-mirror(target cluster) - Didn't see any of the Kafka users from the source cluster.

I need your help here to resolve.

scholzj commented 4 years ago

The Mirror Maker 2 has no understanding abotu our Kafka User resources. You can use the ACL replication only if you manage the ACL rights directly in Kafka and not using the User Operator.

vperi1730 commented 4 years ago

OK, I understand, Let's say in a real-time scenario if my Cluster1 has 1000 Kafka users and i have launched MM2 for replicating them into the Cluster2.

If I want to replicate all these users into Cluster2, Is MM2 not an ideal option to chose??, How can this be achieved as manual creation or creating some script to automate might not be an option all the time?

scholzj commented 4 years ago

It really depends on your environment and how you manage things. To mirror the KafkaUser resources, you would need to actually mirror the resources on Kubernetes level. Either using some custom tooling, or using some gitops approach etc.

scholzj commented 4 years ago

@vperi1730 Did this helped? So you still need anything or can we close this issue?

vperi1730 commented 4 years ago

This helped. Closing the issue.