kafka-ops / julie

A solution to help you build automation and gitops in your Apache Kafka deployments. The Kafka gitops!
MIT License

Detect divergences between local state and the remote cluster current status #478

Closed. purbon closed this issue 2 years ago.

purbon commented 2 years ago

This PR introduces a new feature to detect changes between the local state and the current status of the remote cluster. For this first version, JulieOps will raise an exception 🤯 until the divergence is fixed, either by updating the local state or by fixing the remote cluster. Note that in future versions this behaviour will be extended with more granular support.

Managers supported:

Note: Changing resources outside the scope of JulieOps is not a good practice, but this PR will help teams detect such cases if they happen.
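
For illustration only, here is a minimal sketch of the idea described above. It is not the actual JulieOps implementation; the class, method, and exception names are invented. The manager computes the set of resources recorded in the local state but absent from the cluster and aborts if that set is non-empty:

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the divergence check described above; the class,
// method and exception names are invented and do not mirror the real
// JulieOps implementation.
public class DivergenceCheckSketch {

  // Thrown when the local state and the remote cluster no longer agree,
  // analogous to the RemoteValidationException reported later in this thread.
  static class RemoteStateDivergedException extends RuntimeException {
    RemoteStateDivergedException(String message) {
      super(message);
    }
  }

  // Abort the run if the local state references resources that no longer
  // exist in the cluster.
  static void verifyNoDivergence(Set<String> localState, Set<String> clusterState) {
    Set<String> onlyLocal = new HashSet<>(localState);
    onlyLocal.removeAll(clusterState);
    if (!onlyLocal.isEmpty()) {
      throw new RemoteStateDivergedException(
          "These resources are in your local state, but not in the cluster: " + onlyLocal);
    }
  }

  public static void main(String[] args) {
    Set<String> local = Set.of("team.a.topic", "team.b.topic");
    Set<String> remote = Set.of("team.a.topic"); // a topic was deleted out of band
    verifyNoDivergence(local, remote);           // raises RemoteStateDivergedException
  }
}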

sverrehu commented 2 years ago

Hi, again! Thanks for including my patch in 4.1.3. However, the divergence detector that was also included broke backwards compatibility again. We run JulieOps with allow.delete.topics=false, because we have an outside process for handling obsolete topics. With 4.1.3 JulieOps thus throws an exception because the state contains these topics while the cluster does not. Due to the way we use JulieOps, this is expected and not an error.

How should we handle this? Can the divergence checker be made optional?

akselh commented 2 years ago

@purbon , as Sverre mentions this feature causes some issues.

  1. For the case with allow.delete.topics=false, the normal workflow is to remove topics from the topology first and then remove them completely later; new runs of JulieOps might easily be triggered in between in a multi-tenant setup. So when allow.delete.topics=false, this feature should not throw an exception.
  2. A more general issue is the case where the JulieOps process for some reason crashes during an execution, with some updates applied to the cluster and some not... In this situation we will be in a deadlock with no easy way out.
    • The normal way to handle it would be to just run JulieOps one more time.

So at least this feature should be controlled by a feature flag. Or maybe just log these as errors/warnings without terminating JulieOps?
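
If such a flag were added, the gating logic could look roughly like the sketch below. This is purely an illustration: the property keys (validate.remote.state.enabled, validate.remote.state.warn.only) are made up for this sketch and are not real JulieOps configuration options.

import java.util.Properties;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical sketch of the behaviour proposed above: gate the divergence
// check behind a flag and optionally downgrade it to a warning. The property
// keys used here are invented for illustration and are not real JulieOps
// configuration options.
public class OptionalDivergenceCheckSketch {

  private static final Logger LOG = Logger.getLogger("divergence-check");

  static void checkDivergences(Properties config, Set<String> onlyInLocalState) {
    boolean checkEnabled =
        Boolean.parseBoolean(config.getProperty("validate.remote.state.enabled", "true"));
    boolean warnOnly =
        Boolean.parseBoolean(config.getProperty("validate.remote.state.warn.only", "false"));

    if (!checkEnabled || onlyInLocalState.isEmpty()) {
      return; // check disabled, or nothing has diverged
    }
    String message = "Local state and remote cluster have diverged: " + onlyInLocalState;
    if (warnOnly) {
      LOG.warning(message); // surface the divergence but let the run continue
    } else {
      throw new IllegalStateException(message); // behaviour reported for 4.1.3: abort the run
    }
  }
}

A warn-only mode would also cover the crash-recovery case above, since the next run could reconcile the state instead of deadlocking.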

Side note: as you state, "Changing resources outside the scope of JulieOps is not a good practice". However, I think the way to handle this is to bootstrap ACLs correctly for the cluster: only the JulieOps internal/admin user should be allowed to change topics/ACLs etc. after cluster installation/setup.

egarjans commented 2 years ago

Hello!

I'm wondering if anybody else is having problems managing schema-registry permissions using JulieOps with the RBAC provider after this change (I am using version 4.2.0 with Confluent Platform 7.0). Permissions are created on the first run, but subsequent executions fail with this error:

com.purbon.kafka.topology.exceptions.RemoteValidationException: Your remote state has changed since the last execution, this ACL(s): 'Subject', 'test.ega.topic-value', '*', 'ResourceOwner', 'User:egarjans', 'LITERAL' are in your local state, but not in the cluster, please investigate!
    at com.purbon.kafka.topology.AccessControlManager.detectDivergencesInTheRemoteCluster(AccessControlManager.java:110)
    at com.purbon.kafka.topology.AccessControlManager.loadActualClusterStateIfAvailable(AccessControlManager.java:89)
    at com.purbon.kafka.topology.AccessControlManager.updatePlan(AccessControlManager.java:72)
    at com.purbon.kafka.topology.JulieOps.run(JulieOps.java:200)
    at com.purbon.kafka.topology.JulieOps.run(JulieOps.java:225)
    at com.purbon.kafka.topology.CommandLineInterface.processTopology(CommandLineInterface.java:212)
    at com.purbon.kafka.topology.CommandLineInterface.run(CommandLineInterface.java:161)
    at com.purbon.kafka.topology.CommandLineInterface.main(CommandLineInterface.java:147)

The descriptor.yml file looks like this:

context: test
# source: source
projects:
  - name: ega
    schemas:
      - principal: "User:egarjans"
        subjects:
          - "test.ega.topic-value"
        role: "ResourceOwner"

The .cluster-state file contains the new ACL:

{
    "resourceType" : "Subject",
    "resourceName" : "test.ega.topic-value",
    "host" : "*",
    "operation" : "ResourceOwner",
    "principal" : "User:egarjans",
    "pattern" : "LITERAL",
    "scope" : {
      "clusters" : {
        "kafka-cluster" : "9OpGFe2SSQC9HiEFXSBCpw",
        "schema-registry-cluster" : "schema-registry"
      },
      "resources" : [ {
        "name" : "test.ega.topic-value",
        "patternType" : "LITERAL",
        "resourceType" : "Subject"
      } ]
    }
  },

From the Confluent CLI I can see that the permissions exist on the cluster, but validation still fails:

egarjans@LTPF2M88JH:~$ confluent iam rolebinding list --kafka-cluster-id 9OpGFe2SSQC9HiEFXSBCpw --schema-registry-cluster-id schema-registry --principal "User:adm.e.garjans" --role ResourceOwner
      Principal      |     Role      | ResourceType |         Name         | PatternType
+--------------------+---------------+--------------+----------------------+-------------+
  User:egarjans      | ResourceOwner | Subject      | test.ega.topic-value | LITERAL