mongodb / mongodb-atlas-kubernetes

MongoDB Atlas Kubernetes Operator - Manage your MongoDB Atlas clusters from Kubernetes
http://www.mongodb.com/cloud/atlas
Apache License 2.0

Operator fails while reconciling cluster autoscaling events #606

Closed sunchill06 closed 2 years ago

sunchill06 commented 2 years ago

What did you do to encounter the bug?

We had existing Atlas clusters created and managed by the Alpha version of the Atlas Operator. While upgrading to the operator's GA version, we changed the k8s manifest to use the advancedDeploymentSpec of the AtlasDeployment CR spec. The new spec changed the instance size from M10 to M40 (which was within the autoscaling limits of the existing Atlas cluster).

Previous Spec:

    autoScaling:
      compute:
        enabled: true
        scaleDownEnabled: true
      diskGBEnabled: true
    diskSizeGB: 40
    mongoDBMajorVersion: "4.4"
    numShards: 1
    pitEnabled: false
    providerBackupEnabled: true
    providerSettings:
      autoScaling:
        compute:
          maxInstanceSize: M40
          minInstanceSize: M10
      instanceSizeName: M10
      providerName: GCP
      regionName: WESTERN_EUROPE

New Spec:

    advancedDeploymentSpec:
      backupEnabled: true
      diskSizeGB: 40
      mongoDBMajorVersion: "4.4"
      pitEnabled: false
      replicationSpecs:
      - numShards: 1
        regionConfigs:
        - electableSpecs:
            instanceSize: M40
          autoScaling:
            compute:
              enabled: true
              maxInstanceSize: M40
              minInstanceSize: M10
              scaleDownEnabled: true
            diskGBEnabled: true
          providerName: GCP
          regionName: WESTERN_EUROPE

This resulted in a confusing error:

{"level":"INFO","time":"2022-07-15T10:22:38.851Z","msg":"-> Starting AtlasDeployment reconciliation","atlasdeployment":"platform/singletons-locking","spec":{"projectRef":{"name":"singletons-europe-west1.gcp.escemo.com-1","namespace":""},"advancedDeploymentSpec":{"backupEnabled":true,"diskSizeGB":40,"mongoDBMajorVersion":"4.4","name":"locking","pitEnabled":false,"replicationSpecs":[{"numShards":1,"regionConfigs":[{"electableSpecs":{"instanceSize":"M40"},"autoScaling":{"diskGBEnabled":true,"compute":{"enabled":true,"scaleDownEnabled":true,"minInstanceSize":"M10","maxInstanceSize":"M40"}},"providerName":"GCP","regionName":"WESTERN_EUROPE"}]}]},"backupRef":{"name":"","namespace":""}},"status":{"conditions":[{"type":"Ready","status":"False","lastTransitionTime":"2022-07-14T16:26:30Z"},{"type":"ValidationSucceeded","status":"True","lastTransitionTime":"2022-07-14T16:26:30Z"},{"type":"DeploymentReady","status":"False","lastTransitionTime":"2022-07-14T16:26:30Z","reason":"DeploymentNotUpdatedInAtlas","message":"PATCH https://cloud.mongodb.com/api/atlas/v1.5/groups/60db89b8c5283149d39324aa/clusters/locking: 400 (request \"ATTRIBUTE_READ_ONLY\") The attribute createDate is read-only and cannot be changed by the user."}],"observedGeneration":1,"stateName":"UPDATING"}}
{"level":"INFO","time":"2022-07-15T10:22:38.851Z","msg":"Reading Atlas API credentials from the AtlasProject Secret platform/singletons-europe-west1.gcp.escemo.com-1-api-key","atlasdeployment":"platform/singletons-locking"}

We only found the actual difference when we ran the operator in debug mode and saw this in the logs:

{"level":"DEBUG","time":"2022-07-15T10:22:39.069Z","msg":"Deployments are different:   mongodbatlas.AdvancedCluster{\n  \t... // 13 identical fields\n  \tPitEnabled: &false,\n  \tStateName:  \"IDLE\",\n  \tReplicationSpecs: []*mongodbatlas.AdvancedReplicationSpec{\n  \t\t&{\n  \t\t\tNumShards: 1,\n  \t\t\tID:        \"618cf424edab8f1a9d8c444a\",\n  \t\t\tZoneName:  \"Zone 1\",\n  \t\t\tRegionConfigs: []*mongodbatlas.AdvancedRegionConfig{\n  \t\t\t\t&{\n  \t\t\t\t\tAnalyticsSpecs: &{InstanceSize: \"M10\", NodeCount: &0},\n  \t\t\t\t\tElectableSpecs: &mongodbatlas.Specs{\n  \t\t\t\t\t\tDiskIOPS:      nil,\n  \t\t\t\t\t\tEbsVolumeType: \"\",\n- \t\t\t\t\t\tInstanceSize:  \"M10\",\n+ \t\t\t\t\t\tInstanceSize:  \"M40\",\n  \t\t\t\t\t\tNodeCount:     &3,\n  \t\t\t\t\t},\n  \t\t\t\t\tReadOnlySpecs: &{InstanceSize: \"M10\", NodeCount: &0},\n  \t\t\t\t\tAutoScaling:   &{DiskGB: &{Enabled: &true}, Compute: &{Enabled: &true, ScaleDownEnabled: &true, MinInstanceSize: \"M10\", MaxInstanceSize: \"M40\"}},\n  \t\t\t\t\t... // 4 identical fields\n  \t\t\t\t},\n  \t\t\t},\n  \t\t},\n  \t},\n  \tCreateDate:           \"2021-11-11T10:44:52Z\",\n  \tRootCertType:         \"ISRGROOTX1\",\n  \tVersionReleaseSystem: \"LTS\",\n  }\n","atlasdeployment":"platform/singletons-locking"}

What did you expect?

  1. The AtlasDeployment should have been created successfully.
  2. If not, ideally the error message should be more appropriate.
  3. Also there could be cases where someone might want to change the instance size while upgrading.

Operator Information

kubectl describe output

Status:
  Conditions:
    Last Transition Time:  2022-07-15T10:25:16Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2022-07-14T16:26:30Z
    Status:                True
    Type:                  ValidationSucceeded
    Last Transition Time:  2022-07-15T10:25:08Z
    Status:                True
    Type:                  DeploymentReady
  Connection Strings:
    Standard:           mongodb://xxxx:27017,xxxxx:27017,xxxxx:27017/?ssl=true&authSource=admin&replicaSet=atlas-xx-shard-0
    Standard Srv:       mongodb+srv://xxxxxx
  Mongo DB Version:     4.4.15
  Observed Generation:  2
  State Name:           IDLE
Events:
  Type     Reason                       Age                     From             Message
  ----     ------                       ----                    ----             -------
  Warning  DeploymentNotUpdatedInAtlas  12m (x5875 over 17h)    AtlasDeployment  PATCH https://cloud.mongodb.com/api/atlas/v1.5/groups/xxxxxxxxxxxxxxx/clusters/locking: 400 (request "ATTRIBUTE_READ_ONLY") The attribute createDate is read-only and cannot be changed by the user.

sunchill06 commented 2 years ago

@fabritsius, I would like to draw your attention to this issue. This seems to be a serious problem. We observed one more problem recently:

A cluster was autoscaled from M10 -> M20. The Atlas Operator tried to reconcile this and failed with a similar error:

{"level":"INFO","time":"2022-07-19T09:03:30.344Z","msg":"-> Starting AtlasDeployment reconciliation","atlasdeployment":"platform/projects-projects-rs6","spec":{"projectRef":{"name":"projects-xxxxxxx","namespace":""},"advancedDeploymentSpec":{"backupEnabled":true,"diskSizeGB":10,"mongoDBMajorVersion":"4.4","name":"projects-rs6","pitEnabled":false,"replicationSpecs":[{"numShards":1,"regionConfigs":[{"electableSpecs":{"instanceSize":"M10"},"autoScaling":{"diskGBEnabled":true,"compute":{"enabled":true,"scaleDownEnabled":true,"minInstanceSize":"M10","maxInstanceSize":"M40"}},"providerName":"GCP","regionName":"WESTERN_EUROPE"}]}]},"backupRef":{"name":"","namespace":""}},"status":{"conditions":[{"type":"Ready","status":"False","lastTransitionTime":"2022-07-19T08:27:25Z"},{"type":"ValidationSucceeded","status":"True","lastTransitionTime":"2022-07-14T16:26:27Z"},{"type":"DeploymentReady","status":"False","lastTransitionTime":"2022-07-19T08:27:25Z","reason":"DeploymentNotUpdatedInAtlas","message":"PATCH https://cloud.mongodb.com/api/atlas/v1.5/groups/xxxxxxxxx/clusters/projects-rs6: 400 (request \"ATTRIBUTE_READ_ONLY\") The attribute createDate is read-only and cannot be changed by the user."}],"observedGeneration":1,"stateName":"IDLE","mongoDBVersion":"4.4.15","connectionStrings":{"standard":"mongodb://projects-rs6-xxxxxxx.mongodb.net:27017,projects-rs6-xxxxxxx:27017,projects-rs6-xxxxxxx:27017/?ssl=true&authSource=admin&replicaSet=atlas-vavo0p-shard-0","standardSrv":"mongodb+srv://projects-rs6.xxxxxxx"}}}
{"level":"INFO","time":"2022-07-19T09:03:30.344Z","msg":"Reading Atlas API credentials from the AtlasProject Secret platform/projects-xxxxxx-api-key","atlasdeployment":"platform/projects-projects-rs6"}
{"level":"INFO","time":"2022-07-19T09:03:30.842Z","msg":"Status update","atlasdeployment":"platform/projects-projects-rs6","lastCondition":{"type":"DeploymentReady","status":"False","lastTransitionTime":null,"reason":"DeploymentNotUpdatedInAtlas","message":"PATCH https://cloud.mongodb.com/api/atlas/v1.5/groups/xxxxxxx/clusters/projects-rs6: 400 (request \"ATTRIBUTE_READ_ONLY\") The attribute createDate is read-only and cannot be changed by the user."}}

Concerns:

  1. Should an auto-scaling event be reconciled by the operator? As there is nothing changed in the Deployment Spec from our side.
  2. Are we supposed to pass createDate as part of the Deployment Spec? If yes, then what about when we are creating a new cluster?
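
On the second concern, the error text itself says that createDate is read-only, so a client would normally leave it out of the request rather than have users supply it. A minimal Go sketch of that idea follows, for illustration only. The `Cluster` type and the `stripReadOnly` and `patchBody` functions are hypothetical and are not the operator's or the Atlas Go client's actual API. Only `createDate` is confirmed as read-only by the error above; treating `stateName` the same way is an assumption.

```go
package main

import "encoding/json"

// Cluster is a hypothetical, trimmed-down stand-in for the cluster payload.
// It is NOT the type used by the operator or the Atlas Go client.
type Cluster struct {
	Name       string `json:"name,omitempty"`
	DiskSizeGB *int   `json:"diskSizeGB,omitempty"`
	CreateDate string `json:"createDate,omitempty"` // read-only per the 400 error
	StateName  string `json:"stateName,omitempty"`  // assumed to be read-only as well
}

// stripReadOnly returns a copy with server-populated fields cleared, so that
// omitempty drops them from the serialized request body.
func stripReadOnly(c Cluster) Cluster {
	c.CreateDate = ""
	c.StateName = ""
	return c
}

// patchBody serializes the cluster the way a PATCH request body could be built.
func patchBody(c Cluster) (string, error) {
	b, err := json.Marshal(stripReadOnly(c))
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

With this approach the same struct can hold what the API returns and what is sent back, and a newly created cluster simply has no createDate to strip.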

Any efforts to prioritise this would be highly appreciated as this seems like a blocker for us. I would be happy to share any further details if required.

fabritsius commented 2 years ago

Hey @sunchill06,

Thanks for reporting this 🙏 Looks like a bug, will fix it ASAP.

fabritsius commented 2 years ago

I've just merged PR #615, which should solve this issue. We are planning a release this week. Feel free to reopen this if the issue persists. Thanks 🙏

sunchill06 commented 2 years ago

Thanks @fabritsius. This would be really appreciated as we are unable to progress the rollout because of this issue.

yzdann commented 2 years ago

Hey @fabritsius, thanks a lot for your work on this issue! 🚀

One thing about the first question that @sunchill06 asked before:

Should an auto-scaling event be reconciled by the operator? As there is nothing changed in the Deployment Spec from our side.

Please correct me if I'm wrong, but I think we still need some logic to unset disk size and instance size on patch requests to a cluster with compute/disk autoscaling!

As far as I understand, PR https://github.com/mongodb/mongodb-atlas-kubernetes/pull/615 only fixes the createDate problem.
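
The logic suggested above could look roughly like the following Go sketch. It is an illustration under assumptions, not the operator's implementation: the types are hypothetical, simplified mirrors of the field names in the spec from this issue, and `forPatch` is an invented helper. Whether the Atlas API accepts a PATCH that omits these fields is not verified here.

```go
package main

// Hypothetical, simplified types that mirror field names from the spec in
// this issue. They are NOT the operator's or the Atlas Go client's types.
type Compute struct {
	Enabled bool
}

type AutoScaling struct {
	Compute       *Compute
	DiskGBEnabled bool
}

type RegionConfig struct {
	InstanceSize string // electableSpecs.instanceSize; "" means "do not send"
	AutoScaling  *AutoScaling
}

type Deployment struct {
	DiskSizeGB    *int // nil means "do not send"
	RegionConfigs []RegionConfig
}

// forPatch returns a copy of the desired state with the fields that
// autoscaling manages removed. The input is left untouched.
func forPatch(d Deployment) Deployment {
	out := Deployment{DiskSizeGB: d.DiskSizeGB}
	// Copy the slice so edits below do not leak into the caller's spec.
	out.RegionConfigs = append([]RegionConfig(nil), d.RegionConfigs...)
	for i := range out.RegionConfigs {
		as := out.RegionConfigs[i].AutoScaling
		if as == nil {
			continue
		}
		if as.Compute != nil && as.Compute.Enabled {
			// Compute autoscaling picks the instance size.
			out.RegionConfigs[i].InstanceSize = ""
		}
		if as.DiskGBEnabled {
			// Disk autoscaling picks the disk size.
			out.DiskSizeGB = nil
		}
	}
	return out
}
```

With this approach, a cluster that Atlas scaled from M10 to M20 would no longer look different from a spec that still says M10, because the instance size is not part of what gets compared or sent.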

sunchill06 commented 2 years ago

Hey @fabritsius, when can we expect the release containing this fix please?