Do you happen to have a minimal YAML that reproduces the issue?
If not, we can try to put one together. Let me check whether this is the setup that would mimic yours:
Another question: was 2.4.1 also failing for you?
I am currently suspecting PR #1856
Thank you for the quick responses.
The very short version to replicate this is as follows (I had to redact some things):
We have one AtlasProject in the same namespace as the operator:
apiVersion: atlas.mongodb.com/v1
kind: AtlasProject
metadata:
  name: mongodb-atlas-project
spec:
  name: {{ $teamName }}
  teams:
    - teamRef:
        name: {{ $teamName }}
      roles:
        - GROUP_DATA_ACCESS_READ_WRITE
        - GROUP_OWNER
  connectionSecretRef:
    name: atlas-secret
  maintenanceWindow: {}
  projectIpAccessList:
    {{- toYaml .Values.projectIpAccessList | nindent 2 }}
  withDefaultAlertsSettings: true
Then, in n namespaces, we apply the following (simplified here to two users):
apiVersion: atlas.mongodb.com/v1
kind: AtlasDeployment
metadata:
  annotations:
    mongodb.com/last-applied-configuration: 'redacted'
  creationTimestamp: "2024-08-23T18:52:32Z"
  finalizers:
    - mongodbatlas/finalizer
  generation: 2
  name: atlas
  namespace: tenant-namespace
  resourceVersion: "redacted"
  uid: redacted
spec:
  backupRef:
    name: ""
    namespace: ""
  projectRef:
    name: mongodb-atlas-project
    namespace: mongodb-atlas
  serverlessSpec:
    backupOptions:
      serverlessContinuousBackupEnabled: true
    name: unique-cluster-name-over-ns
    providerSettings:
      backingProviderName: AWS
      providerName: SERVERLESS
      regionName: EU_WEST_1
    tags:
      - key: application
        value: the-application-name
    terminationProtectionEnabled: true
---
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  annotations:
    mongodb.com/atlas-resource-version-policy: allow
    mongodb.com/last-applied-configuration: 'redacted'
  creationTimestamp: "2024-08-23T18:54:20Z"
  finalizers:
    - mongodbatlas/finalizer
  generation: 1
  name: user1-unique-over-ns
  namespace: tenant-namespace
  resourceVersion: "129059651"
  uid: 5f7d8843-815a-495c-8e0f-9aac6f1bbf8d
spec:
  awsIamType: NONE
  databaseName: admin
  oidcAuthType: NONE
  passwordSecretRef:
    name: password1
  projectRef:
    name: mongodb-atlas-project
    namespace: mongodb-atlas
  roles:
    - databaseName: database1
      roleName: readWrite
  scopes:
    - name: unique-cluster-name-over-ns
      type: CLUSTER
  username: user1-unique-over-ns
  x509Type: NONE
---
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  annotations:
    mongodb.com/atlas-resource-version-policy: allow
    mongodb.com/last-applied-configuration: 'redacted'
  creationTimestamp: "2024-08-23T18:54:21Z"
  finalizers:
    - mongodbatlas/finalizer
  generation: 1
  name: user2-unique-over-ns
  namespace: tenant-namespace
spec:
  awsIamType: NONE
  databaseName: admin
  oidcAuthType: NONE
  passwordSecretRef:
    name: password2
  projectRef:
    name: mongodb-atlas-project
    namespace: mongodb-atlas
  roles:
    - databaseName: database2
      roleName: readWrite
  scopes:
    - name: unique-cluster-name-over-ns
      type: CLUSTER
  username: user2-unique-over-ns
  x509Type: NONE
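For reference, each of these users normally gets its own connection secret from the operator in tenant-namespace. From memory (the exact secret name, label, and key names may differ slightly from what the operator actually writes), one of them looks roughly like this, with values redacted and shown as stringData for readability:

apiVersion: v1
kind: Secret
metadata:
  # Name is derived from <atlas-project-name>-<deployment-name>-<username>, lowercased
  name: redacted-project-unique-cluster-name-over-ns-user1-unique-over-ns
  namespace: tenant-namespace
  labels:
    # Label from memory; it marks the secrets the operator manages
    atlas.mongodb.com/type: credentials
type: Opaque
stringData:
  connectionStringStandard: redacted
  connectionStringStandardSrv: redacted
  username: user1-unique-over-ns
  password: redacted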
I am currently suspecting PR #1856
I agree.
And 2.4.1 was also failing for us (same behavior).
Wait, if 2.4.1 was also failing, then that PR cannot be the cause. The PR was merged on Oct 11, and v2.4.1 was released earlier, in August.
Anyway, thanks for the sample YAMLs; I will try to reproduce with them. Don't worry too much about the details in them. I mainly need to reproduce where the resources live and, most likely, how the Kubernetes resources reference each other (same namespace, across namespaces, etc.) so that I hit the same issue.
I suspect the issue might be that the project is not in the same namespace as the secrets, so the code might believe they are somehow unused. Still, that would not yet explain why 2.4.1 was also broken.
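To make sure I reproduce the right thing, this is the layout I take away from your YAMLs (condensed):

# Operator namespace
mongodb-atlas:
  - AtlasProject mongodb-atlas-project
  - Secret atlas-secret                  # Atlas API credentials (connectionSecretRef)
# One of the n tenant namespaces
tenant-namespace:
  - AtlasDeployment atlas                # serverless, projectRef -> mongodb-atlas/mongodb-atlas-project
  - AtlasDatabaseUser user1-unique-over-ns
  - AtlasDatabaseUser user2-unique-over-ns
  - Secret password1 / password2         # user password secrets
  - operator-created connection Secrets  # the ones that get deleted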
I have been able to reproduce this; I will be working on it shortly.
What did you do to encounter the bug?
We upgraded mongodb-atlas-kubernetes to version 2.5.0 (we came from version 2.3.1). This failed pretty hard and caused downtime for our customers: somehow the connection secrets were removed for most of our AtlasDatabaseUser objects, which broke the deployments that depend on them. Reverting to 2.3.1 resolved the issues.
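To illustrate the dependency: our tenant workloads read the connection string from the operator-created connection secrets, roughly like the sketch below (all names are placeholders and the secret key name is from memory, so treat it as approximate):

# Simplified sketch of a tenant workload that depends on an operator-created
# connection secret; not our real manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: tenant-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:latest
          env:
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  # Connection secret created by the operator for user1;
                  # key name recalled from the secret contents.
                  name: redacted-project-unique-cluster-name-over-ns-user1-unique-over-ns
                  key: connectionStringStandardSrv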
What did you expect?
The first reconciliation with 2.5.0 removed random connection secrets. We have a setup where:
What happened instead?
After 2.5.0 started we somehow had only one connection secret per namespace (instead of 5), and we saw weird behavior. It looked like the secrets were deleted, and enabling the debug logs confirmed that this was indeed the case (see the next paragraph).
Screenshots
Operator Information
Operator 2.3.1 does work (we downgraded). Operator 2.5.0 does not.
Kubernetes Cluster Information
We run Kubernetes 1.30 in EKS.
Probable cause
If this is all true, then I do have a request: could you please add a warning-level log line whenever connection secrets are deleted? I tested this upgrade carefully on my test cluster, and with such a log line I would have seen the deletions there.
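For completeness, this is roughly how we understand debug logging can be enabled on the operator (the flag name and container name are from memory, so please verify them against the operator version in use). As a fragment of the operator Deployment spec:

# Fragment of the operator Deployment spec with debug logging enabled.
# The --log-level flag and the container name are assumptions from memory;
# keep the operator's other default arguments in place when changing args.
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --log-level=debug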