Do you happen to have a minimal YAML that reproduces the issue?
If not, we can try to put one together. Let me check whether this is the setup that would mimic yours:
Another question: was 2.4.1 also failing for you?
I am currently suspecting PR #1856
Thank you for the quick responses.
The very short version to replicate this is as follows (I had to redact some things):
We have one AtlasProject in the same namespace as the operator:
apiVersion: atlas.mongodb.com/v1
kind: AtlasProject
metadata:
  name: mongodb-atlas-project
spec:
  name: {{ $teamName }}
  teams:
    - teamRef:
        name: {{ $teamName }}
      roles:
        - GROUP_DATA_ACCESS_READ_WRITE
        - GROUP_OWNER
  connectionSecretRef:
    name: atlas-secret
  maintenanceWindow: {}
  projectIpAccessList:
    {{- toYaml .Values.projectIpAccessList | nindent 2 }}
  withDefaultAlertsSettings: true
Then, in n namespaces, we apply the following (simplified here to two users):
apiVersion: atlas.mongodb.com/v1
kind: AtlasDeployment
metadata:
  annotations:
    mongodb.com/last-applied-configuration: 'redacted'
  creationTimestamp: "2024-08-23T18:52:32Z"
  finalizers:
    - mongodbatlas/finalizer
  generation: 2
  name: atlas
  namespace: tenant-namespace
  resourceVersion: "redacted"
  uid: redacted
spec:
  backupRef:
    name: ""
    namespace: ""
  projectRef:
    name: mongodb-atlas-project
    namespace: mongodb-atlas
  serverlessSpec:
    backupOptions:
      serverlessContinuousBackupEnabled: true
    name: unique-cluster-name-over-ns
    providerSettings:
      backingProviderName: AWS
      providerName: SERVERLESS
      regionName: EU_WEST_1
    tags:
      - key: application
        value: the-application-name
    terminationProtectionEnabled: true
---
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  annotations:
    mongodb.com/atlas-resource-version-policy: allow
    mongodb.com/last-applied-configuration: 'redacted'
  creationTimestamp: "2024-08-23T18:54:20Z"
  finalizers:
    - mongodbatlas/finalizer
  generation: 1
  name: user1-unique-over-ns
  namespace: tenant-namespace
  resourceVersion: "129059651"
  uid: 5f7d8843-815a-495c-8e0f-9aac6f1bbf8d
spec:
  awsIamType: NONE
  databaseName: admin
  oidcAuthType: NONE
  passwordSecretRef:
    name: password1
  projectRef:
    name: mongodb-atlas-project
    namespace: mongodb-atlas
  roles:
    - databaseName: database1
      roleName: readWrite
  scopes:
    - name: unique-cluster-name-over-ns
      type: CLUSTER
  username: user1-unique-over-ns
  x509Type: NONE
---
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  annotations:
    mongodb.com/atlas-resource-version-policy: allow
    mongodb.com/last-applied-configuration: 'redacted'
  creationTimestamp: "2024-08-23T18:54:21Z"
  finalizers:
    - mongodbatlas/finalizer
  generation: 1
  name: user2-unique-over-ns
  namespace: tenant-namespace
spec:
  awsIamType: NONE
  databaseName: admin
  oidcAuthType: NONE
  passwordSecretRef:
    name: password2
  projectRef:
    name: mongodb-atlas-project
    namespace: mongodb-atlas
  roles:
    - databaseName: database2
      roleName: readWrite
  scopes:
    - name: unique-cluster-name-over-ns
      type: CLUSTER
  username: user2-unique-over-ns
  x509Type: NONE
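For reference, each of these users normally gets its own connection secret from the operator in tenant-namespace. From memory (the exact secret name, label, and key names may differ slightly from what the operator actually writes), one of them looks roughly like this, with values redacted and shown as stringData for readability:

apiVersion: v1
kind: Secret
metadata:
  # Name is derived from <atlas-project-name>-<deployment-name>-<username>, lowercased
  name: redacted-project-unique-cluster-name-over-ns-user1-unique-over-ns
  namespace: tenant-namespace
  labels:
    # Label from memory; it marks the secrets the operator manages
    atlas.mongodb.com/type: credentials
type: Opaque
stringData:
  connectionStringStandard: redacted
  connectionStringStandardSrv: redacted
  username: user1-unique-over-ns
  password: redacted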
I am currently suspecting PR #1856
I agree.
And 2.4.1 was also failing for us (same behavior).
Wait, if 2.4.1 was also failing, then that PR cannot be the cause. The PR was merged on Oct 11, and v2.4.1 was released earlier, in August.
Anyway, thanks for the sample YAMLs; I will try to reproduce with them. Don't worry too much about the details in them. I mainly need to reproduce where the resources live and, most likely, how the Kubernetes resources reference each other (same namespace, across namespaces, etc.) so that I hit the same issue.
I suspect the issue might be that the project is not in the same namespace as the secrets, so the code might believe they are somehow unused. Still, that would not yet explain why 2.4.1 was also broken.
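To make sure I reproduce the right thing, this is the layout I take away from your YAMLs (condensed):

# Operator namespace
mongodb-atlas:
  - AtlasProject mongodb-atlas-project
  - Secret atlas-secret                  # Atlas API credentials (connectionSecretRef)
# One of the n tenant namespaces
tenant-namespace:
  - AtlasDeployment atlas                # serverless, projectRef -> mongodb-atlas/mongodb-atlas-project
  - AtlasDatabaseUser user1-unique-over-ns
  - AtlasDatabaseUser user2-unique-over-ns
  - Secret password1 / password2         # user password secrets
  - operator-created connection Secrets  # the ones that get deleted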
I have been able to reproduce this; I will be working on it shortly.
What did you do to encounter the bug?
We upgraded mongodb-atlas-kubernetes to version 2.5.0 (we came from version 2.3.1). This failed pretty hard and caused downtime for our customers: somehow the connection secrets were removed for most of our AtlasDatabaseUser objects, which broke the deployments that depend on them. Reverting to 2.3.1 resolved the issues.
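To illustrate the dependency: our tenant workloads read the connection string from the operator-created connection secrets, roughly like the sketch below (all names are placeholders and the secret key name is from memory, so treat it as approximate):

# Simplified sketch of a tenant workload that depends on an operator-created
# connection secret; not our real manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: tenant-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:latest
          env:
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  # Connection secret created by the operator for user1;
                  # key name recalled from the secret contents.
                  name: redacted-project-unique-cluster-name-over-ns-user1-unique-over-ns
                  key: connectionStringStandardSrv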
What did you expect?
The first reconciliation with 2.5.0 removed random connection secrets. We have a setup where:
What happened instead?
After 2.5.0 started we somehow had only one connection secret per namespace (instead of 5), and we saw weird behavior. It looked like the secrets were deleted, and enabling the debug logs confirmed that this was indeed the case (see the next paragraph).
Screenshots
Operator Information
Operator 2.3.1 does work (we downgraded). Operator 2.5.0 does not.
Kubernetes Cluster Information
We run Kubernetes 1.30 in EKS.
Probable cause
If this is all true, then I do have a request: could you please add a warning-level log line whenever connection secrets are deleted? I tested this upgrade carefully on my test cluster, and with such a log line I would have seen the deletions there.
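For completeness, this is roughly how we understand debug logging can be enabled on the operator (the flag name and container name are from memory, so please verify them against the operator version in use). As a fragment of the operator Deployment spec:

# Fragment of the operator Deployment spec with debug logging enabled.
# The --log-level flag and the container name are assumptions from memory;
# keep the operator's other default arguments in place when changing args.
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --log-level=debug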