Closed: LoonyRules closed this issue 1 year ago
This is the same issue as others on here. Just confirmed it exists in the latest readinessprobe 1.0.8 and operator chart version 0.7.3. The readinessprobe executable fails with exit code 1. The binary that is copied from the mongodb-kubernetes-readinessprobe image is 38MB and outputs absolutely nothing, so it's very hard to debug. I can't find the source for it anywhere.
If you replace the binary with an empty script (so that exit code=0), everything starts working!
Note: My cluster is in namespace "production"
kubectl exec -ti mongodb-0 -n production -c mongodb-agent -- bash -c 'mv /opt/scripts/readinessprobe /opt/scripts/readinessprobe_old'
kubectl exec -ti mongodb-0 -n production -c mongodb-agent -- bash -c 'echo "#!/usr/bin/env bash" > /opt/scripts/readinessprobe'
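(Presumably the replacement also needs the execute bit, since the probe execs the file directly; something like the following chmod, which is not in the original commands:)
kubectl exec -ti mongodb-0 -n production -c mongodb-agent -- bash -c 'chmod +x /opt/scripts/readinessprobe'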
Removing the readiness probe is obviously not a solution for this. I'm guessing it's a bug in readinessprobe, but without any output or source code it's hard to say.
Sadly, even hacking around the file didn't work for me. To begin with, accessing the pod shell shows that the readinessprobe file simply does not exist. After running your commands, the new file still doesn't exist, and I'm unable to create any file in that directory myself either. Perhaps the underlying issue is permissions-related? Not sure...
I've also updated the Docker images to their latest tags, but to no avail. One thing I have noticed is that this is not a rare issue: some users of the Bitnami MongoDB Helm chart have the same underlying issue as this operator chart. One place I don't see this issue occurring is the enterprise operator, which you have to PAY for.
Are you sure you are in the correct container? I was pulling my hair out over this. A missing /opt/scripts folder was one of the symptoms I remember. If you don't specify the "mongodb-agent" container, it will take you to the "mongod" one which has no /opt/scripts.
If you have an /opt/scripts folder that is empty, it sounds like something really messed up with the init containers. In which case, your logs might help.
Doing my weekly look back at this issue to see if it's resolved and sadly it is not. Kind of sad that it seems to have been an issue for a long time and isn't being treated as crucial for this operator's functionality...
Just adding your commands to the commands the container executes on boot didn't result in a fix. Executing your commands directly results in an even worse error:
(combined from similar events): Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "<some random uid>": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "/opt/scripts/readinessprobe": stat /opt/scripts/readinessprobe: no such file or directory: unknown
Same issue occurs for me. Has anyone solved it?
As @philip-nicholls mentioned, you can try the scripts he made, but they didn't work for me, so I've quite literally had to disable readiness probes for Mongo completely.
@priyolahiri I can see that you triaged this issue; is there any ETA you can provide on its internal status? Leaving the community in the dark about such a crucial issue is only ever going to give MongoDB itself a BAD reputation.
How do you disable the readiness probe?
Best approach in the interim is to go back to v0.7.1, which seems reliable.
git clone --branch v0.7.1 --single-branch https://github.com/mongodb/mongodb-kubernetes-operator.git
What's happening with this? Is there a workaround? It's crippling me.
I exec'd into the pod and ran the readiness script, measuring the execution time:
I have no name!@gmt-importer-mongodb-1:/$ time /opt/scripts/readinessprobe
real 0m10.070s
user 0m0.033s
sys 0m0.006s
It turns out it's actually taking 10s to execute while the timeout was set to 1. Then I found this in the documentation:
Under some circumstances it might be necessary to set your own custom values for the ReadinessProbe used by the MongoDB Community Operator. To do so, use the statefulSet attribute in resource.spec, as in the following example YAML. Only the attributes passed will be set; for instance, given the following structure:
spec:
  statefulSet:
    spec:
      template:
        spec:
          containers:
            - name: mongodb-agent
              readinessProbe:
                failureThreshold: 40
                initialDelaySeconds: 5
                timeoutSeconds: 30
EDIT: Never mind, it worked for a while; now it's returning 1 and failing.
Did anyone manage to solve it?
Found the solution in issue https://github.com/mongodb/mongodb-kubernetes-operator/issues/651. At least in my situation, I was using the same SCRAM credentials secret name for multiple users. This configuration makes the operator keep regenerating the credentials, so it never completes reconciliation. After I started using different secrets, all was good: the readinessprobe started to work.
@nuvme-devops @BryanDollery @LoonyRules check whether you made the same mistake as I did; this might be the fix. A sketch of the corrected configuration follows.
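For reference, a minimal sketch of the users section of a MongoDBCommunity resource with each user pointing at its own scramCredentialsSecretName (the user and secret names here are hypothetical):
spec:
  users:
    - name: app-user                                # hypothetical user
      db: admin
      passwordSecretRef:
        name: app-user-password                     # secret holding this user's password
      roles:
        - name: readWrite
          db: appdb
      scramCredentialsSecretName: app-user-scram    # unique per user
    - name: reporting-user                          # hypothetical second user
      db: admin
      passwordSecretRef:
        name: reporting-user-password
      roles:
        - name: read
          db: appdb
      scramCredentialsSecretName: reporting-user-scram  # not shared with the first user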
@LoonyRules I ran into this issue on a GKE cluster...
The only thing that worked for me was setting up a completely new installation (a different name on the community CRD), so that it also set up a new PV, PVC, SVC, etc. instead of reusing anything existing on the node file system... I think there is some issue/conflict there that causes the MongoDB container to crash.
Just bumped into this issue when I started using the operator today.
The source for the readiness probe is here: https://github.com/mongodb/mongodb-kubernetes-operator/tree/master/cmd/readiness
I haven't had a chance to debug it yet, but will make a PR if I find the issue.
@guitcastro is right about the source of this problem
I put only one user in the manifest and the issue disappeared.
This issue is being marked stale because it has been open for 60 days with no activity. Please comment if this issue is still affecting you. If there is no change, this issue will be closed in 30 days.
Exactly same issue here.
We have a bunch of tests in this area, which seem to be fine. Having said that, I believe this is something in your environment. Could you please try to debug this further?
Also, please check the output of kubectl describe pod <a pod that has problems with the probe>. The readiness check output should be written somewhere in the events section.
@slaskawi have you tried to configure more than 1 user via manifest?
Will do. I might be able to reproduce all steps, from cluster creation (GKE) up to the ReplicaSet deployment. Is there anything else specifically that you'd like me to debug? Happy to help.
@moatorres I actually noticed that I probably gave you the wrong guidance on checking the readiness output. Instead of doing kubectl describe, please check the logs in /var/log/mongodb-mms-automation/readiness.log and also post the /var/log/mongodb-mms-automation/agent-health-status.json file.
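Something like this should pull both files out of the agent container (pod name and namespace are placeholders):
kubectl exec -n <namespace> <pod-name> -c mongodb-agent -- cat /var/log/mongodb-mms-automation/readiness.log
kubectl exec -n <namespace> <pod-name> -c mongodb-agent -- cat /var/log/mongodb-mms-automation/agent-health-status.json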
@slaskawi I am also having this problem, but only with one of the 3 instances of my MongoDB. /var/log/mongodb-mms-automation/readiness.log says:
2022-12-17T22:44:42.449Z INFO build/main.go:71 Mongod is not ready
2022-12-17T22:44:52.439Z INFO build/main.go:71 Mongod is not ready
2022-12-17T22:45:02.433Z INFO build/main.go:71 Mongod is not ready
and tail -f /healthstatus/agent-health-status.json says:
{"statuses":{"miles-davis-mongo-db-1":{"IsInGoalState":false,"LastMongoUpTime":0,"ExpectedToBeUp":true,"ReplicationStatus":-1}},"mmsStatus":{"miles-davis-mongo-db-1":{"name":"miles-davis-mongo-db-1","lastGoalVersionAchieved":-1,"plans":[{"started":"2022-12-17T21:45:55.077231307Z","completed":null,"moves":[{"move":"Start","moveDoc":"Start the process","steps":[{"step":"StartFresh","stepDoc":"Start a mongo instance (start fresh)","isWaitStep":false,"started":"2022-12-17T21:45:55.077256988Z","completed":null,"result":"error"}]},{"move":"WaitRsInit","moveDoc":"Wait for the replica set to be initialized by another member","steps":[{"step":"WaitRsInit","stepDoc":"Wait for the replica set to be initialized by another member","isWaitStep":true,"started":null,"completed":null,"result":""}]},{"move":"WaitFeatureCompatibilityVersionCorrect","moveDoc":"Wait for featureCompatibilityVersion to be right","steps":[{"step":"WaitFeatureCompatibilityVersionCorrect","stepDoc":"Wait for featureCompatibilityVersion to be right","isWaitStep":true,"started":null,"completed":null,"result":""}]}]}],"errorCode":0,"errorString":""}}}
Any help would be greatly appreciated.
@ammurdoch Looking at the logs, we can clearly see that Mongod didn't start successfully:
{"step":"StartFresh","stepDoc":"Start a mongo instance (start fresh)","isWaitStep":false,"started":"2022-12-17T21:45:55.077256988Z","completed":null,"result":"error"}
Could you please look into the /var/log/mongodb-mms-automation directory and check whether there's anything suspicious there? There should be some trace of why Mongod didn't start (or crashed).
Hey @slaskawi, I really appreciate your reply. I was able to solve my problem. I was experiencing something related to "Slow Application of Oplog Entries"; see https://www.mongodb.com/docs/manual/tutorial/troubleshoot-replica-sets/#slow-application-of-oplog-entries
Mongo wouldn't start up until it worked through half a year's worth of oplog entries. I imagine it was hitting some error at the end of the oplog and kept restarting every several hours. This particular node had restarted ~4000 times since November.
It's still unclear to me what exactly was going on, but the solution was to delete everything in that particular Mongo instance's /data directory and let it refresh its data from the other two nodes instead of from the oplog. Once I deleted everything in /data, it restarted and refreshed its data without any more trouble.
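For anyone in the same spot, roughly what that looked like (pod name is a placeholder; this wipes the member's data and forces a full initial sync, so only do it while the other members are healthy):
# wipe the broken member's data directory (assumes the volume is mounted at /data in the mongod container)
kubectl exec -ti <pod-name> -c mongod -- bash -c 'rm -rf /data/*'
# delete the pod so the StatefulSet recreates it and it resyncs from the other members
kubectl delete pod <pod-name>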
I'm glad this helped you @ammurdoch
Just for the future: if the readiness probe is causing problems and you're certain the deployment is fine (like in this case), please use the statefulSet override and disable the probe by making it return true. Alternatively, tweak its settings to match your needs.
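For example, something along these lines should disable it (a sketch using the statefulSet override shown earlier in the thread; swapping the probe's exec command for /bin/true is my assumption of the simplest way to make it always succeed, assuming /bin/true exists in the agent image):
spec:
  statefulSet:
    spec:
      template:
        spec:
          containers:
            - name: mongodb-agent
              readinessProbe:
                exec:
                  command:
                    - /bin/true    # always exits 0, effectively disabling the check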
I'm solving the ticket. Thanks for patience!
After fixing a privilege action error, the readinessprobe returned 0 instead of 1. Then I set readinessprobe.timeoutSeconds to 30, because it was only 1, which is very low; if you run the probe in a shell connected to the mongodb-agent container, it takes more than a second. Now it works.
I also use Kyverno and it seems that admission reports are being generated constantly.
I can confirm that a shared scramCredentialsSecretName was also the reason for my problems with the pods restarting. I had 3 users defined in the replica set YAML, all of which had the same value for the scramCredentialsSecretName property.
I'd been trying to solve this for 2 days, and I had noticed the connection secrets were not being created, but assumed the documentation was not up to date.
Thanks for the info.
Still facing the same issue. Any updates?
@steddyman Thank you. Found the solution here: #651. I was using the same secret for two users; removing the user resolved the problem.
@guitcastro Same problem here. Did you find out why the health check is so slow?
@cdivitotawela, my MongoDB Community replica set was in the pending state, and the MongoDB pods were failing to initialize the agent. The fix you suggested nailed it! I had also configured two users with the same SCRAM secret. Changing the secret name solved the issue. Now, the replica set is running, and the pods are healthy. Thank you!
What did you do to encounter the bug? Steps to reproduce the behavior: deploy a MongoDBCommunity CRD.
What did you expect? The ReplicaSet to boot successfully and the readiness and liveness probes to complete successfully.
What happened instead? A readiness probe warning comes up, then disappears. A minute later, after both pods have booted and the service is accessible via MongoDB Compass, a Kubernetes event is triggered claiming the readiness probe timed out. The pod does not restart and the replica set is still healthy, but this made Kubernetes unhappy. Longhorn then throws a fit due to the error and puts the logs volume in a read-only state.
Operator Information
mongod: docker.io/mongo:5.0.6
mongodb-agent: quay.io/mongodb/mongodb-agent:11.0.5.6963-1
Kubernetes Cluster Information
Additional context: I found an issue that is very similar to what I am coming across (https://github.com/mongodb/mongodb-kubernetes-operator/issues/668), except that increasing the readiness probe variables does nothing but push the failure further into the future. At some point, the readiness probe will still run and fail.
When accessing the pod shell of mongodb-0 or mongodb-1 (as they both fail), running /opt/scripts/readinessprobe returns a "no such file or directory" error, so I'm assuming the underlying issue could be that the readinessprobe script it's trying to run doesn't actually exist?
cat /var/log/mongodb-mms-automation/readiness.log returns a lot of the same output, which backs up my point that the replica is happy when this event occurs.
cat /healthstatus/agent-health-status.json results:
If possible, please include:
readiness probe time out event (note: this happens to both mongodb-0 and mongodb-1 pods at almost the same time, because that's when they booted)
mongodb CRD
database pods: the mongodb container just prints this every now and then, nothing else: