mongodb / mongodb-kubernetes-operator

MongoDB Community Kubernetes Operator

ReplicaSet always in "is not yet ready" - mongodb-agent shows "the Agent hasn't reported working on the new config yet, the last plan finished" #1241

Closed marandalucas closed 1 year ago

marandalucas commented 1 year ago

What did you do to encounter the bug?

We've got a MongoDB ReplicaSet that is failing to start up (it already contains important data).

$ kubectl get mongodbcommunity -o yaml
    message: ReplicaSet is not yet ready, retrying in 10 seconds

$ kubectl get po
mongodb-rs-srv-0    1/2     Running   0          45m

$ kubectl logs -f mongodb-rs-srv-0 -c mongod                
2023-03-14T16:37:27.437Z        INFO    versionhook/main.go:32  Running version change post-start hook
2023-03-14T16:37:27.438Z        INFO    versionhook/main.go:39  Waiting for agent health status...
2023-03-14T16:37:28.438Z        INFO    versionhook/main.go:45  Agent health status file not found, mongod will start
{"t":{"$date":"2023-03-14T16:37:33.462+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2023-03-14T16:37:33.464+00:00"},"s":"I",  "c":"NETWORK",  "id":4648601, "ctx":"main","msg":"Implicit TCP FastOpen unavailable. If TCP FastOpen is required, set tcpFastOpenServer, tcpFastOpenClient, and tcpFastOpenQueueSize."}
{"t":{"$date":"2023-03-14T16:37:33.524+00:00"},"s":"I",  "c":"STORAGE",  "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":1,"port":27017,"dbPath":"/data","architecture":"64-bit","host":"mongodb-rs-srv-0"}}
{"t":{"$date":"2023-03-14T16:37:33.524+00:00"},"s":"W",  "c":"CONTROL",  "id":20720,   "ctx":"initandlisten","msg":"Available memory is less than system memory","attr":{"availableMemSizeMB":4096,"systemMemSizeMB":64320}}
{"t":{"$date":"2023-03-14T16:37:33.524+00:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.18","gitVersion":"8ed32b5c2c68ebe7f8ae2ebe8d23f36037a17dea","openSSLVersion":"OpenSSL 1.1.1f  31 Mar 2020","modules":[],"allocator":"tcmalloc","environment":{"distmod":"ubuntu2004","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2023-03-14T16:37:33.524+00:00"},"s":"I",  "c":"CONTROL",  "id":51765,   "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Ubuntu","version":"20.04"}}}
{"t":{"$date":"2023-03-14T16:37:33.524+00:00"},"s":"I",  "c":"CONTROL",  "id":21951,   "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"config":"/data/automation-mongod.conf","net":{"bindIp":"0.0.0.0","port":27017},"replication":{"replSetName":"mongodb-rs-srv"},"security":{"authorization":"enabled","keyFile":"/var/lib/mongodb-mms-automation/authentication/keyfile"},"setParameter":{"authenticationMechanisms":"SCRAM-SHA-1,SCRAM-SHA-256"},"storage":{"dbPath":"/data","engine":"wiredTiger","wiredTiger":{"engineConfig":{"cacheSizeGB":0.25}}}}}}
{"t":{"$date":"2023-03-14T16:37:33.525+00:00"},"s":"I",  "c":"STORAGE",  "id":22297,   "ctx":"initandlisten","msg":"Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem","tags":["startupWarnings"]}
{"t":{"$date":"2023-03-14T16:37:33.525+00:00"},"s":"I",  "c":"STORAGE",  "id":22315,   "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=256M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],"}}
{"t":{"$date":"2023-03-14T16:37:34.035+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1678811854:35476][1:0x7fa6f749acc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 39 through 40"}}
{"t":{"$date":"2023-03-14T16:37:34.134+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1678811854:134580][1:0x7fa6f749acc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 40 through 40"}}
{"t":{"$date":"2023-03-14T16:37:34.268+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1678811854:268230][1:0x7fa6f749acc0], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Main recovery loop: starting at 39/79616 to 40/256"}}
{"t":{"$date":"2023-03-14T16:37:34.268+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1678811854:268866][1:0x7fa6f749acc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 39 through 40"}}
{"t":{"$date":"2023-03-14T16:37:34.339+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1678811854:339899][1:0x7fa6f749acc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 40 through 40"}}
{"t":{"$date":"2023-03-14T16:37:34.401+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1678811854:401257][1:0x7fa6f749acc0], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (1678806686, 4)"}}
{"t":{"$date":"2023-03-14T16:37:34.401+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1678811854:401315][1:0x7fa6f749acc0], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global oldest timestamp: (1678806686, 4)"}}
{"t":{"$date":"2023-03-14T16:37:34.436+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1678811854:436634][1:0x7fa6f749acc0], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 12, snapshot max: 12 snapshot count: 0, oldest timestamp: (1678806686, 4) , meta checkpoint timestamp: (1678806686, 4) base write gen: 1401200"}}
{"t":{"$date":"2023-03-14T16:37:34.445+00:00"},"s":"I",  "c":"STORAGE",  "id":4795906, "ctx":"initandlisten","msg":"WiredTiger opened","attr":{"durationMillis":920}}
{"t":{"$date":"2023-03-14T16:37:34.445+00:00"},"s":"I",  "c":"RECOVERY", "id":23987,   "ctx":"initandlisten","msg":"WiredTiger recoveryTimestamp","attr":{"recoveryTimestamp":{"$timestamp":{"t":1678806686,"i":4}}}}
{"t":{"$date":"2023-03-14T16:37:34.456+00:00"},"s":"I",  "c":"STORAGE",  "id":22383,   "ctx":"initandlisten","msg":"The size storer reports that the oplog contains","attr":{"numRecords":3240673,"dataSize":1022697730}}
{"t":{"$date":"2023-03-14T16:37:34.456+00:00"},"s":"I",  "c":"STORAGE",  "id":22386,   "ctx":"initandlisten","msg":"Sampling the oplog to determine where to place markers for truncation"}
{"t":{"$date":"2023-03-14T16:37:34.458+00:00"},"s":"I",  "c":"STORAGE",  "id":22389,   "ctx":"initandlisten","msg":"Sampling from the oplog to determine where to place markers for truncation","attr":{"from":{"$timestamp":{"t":1677814072,"i":1}},"to":{"$timestamp":{"t":1678806692,"i":2}}}}
{"t":{"$date":"2023-03-14T16:37:34.458+00:00"},"s":"I",  "c":"STORAGE",  "id":22390,   "ctx":"initandlisten","msg":"Taking samples and assuming each oplog section contains","attr":{"numSamples":600,"containsNumRecords":53926,"containsNumBytes":17018069}}
{"t":{"$date":"2023-03-14T16:37:34.901+00:00"},"s":"I",  "c":"STORAGE",  "id":22393,   "ctx":"initandlisten","msg":"Oplog sampling complete"}
{"t":{"$date":"2023-03-14T16:37:34.901+00:00"},"s":"I",  "c":"STORAGE",  "id":22382,   "ctx":"initandlisten","msg":"WiredTiger record store oplog processing finished","attr":{"durationMillis":444}}
{"t":{"$date":"2023-03-14T16:37:34.976+00:00"},"s":"I",  "c":"STORAGE",  "id":22262,   "ctx":"initandlisten","msg":"Timestamp monitor starting"}
{"t":{"$date":"2023-03-14T16:37:35.177+00:00"},"s":"I",  "c":"STORAGE",  "id":20536,   "ctx":"initandlisten","msg":"Flow Control is enabled on this deployment"}
{"t":{"$date":"2023-03-14T16:37:35.178+00:00"},"s":"I",  "c":"SHARDING", "id":20997,   "ctx":"initandlisten","msg":"Refreshed RWC defaults","attr":{"newDefaults":{}}}
{"t":{"$date":"2023-03-14T16:37:35.183+00:00"},"s":"I",  "c":"FTDC",     "id":20625,   "ctx":"initandlisten","msg":"Initializing full-time diagnostic data capture","attr":{"dataDirectory":"/data/diagnostic.data"}}
{"t":{"$date":"2023-03-14T16:37:35.184+00:00"},"s":"I",  "c":"REPL",     "id":6015317, "ctx":"initandlisten","msg":"Setting new configuration state","attr":{"newState":"ConfigStartingUp","oldState":"ConfigPreStart"}}
{"t":{"$date":"2023-03-14T16:37:35.184+00:00"},"s":"I",  "c":"REPL",     "id":4280500, "ctx":"initandlisten","msg":"Attempting to create internal replication collections"}
{"t":{"$date":"2023-03-14T16:37:35.189+00:00"},"s":"I",  "c":"REPL",     "id":4280501, "ctx":"initandlisten","msg":"Attempting to load local voted for document"}
{"t":{"$date":"2023-03-14T16:37:35.189+00:00"},"s":"I",  "c":"REPL",     "id":4280502, "ctx":"initandlisten","msg":"Searching for local Rollback ID document"}
{"t":{"$date":"2023-03-14T16:37:35.191+00:00"},"s":"I",  "c":"REPL",     "id":21529,   "ctx":"initandlisten","msg":"Initializing rollback ID","attr":{"rbid":1}}
{"t":{"$date":"2023-03-14T16:37:35.191+00:00"},"s":"I",  "c":"REPL",     "id":4280504, "ctx":"initandlisten","msg":"Cleaning up any partially applied oplog batches & reading last op from oplog"}
{"t":{"$date":"2023-03-14T16:37:35.193+00:00"},"s":"I",  "c":"REPL",     "id":21544,   "ctx":"initandlisten","msg":"Recovering from stable timestamp","attr":{"stableTimestamp":{"$timestamp":{"t":1678806686,"i":4}},"topOfOplog":{"ts":{"$timestamp":{"t":1678806692,"i":2}},"t":9},"appliedThrough":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1},"oplogTruncateAfterPoint":{"$timestamp":{"t":0,"i":0}}}}
{"t":{"$date":"2023-03-14T16:37:35.193+00:00"},"s":"I",  "c":"REPL",     "id":21545,   "ctx":"initandlisten","msg":"Starting recovery oplog application at the stable timestamp","attr":{"stableTimestamp":{"$timestamp":{"t":1678806686,"i":4}}}}
{"t":{"$date":"2023-03-14T16:37:35.193+00:00"},"s":"I",  "c":"REPL",     "id":21550,   "ctx":"initandlisten","msg":"Replaying stored operations from startPoint (inclusive) to endPoint (inclusive)","attr":{"startPoint":{"$timestamp":{"t":1678806686,"i":4}},"endPoint":{"$timestamp":{"t":1678806692,"i":2}}}}
{"t":{"$date":"2023-03-14T16:37:35.197+00:00"},"s":"I",  "c":"REPL",     "id":21536,   "ctx":"initandlisten","msg":"Completed oplog application for recovery","attr":{"numOpsApplied":2,"numBatches":1,"applyThroughOpTime":{"ts":{"$timestamp":{"t":1678806692,"i":2}},"t":9}}}
{"t":{"$date":"2023-03-14T16:37:35.198+00:00"},"s":"I",  "c":"REPL",     "id":4280506, "ctx":"initandlisten","msg":"Reconstructing prepared transactions"}
{"t":{"$date":"2023-03-14T16:37:35.200+00:00"},"s":"I",  "c":"REPL",     "id":4280507, "ctx":"initandlisten","msg":"Loaded replica set config, scheduled callback to set local config"}
{"t":{"$date":"2023-03-14T16:37:35.200+00:00"},"s":"I",  "c":"REPL",     "id":4280508, "ctx":"ReplCoord-0","msg":"Attempting to set local replica set config; validating config for startup"}
{"t":{"$date":"2023-03-14T16:37:35.201+00:00"},"s":"I",  "c":"CONTROL",  "id":20714,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"NotYetInitialized: Replication has not yet been configured"}}
{"t":{"$date":"2023-03-14T16:37:35.201+00:00"},"s":"I",  "c":"CONTROL",  "id":20711,   "ctx":"LogicalSessionCacheReap","msg":"Failed to reap transaction table","attr":{"error":"NotYetInitialized: Replication has not yet been configured"}}
{"t":{"$date":"2023-03-14T16:37:35.201+00:00"},"s":"I",  "c":"REPL",     "id":40440,   "ctx":"initandlisten","msg":"Starting the TopologyVersionObserver"}
{"t":{"$date":"2023-03-14T16:37:35.201+00:00"},"s":"I",  "c":"REPL",     "id":40445,   "ctx":"TopologyVersionObserver","msg":"Started TopologyVersionObserver"}
{"t":{"$date":"2023-03-14T16:37:35.202+00:00"},"s":"I",  "c":"NETWORK",  "id":23015,   "ctx":"listener","msg":"Listening on","attr":{"address":"/tmp/mongodb-27017.sock"}}
{"t":{"$date":"2023-03-14T16:37:35.202+00:00"},"s":"I",  "c":"NETWORK",  "id":23015,   "ctx":"listener","msg":"Listening on","attr":{"address":"0.0.0.0"}}
{"t":{"$date":"2023-03-14T16:37:35.202+00:00"},"s":"I",  "c":"NETWORK",  "id":23016,   "ctx":"listener","msg":"Waiting for connections","attr":{"port":27017,"ssl":"off"}}
{"t":{"$date":"2023-03-14T16:37:35.202+00:00"},"s":"W",  "c":"NETWORK",  "id":21207,   "ctx":"ReplCoord-0","msg":"getaddrinfo() failed","attr":{"host":"mongodb-rs-srv-1.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local","error":"Name or service not known"}}
{"t":{"$date":"2023-03-14T16:37:35.203+00:00"},"s":"I",  "c":"NETWORK",  "id":4834700, "ctx":"ReplCoord-0","msg":"isSelf could not connect via connectSocketOnly","attr":{"hostAndPort":"mongodb-rs-srv-1.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"couldn't connect to server mongodb-rs-srv-1.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017, connection attempt failed: HostNotFound: Could not find address for mongodb-rs-srv-1.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017: SocketException: Host not found (authoritative)"}}}
{"t":{"$date":"2023-03-14T16:37:35.205+00:00"},"s":"W",  "c":"NETWORK",  "id":21207,   "ctx":"ReplCoord-0","msg":"getaddrinfo() failed","attr":{"host":"mongodb-rs-srv-2.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local","error":"Name or service not known"}}
{"t":{"$date":"2023-03-14T16:37:35.206+00:00"},"s":"I",  "c":"NETWORK",  "id":4834700, "ctx":"ReplCoord-0","msg":"isSelf could not connect via connectSocketOnly","attr":{"hostAndPort":"mongodb-rs-srv-2.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017","error":{"code":6,"codeName":"HostUnreachable","errmsg":"couldn't connect to server mongodb-rs-srv-2.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017, connection attempt failed: HostNotFound: Could not find address for mongodb-rs-srv-2.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017: SocketException: Host not found (authoritative)"}}}
{"t":{"$date":"2023-03-14T16:37:35.206+00:00"},"s":"I",  "c":"REPL",     "id":4280509, "ctx":"ReplCoord-0","msg":"Local configuration validated for startup"}
{"t":{"$date":"2023-03-14T16:37:35.206+00:00"},"s":"I",  "c":"REPL",     "id":6015317, "ctx":"ReplCoord-0","msg":"Setting new configuration state","attr":{"newState":"ConfigSteady","oldState":"ConfigStartingUp"}}
{"t":{"$date":"2023-03-14T16:37:35.206+00:00"},"s":"I",  "c":"REPL",     "id":21392,   "ctx":"ReplCoord-0","msg":"New replica set config in use","attr":{"config":{"_id":"mongodb-rs-srv","version":94294,"term":9,"protocolVersion":1,"writeConcernMajorityJournalDefault":true,"members":[{"_id":0,"host":"mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1},{"_id":1,"host":"mongodb-rs-srv-1.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1},{"_id":2,"host":"mongodb-rs-srv-2.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1}],"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"catchUpTakeoverDelayMillis":30000,"getLastErrorModes":{},"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":{"$oid":"63f681bf77011d899713397c"}}}}}
{"t":{"$date":"2023-03-14T16:37:35.206+00:00"},"s":"I",  "c":"REPL",     "id":21393,   "ctx":"ReplCoord-0","msg":"Found self in config","attr":{"hostAndPort":"mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017"}}
{"t":{"$date":"2023-03-14T16:37:35.206+00:00"},"s":"I",  "c":"REPL",     "id":21358,   "ctx":"ReplCoord-0","msg":"Replica set state transition","attr":{"newState":"STARTUP2","oldState":"STARTUP"}}
{"t":{"$date":"2023-03-14T16:37:35.216+00:00"},"s":"I",  "c":"REPL",     "id":21320,   "ctx":"ReplCoord-0","msg":"Updated term","attr":{"term":9}}
{"t":{"$date":"2023-03-14T16:37:35.216+00:00"},"s":"I",  "c":"REPL",     "id":21306,   "ctx":"ReplCoord-0","msg":"Starting replication storage threads"}
{"t":{"$date":"2023-03-14T16:37:35.216+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"ReplNetwork","msg":"Connecting","attr":{"hostAndPort":"mongodb-rs-srv-1.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017"}}

I think the root cause is that the readiness probe running in the "mongodb-agent" container is failing:

$ kubectl exec -ti mongodb-rs-srv-0 -c mongodb-agent -- bash

$ cat /var/log/mongodb-mms-automation/healthstatus/readiness.log
{"level":"debug","ts":1678799046.6867993,"msg":"The Agent hasn't reported working on the new config yet, the last plan finished at 2023-03-14T12:58:50Z"}
{"level":"debug","ts":1678799056.7024765,"msg":"The Agent hasn't reported working on the new config yet, the last plan finished at 2023-03-14T12:58:50Z"}
{"level":"debug","ts":1678799056.7925813,"msg":"The Agent hasn't reported working on the new config yet, the last plan finished at 2023-03-14T12:58:50Z"}
{"level":"debug","ts":1678799066.6996677,"msg":"The Agent hasn't reported working on the new config yet, the last plan finished at 2023-03-14T12:58:50Z"}
{"level":"debug","ts":1678799076.7003417,"msg":"The Agent hasn't reported working on the new config yet, the last plan finished at 2023-03-14T12:58:50Z"} 

$ cat /var/log/mongodb-mms-automation/healthstatus/agent-health-status.json
{"statuses":{"mongodb-rs-srv-0":{"IsInGoalState":true,"LastMongoUpTime":1678809801,"ExpectedToBeUp":true,"ReplicationStatus":2}},"mmsStatus":{"mongodb-rs-srv-0":{"name":"mongodb-rs-srv-0","lastGoalVersionAchieved":1703,"plans":[{"automationConfigVersion":1418,"started":"2023-03-14T15:12:41.134133861Z","completed":"2023-03-14T15:12:48.339057578Z","moves":[{"move":"Start","moveDoc":"Start the process","steps":[{"step":"StartFresh","stepDoc":"Start a mongo instance  (start fresh)","isWaitStep":false,"started":"2023-03-14T15:12:41.134151424Z","completed":"2023-03-14T15:12:48.011754987Z","result":"success"}]},{"move":"WaitAllRsMembersUp","moveDoc":"Wait until all members of this process' repl set are up","steps":[{"step":"WaitAllRsMembersUp","stepDoc":"Wait until all members of this process' repl set are up","isWaitStep":true,"started":"2023-03-14T15:12:48.011803514Z","completed":null,"result":"wait"}]},{"move":"RsInit","moveDoc":"Initialize a replica set including the current MongoDB process","steps":[{"step":"RsInit","stepDoc":"Initialize a replica set","isWaitStep":false,"started":null,"completed":null,"result":""}]},{"move":"WaitFeatureCompatibilityVersionCorrect","moveDoc":"Wait for featureCompatibilityVersion to be right","steps":[{"step":"WaitFeatureCompatibilityVersionCorrect","stepDoc":"Wait for featureCompatibilityVersion to be right","isWaitStep":true,"started":null,"completed":null,"result":""}]}]}],"errorCode":0,"errorString":""}}}

$ cat /var/lib/automation/config/cluster-config.json 
{"version":1744,"processes":[{"name":"mongodb-rs-srv-0","disabled":false,"hostname":"mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local","args2_6":{"net":{"port":27017},"replication":{"replSetName":"mongodb-rs-srv"},"storage":{"dbPath":"/data","wiredTiger":{"engineConfig":{"cacheSizeGB":0.25}}}},"featureCompatibilityVersion":"4.4","processType":"mongod","version":"4.4.18","authSchemaVersion":5},{"name":"mongodb-rs-srv-1","disabled":false,"hostname":"mongodb-rs-srv-1.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local","args2_6":{"net":{"port":27017},"replication":{"replSetName":"mongodb-rs-srv"},"storage":{"dbPath":"/data","wiredTiger":{"engineConfig":{"cacheSizeGB":0.25}}}},"featureCompatibilityVersion":"4.4","processType":"mongod","version":"4.4.18","authSchemaVersion":5},{"name":"mongodb-rs-srv-2","disabled":false,"hostname":"mongodb-rs-srv-2.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local","args2_6":{"net":{"port":27017},"replication":{"replSetName":"mongodb-rs-srv"},"storage":{"dbPath":"/data","wiredTiger":{"engineConfig":{"cacheSizeGB":0.25}}}},"featureCompatibilityVersion":"4.4","processType":"mongod","version":"4.4.18","authSchemaVersion":5}],"replicaSets":[{"_id":"mongodb-rs-srv","members":[{"_id":0,"host":"mongodb-rs-srv-0","priority":1,"arbiterOnly":false,"votes":1},{"_id":1,"host":"mongodb-rs-srv-1","priority":1,"arbiterOnly":false,"votes":1},{"_id":2,"host":"mongodb-rs-srv-2","priority":1,"arbiterOnly":false,"votes":1}],"protocolVersion":"1","numberArbiters":0}],"auth":{"usersWanted":[{"mechanisms":[],"roles":[{"role":"root","db":"admin"}],"user":"mongodbuser","db":"admin","authenticationRestrictions":[],"scramSha256Creds":{"iterationCount":15000,"salt":"soqPW2xbJHC15qlmme0HSy6Wo0M1vFgiD1jwdQ==","serverKey":"3p915chV6xKBkEZOEUQR0KYngBhIUSGQxQ9xpsHOwcw=","storedKey":"ec1fgE6vhkQOA/BzNLiyb7rWpnVWW7cC85VNF7ws8qM="},"scramSha1Creds":{"iterationCount":10000,"salt":"OvqKQEgmfic4PVWgCEcUxg==","serverKey":"iD2+U1LifrntktZqz8jDTNpbd8s=","storedKey":"FOeRqSCZ3/u35YFt+Fsq6JFgqVI="}},{"mechanisms":[],"roles":[{"role":"root","db":"admin"}],"user":"appuser","db":"admin","authenticationRestrictions":[],"scramSha256Creds":{"iterationCount":15000,"salt":"Wi1ZucUzvA4Q8gNJXiri0TeHxdvrZhIEWu6q4g==","serverKey":"QUcVwUF67XHsmF/oE35O46HBi4PpucmOtlRwLlNVVqM=","storedKey":"oa5vyuT17ujc1eF+KhP4OD5o0bdJ+KbA8dQv2mvzQzE="},"scramSha1Creds":{"iterationCount":10000,"salt":"7GVyo+7b/ltEEJvsqj/sJA==","serverKey":"R8IIf0HgmM4OiXbjjQPghJLn5Mw=","storedKey":"0dn1R9ZG4wN2G5qQ01pzA/hQi5U="}}],"disabled":false,"authoritativeSet":true,"autoAuthMechanisms":["MONGODB-CR","SCRAM-SHA-256"],"autoAuthMechanism":"SCRAM-SHA-256","deploymentAuthMechanisms":["MONGODB-CR","SCRAM-SHA-256"],"autoUser":"mms-automation","key":"rjWoth3LExL3erFs7Fe7JUmIh63TEXN2hl4Dg6R4BH6FJNSHiM6vBDfe58W/fVQsGpLUD9PMxP5yANnAvApKvvRHR2dklBZi/Vu4rMVUHQkq8cL1yirlDPhR/B7yeiXJZDxh8y+9OfDckqe2ihxgWr+Y7J1sXqo69o8sRiM+2VNsbpfU4U+/neOZriPR2cIkUbuFBiGigJPw/lZRapurSgyh9P7cZsjor2482hd9Cqihw9ayG8pXEyy59AoQKLomEJB85mRYrHx1ADeZyR0j3ndF1G7kkYUJZoEvIbsfyDh5298/Xr8bjySxo7OUSrG0wuw3XXjaeV38jAD7iQ5EWHiUomCcZAMLtS5byE+dbm7K7HgrPK1iWbl1dc+StxuLdULV/HbUyjAb0C/wpEZ26DA4jYBwQsw0W46hAMrB+yEEGhf+shnmho4B3AsbKEfxNzmI+0V7TA65SlGF1oR5aFWeplqB3ajV+G6PzVIzDyQxlz5Gsf51mZbmCA4izoZlPaasNLZL0TsnJPe0hN2xvthGdwFxKsxSNgDiFqt4Coy7Oy9gmAO9mGt7714v4OTyUqhNvVI1ZhhbI59q83KG3sG8uf87cG2YYAEEL7mc+edxwTiCbOfbdeG/h7/y837hn0culEw3mhHri12p
tdIsEaT87SE=","keyfile":"/var/lib/mongodb-mms-automation/authentication/keyfile","keyfileWindows":"%SystemDrive%\\MMSAutomation\\versions\\keyfile","autoPwd":"wuPLtcnlhT3-2BjXoJWe"},"tls":{"CAFilePath":"","clientCertificateMode":"OPTIONAL"},"mongoDbVersions":[{"name":"4.4.18","builds":[{"platform":"linux","url":"","gitVersion":"","architecture":"amd64","flavor":"rhel","minOsVersion":"","maxOsVersion":"","modules":[]},{"platform":"linux","url":"","gitVersion":"","architecture":"amd64","flavor":"ubuntu","minOsVersion":"","maxOsVersion":"","modules":[]}]}],"backupVersions":[],"monitoringVersions":[],"options":{"downloadBase":"/var/lib/mongodb-mms-automation"}}
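
Note the version mismatch between this config and the health status above: the cluster config the operator pushed is at "version":1744, while the agent's lastGoalVersionAchieved is 1703. As far as I understand it (treat this as an assumption about the probe's internals), the readinessprobe compares exactly these two values. From inside the agent container, if jq is available there (otherwise kubectl cp the files out first):

$ jq .version /var/lib/automation/config/cluster-config.json
$ jq '.mmsStatus["mongodb-rs-srv-0"].lastGoalVersionAchieved' /var/log/mongodb-mms-automation/healthstatus/agent-health-status.json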

$ cat /var/log/mongodb-mms-automation/automation-agent.log
[2023-03-14T15:12:40.623+0000] [header.info] [::0]        GitCommitId = 25bb5320d7087c7aa24eb6118df217a028238723
[2023-03-14T15:12:40.623+0000] [header.info] [::0]  AutomationVersion = 12.0.15.7646
[2023-03-14T15:12:40.623+0000] [header.info] [::0]          localhost = mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local
[2023-03-14T15:12:40.623+0000] [header.info] [::0] ErrorStateSleepTime = 10s
[2023-03-14T15:12:40.623+0000] [header.info] [::0] GoalStateSleepTime = 10s
[2023-03-14T15:12:40.623+0000] [header.info] [::0] NotGoalStateSleepTime = 1s
[2023-03-14T15:12:40.623+0000] [header.info] [::0]     PlanCutoffTime = 300000
[2023-03-14T15:12:40.623+0000] [header.info] [::0]       TracePlanner = false
[2023-03-14T15:12:40.623+0000] [header.info] [::0]               User = mongodb
[2023-03-14T15:12:40.623+0000] [header.info] [::0]         Go version = go1.18.5
[2023-03-14T15:12:40.623+0000] [header.info] [::0]         MmsBaseURL = 
[2023-03-14T15:12:40.623+0000] [header.info] [::0]         MmsGroupId = 
[2023-03-14T15:12:40.623+0000] [header.info] [::0]          HttpProxy = 
[2023-03-14T15:12:40.623+0000] [header.info] [::0] DisableHttpKeepAlive = false
[2023-03-14T15:12:40.623+0000] [header.info] [::0]        HttpsCAFile = 
[2023-03-14T15:12:40.623+0000] [header.info] [::0] TlsRequireValidMMSServerCertificates = true
[2023-03-14T15:12:40.623+0000] [header.info] [::0] TlsMMSServerClientCertificate = 
[2023-03-14T15:12:40.623+0000] [header.info] [::0] KMIPProxyCertificateDir = /tmp
[2023-03-14T15:12:40.623+0000] [header.info] [::0] EnableLocalConfigurationServer = false
[2023-03-14T15:12:40.623+0000] [header.info] [::0] DialTimeoutSeconds = 40
[2023-03-14T15:12:40.623+0000] [header.info] [::0] KeepUnusedMongodbVersions = false
[2023-03-14T15:12:40.623+0000] [header.info] [::0] DisallowDowngrades = false
[2023-03-14T15:12:41.837+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:41.836] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:41.938+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:41.938] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:42.760+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:42.760] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:42.861+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:42.861] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:43.813+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:43.813] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:43.915+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:43.915] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:44.736+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:44.736] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:44.838+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:44.838] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:45.790+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:45.790] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:45.892+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:45.892] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:46.844+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:46.844] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down
[2023-03-14T15:12:46.946+0000] [.error] [src/mongoctl/processctl.go:RunCommand:1120] <hardwareMetricsCollector> [15:12:46.946] Server at mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017 (local=false) is down

$ curl -vk mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017

*   Trying 11.92.6.56:27017...
* TCP_NODELAY set
* Connected to mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local (11.92.6.56) port 27017 (#0)
> GET / HTTP/1.1
> Host: mongodb-rs-srv-0.mongodb-rs-srv-svc.dev-pv-pv1-qa2-a.svc.appspace-engineering-services-as1-a.appspace.cluster.local:27017
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Connection: close
< Content-Type: text/plain
< Content-Length: 85
< 
It looks like you are trying to access MongoDB over HTTP on the native driver port.
* Closing connection 0

$ cat /var/log/mongodb-mms-automation/automation-agent-verbose.log
[2023-03-14T16:07:27.721+0000] [.debug] [src/healthcheck/status.go:Dump:131] [16:07:27.721] Dumped processes health statuses: {"statuses":{"mongodb-rs-srv-0":{"IsInGoalState":true,"LastMongoUpTime":1678810047,"ExpectedToBeUp":true,"ReplicationStatus":2}},"mmsStatus":{"mongodb-rs-srv-0":{"name":"mongodb-rs-srv-0","lastGoalVersionAchieved":1729,"plans":[{"automationConfigVersion":1729,"started":"2023-03-14T16:07:18.879846161Z","completed":"2023-03-14T16:07:27.230825526Z","moves":[{"move":"Start","moveDoc":"Start the process","steps":[{"step":"StartFresh","stepDoc":"Start a mongo instance  (start fresh)","isWaitStep":false,"started":"2023-03-14T16:07:18.879879182Z","completed":"2023-03-14T16:07:25.931591111Z","result":"success"}]},{"move":"WaitAllRsMembersUp","moveDoc":"Wait until all members of this process' repl set are up","steps":[{"step":"WaitAllRsMembersUp","stepDoc":"Wait until all members of this process' repl set are up","isWaitStep":true,"started":"2023-03-14T16:07:25.931669234Z","completed":null,"result":"wait"}]},{"move":"RsInit","moveDoc":"Initialize a replica set including the current MongoDB process","steps":[{"step":"RsInit","stepDoc":"Initialize a replica set","isWaitStep":false,"started":null,"completed":null,"result":""}]},{"move":"WaitFeatureCompatibilityVersionCorrect","moveDoc":"Wait for featureCompatibilityVersion to be right","steps":[{"step":"WaitFeatureCompatibilityVersionCorrect","stepDoc":"Wait for featureCompatibilityVersion to be right","isWaitStep":true,"started":null,"completed":null,"result":""}]}]}],"errorCode":0,"errorString":""}}}
[2023-03-14T16:07:27.721+0000] [.info] [main/components/agent.go:Iterate:895] [16:07:27.721] All 1 Mongo processes are in goal state. 
[2023-03-14T16:07:27.721+0000] [.debug] [src/main/cm.go:mainLoop:411] [16:07:27.721] Starting main loop iteration 9 at 2023-03-14T16-07-27
[2023-03-14T16:07:27.721+0000] [.debug] [src/main/cm.go:mainLoop:420] [16:07:27.721] Checking for config updates...
[2023-03-14T16:07:27.722+0000] [.info] [src/config/config.go:ReadClusterConfig:433] [16:07:27.721] Retrieving cluster config from /var/lib/automation/config/cluster-config.json...
[2023-03-14T16:07:27.722+0000] [.debug] [src/config/config.go:ReadClusterConfig:451] [16:07:27.722] Not rereading cluster config at 16:07:27; lastEdition=2023-03-14 16:07:15.417825541 +0000 UTC

What did you expect? A clean ReplicaSet start-up.

Operator Information

        - name: OPERATOR_NAME
          value: mongodb-kubernetes-operator
        - name: AGENT_IMAGE
          value: quay.io/mongodb/mongodb-agent:12.0.15.7646-1
        - name: VERSION_UPGRADE_HOOK_IMAGE
          value: quay.io/mongodb/mongodb-kubernetes-operator-version-upgrade-post-start-hook:1.0.6
        - name: READINESS_PROBE_IMAGE
          value: quay.io/mongodb/mongodb-kubernetes-readinessprobe:1.0.12
        - name: MONGODB_IMAGE
          value: mongo
        - name: MONGODB_REPO_URL
          value: docker.io
        image: quay.io/mongodb/mongodb-kubernetes-operator:0.7.8

Kubernetes Cluster Information: v1.23.16-gke.1100

WORKAROUND APPLIED

We had to delete the following readinessProbe from the MongoDB StatefulSet:

        readinessProbe:
          exec:
            command:
            - /opt/scripts/readinessprobe
          failureThreshold: 40
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1 
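
One way to do that (a sketch; the StatefulSet name and the container index of the mongodb-agent container are assumptions for our deployment, and the operator may re-add the probe on its next reconciliation) is a JSON patch:

$ kubectl patch statefulset mongodb-rs-srv --type=json \
    -p='[{"op":"remove","path":"/spec/template/spec/containers/1/readinessProbe"}]'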

Could anyone help me? I'm not a MongoDB expert and I'm still trying to figure out what's going on...

leosunmo commented 1 year ago

I am seeing the same issue. Upgrading to Operator 0.8.0 made no difference.

The second Pod in the StatefulSet is unable to start because the first Pod is not ready. The first Pod is not ready because the WaitAllRsMembersUp step has a result of "wait", as you can see in /var/log/mongodb-mms-automation/healthstatus/agent-health-status.json.

This obviously can never converge: the first Pod won't become ready until the second Pod is up, but the second Pod won't be created until the first Pod is ready...
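
For context, that ordering is the default StatefulSet behaviour: with the default podManagementPolicy of OrderedReady, mdb-1 is not created until mdb-0 is Running and Ready. You can check what the operator set with the following (StatefulSet name assumed from the Pod names below; an empty result means the default OrderedReady):

$ kubectl get statefulset mdb -o jsonpath='{.spec.podManagementPolicy}'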

Occasionally, by pure chance and with some patience, I get the second Pod to start: the first Pod sometimes becomes ready just long enough for the second Pod to be created, and both Pods can then stay ready for a while.

k get po
NAME                                          READY   STATUS    RESTARTS   AGE
mdb-0                                         2/2     Running   0          5m5s
mdb-1                                         2/2     Running   0          4m47s
mongodb-kubernetes-operator-8f9756c67-47q6z   1/1     Running   0          2d18h

This doesn't last though: the WaitAllRsMembersUp step goes back to the "wait" result, causing the readinessprobe to return 1.

I have no name!@mdb-0:/$ grep -o -E 'WaitAllRsMembersUp.+wait"}]' /var/log/mongodb-mms-automation/healthstatus/agent-health-status.json
WaitAllRsMembersUp","moveDoc":"Wait until all members of this process' repl set are up","steps":[{"step":"WaitAllRsMembersUp","stepDoc":"Wait until all members of this process' repl set are up","isWaitStep":true,"started":"2023-05-29T08:41:04.756579981Z","completed":null,"result":"wait"}]

I have no name!@mdb-0:/$ /opt/scripts/readinessprobe; echo $?
1

Then, after a few minutes (around 8 minutes total in this case), both Pods are no longer ready.

k get po
NAME                                          READY   STATUS    RESTARTS   AGE
mdb-0                                         1/2     Running   0          9m8s
mdb-1                                         1/2     Running   0          8m50s
mongodb-kubernetes-operator-8f9756c67-47q6z   1/1     Running   0          2d18h

leosunmo commented 1 year ago

What's also confusing to me is that the Agent logs seem to indicate that it's happy with the state.

[2023-05-29T08:55:10.192+0000] [.info] [main/components/agent.go:Iterate:892] [08:55:10.192] All 1 Mongo processes are in goal state.

The readinessprobe, however, is not happy, and at the same time the Operator isn't happy with the goal state either.

2023-05-29T08:55:21.328Z    DEBUG   agent/agent_readiness.go:65 The Agent in the Pod 'mdb-0' hasn't reached the goal state yet (goal: 23771, agent: 23768)  {"ReplicaSet": "sorting/mdb"}

Look at the timestamps: these conflicting messages are almost always logged at the same time.
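
If you want to see the two numbers the Operator is comparing, the target automation config is kept in a Secret the Operator manages; the <resource>-config name and the cluster-config.json key used below are assumptions based on my install, so verify with kubectl get secrets first:

$ kubectl get secret mdb-config -o jsonpath='{.data.cluster-config\.json}' | base64 -d | jq .version
$ kubectl exec mdb-0 -c mongodb-agent -- \
    cat /var/log/mongodb-mms-automation/healthstatus/agent-health-status.json | jq '.mmsStatus["mdb-0"].lastGoalVersionAchieved'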

github-actions[bot] commented 1 year ago

This issue is being marked stale because it has been open for 60 days with no activity. Please comment if this issue is still affecting you. If there is no change, this issue will be closed in 30 days.

slaskawi commented 1 year ago

This issue should be fixed by https://github.com/mongodb/mongodb-kubernetes-operator/pull/1332. Please upgrade your mongodb-kubernetes-readinessprobe to 1.0.15.
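
If the operator was installed from the plain manifests, the readinessprobe version is controlled by the READINESS_PROBE_IMAGE environment variable on the operator Deployment; something like the following should do it (Deployment name assumed from a default install):

$ kubectl set env deployment/mongodb-kubernetes-operator \
    READINESS_PROBE_IMAGE=quay.io/mongodb/mongodb-kubernetes-readinessprobe:1.0.15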

dimovnike commented 1 year ago

This problem reappears when I have more than one replica and the pod labels are changed:

statefulSet:
    spec:
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp

The labels were changed as suggested here: https://github.com/mongodb/mongodb-kubernetes-operator/blob/master/config/samples/mongodb.com_v1_mongodbcommunity_cr_podantiaffinity.yaml

The StatefulSet and Pod both become ready, but the agent gets stuck in the "hasn't reached the goal state yet" state forever.
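
For reference, the override above sits under spec.statefulSet of the MongoDBCommunity resource, roughly like this (a trimmed sketch; the resource name, members, and version are placeholders, and users/security settings are omitted):

apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: myapp-mongodb
spec:
  members: 3
  type: ReplicaSet
  version: "4.4.18"
  statefulSet:
    spec:
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp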