percona / percona-server-mongodb-operator

Percona Operator for MongoDB
https://www.percona.com/doc/kubernetes-operator-for-psmongodb/
Apache License 2.0
352 stars 144 forks source link

Constant host unreachable errors within pod #1447

Open kevinlmadison opened 9 months ago

kevinlmadison commented 9 months ago

Report

Hi all, I’ve been trying to debug this for a good while now, I had a pretty large mongodb deployment and I thought I had a lot of logs simply due to its size. After some digging tho it seemed that they were mostly errors. The strange thing is that the cluster still appears to work. I’ve removed most of the components including sharding, backups, and pmm but I’m still seeing the errors even with a single replicaset. I’ve also disabled istio and effectively turned off the firewall. I’m deploying this cluster using the ansible helm module, and below I’ll paste the full config and a big sample of the logs.

More about the problem

and here is a sample of the logs:

{"t":{"$date":"2024-01-16T22:05:09.468+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn601","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:09.468+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn601","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:12.483+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn602","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:12.484+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn602","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:12.531+00:00"},"s":"D1", "c":"REPL",     "id":6208204, "ctx":"conn469","msg":"Error while waiting for hello response","attr":{"status":{"code":262,"codeName":"ExceededTimeLimit","errmsg":"operation exceeded time limit"}}}
{"t":{"$date":"2024-01-16T22:05:13.179+00:00"},"s":"D1", "c":"REPL",     "id":6208204, "ctx":"conn444","msg":"Error while waiting for hello response","attr":{"status":{"code":262,"codeName":"ExceededTimeLimit","errmsg":"operation exceeded time limit"}}}
{"t":{"$date":"2024-01-16T22:05:13.906+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn607","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:13.906+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn607","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:13.907+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn608","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:13.907+00:00"},"s":"D1", "c":"REPL",     "id":6208204, "ctx":"conn606","msg":"Error while waiting for hello response","attr":{"status":{"code":279,"codeName":"ClientDisconnect","errmsg":"operation was interrupted"}}}
{"t":{"$date":"2024-01-16T22:05:13.907+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn606","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/db/repl/replication_coordinator_impl.cpp","line":2453}}
{"t":{"$date":"2024-01-16T22:05:13.907+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn608","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:13.908+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn606","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1104}}
{"t":{"$date":"2024-01-16T22:05:13.908+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn606","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1087}}
{"t":{"$date":"2024-01-16T22:05:13.908+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn606","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1104}}
{"t":{"$date":"2024-01-16T22:05:13.908+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn606","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1087}}
{"t":{"$date":"2024-01-16T22:05:13.908+00:00"},"s":"D1", "c":"COMMAND",  "id":21962,   "ctx":"conn606","msg":"Assertion while executing command","attr":{"command":"hello","db":"admin","commandArgs":{"hello":1,"helloOk":true,"topologyVersion":{"processId":{"$oid":"65a6fbe68c4bda018b2c0028"},"counter":13},"maxAwaitTimeMS":10000,"$db":"admin","$readPreference":{"mode":"primaryPreferred"}},"error":"ClientDisconnect: operation was interrupted"}}
{"t":{"$date":"2024-01-16T22:05:13.908+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn606","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:13.908+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn606","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn605","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn604","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn605","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn604","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"REPL",     "id":6208204, "ctx":"conn603","msg":"Error while waiting for hello response","attr":{"status":{"code":279,"codeName":"ClientDisconnect","errmsg":"operation was interrupted"}}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn603","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/db/repl/replication_coordinator_impl.cpp","line":2453}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn603","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1104}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn603","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1087}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn603","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1104}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn603","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1087}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"COMMAND",  "id":21962,   "ctx":"conn603","msg":"Assertion while executing command","attr":{"command":"hello","db":"admin","commandArgs":{"hello":1,"helloOk":true,"topologyVersion":{"processId":{"$oid":"65a6fbe68c4bda018b2c0028"},"counter":13},"maxAwaitTimeMS":10000,"$db":"admin","$readPreference":{"mode":"primaryPreferred"}},"error":"ClientDisconnect: operation was interrupted"}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn603","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:13.913+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn603","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:14.137+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn609","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:14.137+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn609","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:14.137+00:00"},"s":"D1", "c":"REPL",     "id":6208204, "ctx":"conn610","msg":"Error while waiting for hello response","attr":{"status":{"code":279,"codeName":"ClientDisconnect","errmsg":"operation was interrupted"}}}
{"t":{"$date":"2024-01-16T22:05:14.137+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn610","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/db/repl/replication_coordinator_impl.cpp","line":2453}}
{"t":{"$date":"2024-01-16T22:05:14.137+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn610","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1104}}
{"t":{"$date":"2024-01-16T22:05:14.138+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn610","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1087}}
{"t":{"$date":"2024-01-16T22:05:14.138+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn610","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1104}}
{"t":{"$date":"2024-01-16T22:05:14.138+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn610","msg":"User assertion","attr":{"error":"ClientDisconnect: operation was interrupted","file":"src/mongo/util/future_impl.h","line":1087}}
{"t":{"$date":"2024-01-16T22:05:14.138+00:00"},"s":"D1", "c":"COMMAND",  "id":21962,   "ctx":"conn610","msg":"Assertion while executing command","attr":{"command":"hello","db":"admin","commandArgs":{"hello":1,"helloOk":true,"topologyVersion":{"processId":{"$oid":"65a6fbe68c4bda018b2c0028"},"counter":13},"maxAwaitTimeMS":10000,"$db":"admin","$readPreference":{"mode":"primaryPreferred"}},"error":"ClientDisconnect: operation was interrupted"}}
{"t":{"$date":"2024-01-16T22:05:14.138+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn611","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:14.138+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn611","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:14.139+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn610","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:14.139+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn610","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
{"t":{"$date":"2024-01-16T22:05:15.464+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn612","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":299}}
{"t":{"$date":"2024-01-16T22:05:15.464+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn612","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}
"t":{"$date":"2024-01-16T22:10:25.733+00:00"},"s":"D1", "c":"ASSERT",   "id":23074,   "ctx":"conn1060","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":444}}

Custom resource is reporting a ready status, here are its logs:

2024-01-22T17:42:38.722Z    INFO    Created a new mongo key    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "26ab8173-6077-437d-9d50-6123d1c182a2", "KeyName": "kev-test-psmdb-db-mongodb-keyfile"}
2024-01-22T17:42:38.731Z    INFO    Created a new mongo key    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "26ab8173-6077-437d-9d50-6123d1c182a2", "KeyName": "kev-test-psmdb-db-mongodb-encryption-key"}
2024-01-22T17:42:38.779Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "26ab8173-6077-437d-9d50-6123d1c182a2", "replset": "shard", "size": 3, "pods": 0}
2024-01-22T17:42:38.799Z    INFO    Cluster state changed    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "26ab8173-6077-437d-9d50-6123d1c182a2", "previous": "", "current": "initializing"}
2024-01-22T17:42:38.850Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "af333758-e95d-4100-9f52-cfe3cbf57cdb", "replset": "shard", "size": 3, "pods": 0}
2024-01-22T17:42:43.867Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "f074b1e2-166a-456d-b8c4-c27b61b0d77a", "replset": "shard", "size": 3, "pods": 1}
2024-01-22T17:42:43.947Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "009f498e-7342-427e-8e6e-17b82542081d", "replset": "shard", "size": 3, "pods": 1}
2024-01-22T17:42:48.943Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "987a4804-a466-4fe3-8ab1-3ae5140fd989", "replset": "shard", "size": 3, "pods": 1}
2024-01-22T17:42:53.989Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "cfab78a1-aa52-4adb-a08c-a25b6b5bf0a5", "replset": "shard", "size": 3, "pods": 1}
2024-01-22T17:42:59.029Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "e81e933d-7d9e-4ef5-b173-cf0bd43303ac", "replset": "shard", "size": 3, "pods": 1}
2024-01-22T17:43:04.078Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "50774dcf-0a7b-4a39-b645-b9c79e266449", "replset": "shard", "size": 3, "pods": 1}
2024-01-22T17:43:09.162Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "3fa6b94b-3b6b-4235-ad28-c66e97550762", "replset": "shard", "size": 3, "pods": 2}
2024-01-22T17:43:09.260Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "49550521-00e6-4d8c-be77-756ca1e99cd0", "replset": "shard", "size": 3, "pods": 2}
2024-01-22T17:43:14.248Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "df050785-9e5c-4a0e-b4ed-89eb420d1696", "replset": "shard", "size": 3, "pods": 2}
2024-01-22T17:43:19.312Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "fe180278-1aaf-480a-a804-398c9df72e9a", "replset": "shard", "size": 3, "pods": 2}
2024-01-22T17:43:24.358Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "c48ac446-c9f1-409d-91b5-f6047cc05b3c", "replset": "shard", "size": 3, "pods": 2}
2024-01-22T17:43:29.398Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "ad5fc822-752e-4b4e-8f4a-183bff39c200", "replset": "shard", "size": 3, "pods": 2}
2024-01-22T17:43:34.453Z    INFO    Waiting for the pods    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "3ebe6019-550b-4ac0-b2cc-9c5e55b7a3f2", "replset": "shard", "size": 3, "pods": 2}
2024-01-22T17:43:49.532Z    INFO    initiating replset    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "0d2ff7eb-2cdc-4d08-93e1-a4dc02dc10a4", "replset": "shard", "pod": "kev-test-psmdb-db-shard-0"}
2024-01-22T17:43:58.669Z    INFO    replset initialized    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "0d2ff7eb-2cdc-4d08-93e1-a4dc02dc10a4", "replset": "shard", "pod": "kev-test-psmdb-db-shard-0"}
2024-01-22T17:43:59.205Z    INFO    Fixing member tags    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "459a4f0b-f928-4fc3-ad20-e22c5d92c4c3", "replset": "shard"}
2024-01-22T17:43:59.205Z    DEBUG    Running replSetReconfig config    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "459a4f0b-f928-4fc3-ad20-e22c5d92c4c3", "cfg": {"_id":"shard","version":2,"members":[{"_id":0,"host":"kev-test-psmdb-db-shard-0.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1,"tags":{"podName":"kev-test-psmdb-db-shard-0","serviceName":"kev-test-psmdb-db"},"secondaryDelaySecs":0,"votes":1}],"protocolVersion":1,"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":"65aea9578f40f2bb98bd0171"},"writeConcernMajorityJournalDefault":true}}
2024-01-22T17:43:59.209Z    INFO    Adding new nodes    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "459a4f0b-f928-4fc3-ad20-e22c5d92c4c3", "replset": "shard"}
2024-01-22T17:43:59.209Z    DEBUG    Running replSetReconfig config    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "459a4f0b-f928-4fc3-ad20-e22c5d92c4c3", "cfg": {"_id":"shard","version":3,"members":[{"_id":0,"host":"kev-test-psmdb-db-shard-0.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1,"tags":{"podName":"kev-test-psmdb-db-shard-0","serviceName":"kev-test-psmdb-db"},"secondaryDelaySecs":0,"votes":1},{"_id":1,"host":"kev-test-psmdb-db-shard-1.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":2,"tags":{"podName":"kev-test-psmdb-db-shard-1","serviceName":"kev-test-psmdb-db"},"votes":1}],"protocolVersion":1,"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":"65aea9578f40f2bb98bd0171"},"writeConcernMajorityJournalDefault":true}}
2024-01-22T17:43:59.231Z    INFO    Configuring member votes and priorities    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "459a4f0b-f928-4fc3-ad20-e22c5d92c4c3", "replset": "shard"}
2024-01-22T17:43:59.231Z    DEBUG    Running replSetReconfig config    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "459a4f0b-f928-4fc3-ad20-e22c5d92c4c3", "cfg": {"_id":"shard","version":4,"members":[{"_id":0,"host":"kev-test-psmdb-db-shard-0.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":2,"tags":{"podName":"kev-test-psmdb-db-shard-0","serviceName":"kev-test-psmdb-db"},"secondaryDelaySecs":0,"votes":1},{"_id":1,"host":"kev-test-psmdb-db-shard-1.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":0,"tags":{"podName":"kev-test-psmdb-db-shard-1","serviceName":"kev-test-psmdb-db"},"votes":0}],"protocolVersion":1,"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":"65aea9578f40f2bb98bd0171"},"writeConcernMajorityJournalDefault":true}}
2024-01-22T17:44:04.188Z    INFO    Adding new nodes    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "ca1d90c5-512e-4d47-b455-14193e6634af", "replset": "shard"}
2024-01-22T17:44:04.188Z    DEBUG    Running replSetReconfig config    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "ca1d90c5-512e-4d47-b455-14193e6634af", "cfg": {"_id":"shard","version":6,"members":[{"_id":0,"host":"kev-test-psmdb-db-shard-0.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":2,"tags":{"podName":"kev-test-psmdb-db-shard-0","serviceName":"kev-test-psmdb-db"},"secondaryDelaySecs":0,"votes":1},{"_id":1,"host":"kev-test-psmdb-db-shard-1.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":0,"tags":{"podName":"kev-test-psmdb-db-shard-1","serviceName":"kev-test-psmdb-db"},"secondaryDelaySecs":0,"votes":0},{"_id":2,"host":"kev-test-psmdb-db-shard-2.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":2,"tags":{"podName":"kev-test-psmdb-db-shard-2","serviceName":"kev-test-psmdb-db"},"votes":1}],"protocolVersion":1,"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":"65aea9578f40f2bb98bd0171"},"writeConcernMajorityJournalDefault":true}}
2024-01-22T17:44:04.216Z    INFO    Configuring member votes and priorities    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "ca1d90c5-512e-4d47-b455-14193e6634af", "replset": "shard"}
2024-01-22T17:44:04.216Z    DEBUG    Running replSetReconfig config    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "ca1d90c5-512e-4d47-b455-14193e6634af", "cfg": {"_id":"shard","version":7,"members":[{"_id":0,"host":"kev-test-psmdb-db-shard-0.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":2,"tags":{"podName":"kev-test-psmdb-db-shard-0","serviceName":"kev-test-psmdb-db"},"secondaryDelaySecs":0,"votes":1},{"_id":1,"host":"kev-test-psmdb-db-shard-1.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":2,"tags":{"podName":"kev-test-psmdb-db-shard-1","serviceName":"kev-test-psmdb-db"},"secondaryDelaySecs":0,"votes":1},{"_id":2,"host":"kev-test-psmdb-db-shard-2.kev-test-psmdb-db-shard.app.svc.dev.ahq:27017","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":2,"tags":{"podName":"kev-test-psmdb-db-shard-2","serviceName":"kev-test-psmdb-db"},"votes":1}],"protocolVersion":1,"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":"65aea9578f40f2bb98bd0171"},"writeConcernMajorityJournalDefault":true}}
2024-01-22T17:44:16.958Z    INFO    Cluster state changed    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "de7ea753-5e72-4be0-988a-08850b8ffd9d", "previous": "initializing", "current": "ready"}
2024-01-22T17:44:17.309Z    INFO    update Mongo version to 6.0.9-7 (fetched from db)    {"controller": "psmdb-controller", "object": {"name":"kev-test-psmdb-db","namespace":"app"}, "namespace": "app", "name": "kev-test-psmdb-db", "reconcileID": "d5aee049-1b1a-4b61-a34b-d46db7cb5736"}

Steps to reproduce

- name: Deploy Percona Server for MongoDB
  kubernetes.core.helm:
    name: kev-test-mongodb
    chart_ref: percona/psmdb-db
    chart_version: "1.15.1"
    release_namespace: app
    wait: true
    wait_timeout: "10m"
    values:
      clusterServiceDNSSuffix: 'svc.{{ cluster_domain }}'
      finalizers:
        - delete-psmdb-pvc
      nameOverride: ""
      fullnameOverride: ""
      crVersion: 1.15.0
      pause: false
      unmanaged: false
      allowUnsafeConfigurations: false
      multiCluster:
        enabled: false
      updateStrategy: SmartUpdate
      # updateStrategy: RollingUpdate
      upgradeOptions:
        versionServiceEndpoint: https://check.percona.com
        apply: disabled
        schedule: "0 2 * * *"
        setFCV: false
      image:
        repository: percona/percona-server-mongodb
        tag: 6.0.9-7
      imagePullPolicy: Always
      secrets: {}
      pmm:
        enabled: false
      replsets:
        - name: shard
          size: 3
          annotations:
            sidecar.istio.io/inject: "false"
          configuration: |
            security:
              enableEncryption: false
            systemLog:
              verbosity: 1
          serviceAccountName: app
          storage:
            engine: inMemory
            inMemory:
              engineConfig:
                inMemorySizeRatio: 0.9
          podDisruptionBudget:
            maxUnavailable: 1
          expose:
            enabled: true
            exposeType: ClusterIP
          nonvoting:
            enabled: false
            size: 1
          arbiter:
            enabled: false
            size: 1
          resources:
            limits:
              cpu: "2048m"
              memory: "5.0G"
            requests:
              cpu: "300m"
              memory: "0.5G"
          volumeSpec:
            pvc:
              storageClassName: "ceph-block"
              accessModes: [ "ReadWriteOnce" ]
              resources:
                requests:
                  storage: 1Gi
      sharding:
        enabled: false
      backup:
        enabled: false

Versions

  1. Kubernetes RKE2 1.28.3
  2. Operator percona/percona-server-mongodb-operator:1.15.0
  3. Database percona/percona-server-mongodb:6.0.9-7

Anything else?

No response

spron-in commented 9 months ago

Hello @kevinlmadison , quickly looking into the logs did not spark any insights right away, hence it needs a deeper look. I asked our MongoDB experts, they noticed that you use inMemory engine. They are asking if writeConcernMajorityJournalDefault was set to false or not.

kevinlmadison commented 9 months ago

@spron-in I did some testing around the writeConcernMajorityJournalDefault setting but this did not affect the error logs. To clarify I checked the value that we have set by default and it is true which I discovered is not compatible with the inMemory engine. I tried 1) Setting this value to false via rs.reconf() and 2) Running the replicaset with the wiredTiger engine. Neither of these changes affected the error logs. Hopefully this was a good test and hopefully I described it clearly. Thank you again!

spron-in commented 9 months ago

Hello @kevinlmadison ,

so I did some digging. First, this issue has nothing to do with the engine. WiredTiger produces the same errors. Sorry for leading you in the woods.

Second, seems user assert errors started to appear since MongoDB 5. There are a couple of issues opened about it: https://www.mongodb.com/community/forums/t/user-assertions-increasing-constantly-after-upgrade-from-4-4-4-to-5-0-3/127991 https://www.mongodb.com/community/forums/t/user-asserts-on-replica-set-running-on-kubernetes/162698

In short, what MongoDB Inc suggests that it is normal behavior and that you should do is reduce the log level. As long as you don't see any issues with the cluster health - what are your thoughts about lowering the log level?

systemLog:
  verbosity: 0
kevinlmadison commented 8 months ago

Hey @spron-in

Thank you so much for looking into this! With the information you've provided I think we will be able to confidently move forward. I'll use this issue to document further.

Thank you again!