monogon-dev / monogon

The Monogon Monorepo. May contain traces of peanuts and a ✨pure Go Linux userland✨. Work in progress!
https://monogon.tech
Apache License 2.0

Curator apparent leadership loss / flake #276

Closed: lorenz closed this issue 5 months ago

lorenz commented 8 months ago

The current curator exhibits some odd behavior where it loses leadership for a very brief moment and then immediately regains it. Sadly, most of the logs are unavailable due to the enormous log volume the control plane nodes produce (~30k lines/min). This excerpt is complete; no curator log entries were left out between the two events.

root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.847105 trace.go:144] Span 17996a4f5b80aff6: RPC recv: HeartbeatUpdateRequest: 
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.847120 trace.go:138] Span 17996a4f5b80aff6: RPC send: HeartbeatUpdateResponse: 
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.851581 trace.go:138] Span 1799758897d16a38: RPC send: Node: pubkey:"\xffd\x93\x11\xef\xc1\xa2\xd6?/\xf3\xf0\xdb&\x0f@I\xcep 4`\xb6\xa6\xad\xdf=[+H\x17\xf1" id:"...
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.857828 server_authentication.go:97] Span 179975889958fae0: RPC invoked: unary request: /metropolis.node.core.curator.proto.api.Curator/CommitNode
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.857955 server_authentication.go:108] Span 179975889958fae0: RPC peerInfo: unauthenticated: pubkey ccd99580b27761cdca34c028374b737dbc2496dab117968cbf5636e9af93cb5a
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.857966 state_cluster.go:194] Span 179975889958fae0: loadCluster...
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.857978 impl_leader.go:96] Span 179975889958fae0: txnAsLeader(get: /cluster/configuration)...
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.858191 impl_leader.go:105] Span 179975889958fae0: txnAsLeader(...): rejected (lost leadership)
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.858217 server_authentication.go:115] Span 179975889958fae0: RPC send: error: rpc error: code = Unavailable desc = lost leadership
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.861264 server_authentication.go:97] Span 17997588998d68f3: RPC invoked: unary request: /metropolis.proto.api.Management/GetClusterInfo
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.861526 server_authentication.go:108] Span 17997588998d68f3: RPC peerInfo: user: owner
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.861541 impl_leader.go:96] Span 17997588998d68f3: txnAsLeader(get: /nodes/)...
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.866098 trace.go:144] Span 1798dbbf4a4c9962: RPC recv: HeartbeatUpdateRequest: 
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.866109 trace.go:138] Span 1798dbbf4a4c9962: RPC send: HeartbeatUpdateResponse: 
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.867596 impl_leader.go:108] Span 17997588998d68f3: txnAsLeader(...): ok
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.871071 trace.go:144] Span 17996a51a4390274: RPC recv: HeartbeatUpdateRequest: 
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.871087 trace.go:138] Span 17996a51a4390274: RPC send: HeartbeatUpdateResponse: 
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.871722 state_cluster.go:194] Span 17997588998d68f3: loadCluster...
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.871736 impl_leader.go:96] Span 17997588998d68f3: txnAsLeader(get: /cluster/configuration)...
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.872045 impl_leader.go:108] Span 17997588998d68f3: txnAsLeader(...): ok
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.872054 state_cluster.go:204] Span 17997588998d68f3: loadCluster: 1 KVs
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.872063 state_cluster.go:213] Span 17997588998d68f3: loadCluster: unmarshal ok
root.role.controlplane.launcher.curator.listener.rpc I1120 22:32:43.872167 server_authentication.go:117] Span 17997588998d68f3: RPC send: ok, GetClusterInfoResponse: cluster_directory:{nodes:{public_key:"\x00\x11\xdbtg7\x1d\xcb=ݲ\\۞\xb2I\x82\xbf\x85\xcaCY\xa7RU\x8...
q3k commented 8 months ago

I don't think it actually regains leadership anymore? Filtering out the heartbeat requests:

root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.269488 server_authentication.go:97] Span 1799a5ea962c325e: RPC invoked: unary request: /metropolis.node.core.curator.proto.api.Curator/UpdateNodeClusterNetworking
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.269696 server_authentication.go:108] Span 1799a5ea962c325e: RPC peerInfo: node: metropolis-0e654b73911962d5631be2e6e00bcf1d, PERMISSION_READ_CLUSTER_STATUS, PERMISSION_UPDATE_NODE_SELF
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.269715 state_node.go:363] Span 1799a5ea962c325e: loadNode(metropolis-0e654b73911962d5631be2e6e00bcf1d)...
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.269727 impl_leader.go:96] Span 1799a5ea962c325e: txnAsLeader(get: /nodes/metropolis-0e654b73911962d5631be2e6e00bcf1d)...
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.269934 impl_leader.go:105] Span 1799a5ea962c325e: txnAsLeader(...): rejected (lost leadership)
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.269949 server_authentication.go:115] Span 1799a5ea962c325e: RPC send: error: rpc error: code = Unavailable desc = lost leadership
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.750529 server_authentication.go:97] Span 1799a5eab2d848d0: RPC invoked: unary request: /metropolis.node.core.curator.proto.api.Curator/UpdateNodeClusterNetworking
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.750734 server_authentication.go:108] Span 1799a5eab2d848d0: RPC peerInfo: node: metropolis-98e948c2296dd0aae5897f9d4aeac32c, PERMISSION_READ_CLUSTER_STATUS, PERMISSION_UPDATE_NODE_SELF
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.750749 state_node.go:363] Span 1799a5eab2d848d0: loadNode(metropolis-98e948c2296dd0aae5897f9d4aeac32c)...
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.750759 impl_leader.go:96] Span 1799a5eab2d848d0: txnAsLeader(get: /nodes/metropolis-98e948c2296dd0aae5897f9d4aeac32c)...
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.750992 impl_leader.go:105] Span 1799a5eab2d848d0: txnAsLeader(...): rejected (lost leadership)
root.role.controlplane.launcher.curator.listener.rpc I1121 13:19:21.751006 server_authentication.go:115] Span 1799a5eab2d848d0: RPC send: error: rpc error: code = Unavailable desc = lost leadership

So it's returning 'lost leadership' in a tight loop. It seems the curator leader/follower logic got stuck. If we want to stop the bleeding, restarting this node might help.

q3k commented 8 months ago

https://github.com/monogon-dev/monogon/blob/main/metropolis/node/core/curator/impl_leader.go#L104

This is the code path taken when this error is returned.
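
For reference, the guard rejecting these transactions compares the election lock key's creation revision inside the etcd transaction itself, so a stale leader can never write. A minimal sketch of that pattern against the stock go.etcd.io/etcd/client/v3 API (guardedTxn, lockKey and lockRev are illustrative names, not the actual curator identifiers):

package main

import (
	"context"
	"errors"

	clientv3 "go.etcd.io/etcd/client/v3"
)

var errLostLeadership = errors.New("lost leadership")

// guardedTxn applies ops only if our election lock key still exists at the
// creation revision we won the election with. If another node took over (or
// our key vanished), the compare fails and we bail out instead of writing
// as a stale leader.
func guardedTxn(ctx context.Context, cl *clientv3.Client, lockKey string, lockRev int64, ops ...clientv3.Op) error {
	resp, err := cl.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision(lockKey), "=", lockRev)).
		Then(ops...).
		Commit()
	if err != nil {
		return err
	}
	if !resp.Succeeded {
		return errLostLeadership
	}
	return nil
}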

q3k commented 8 months ago

After reboot:

root.role.controlplane.launcher.curator I1121 13:29:20.316399 curator.go:311] Waiting for consensus...
root.role.controlplane.launcher.curator I1121 13:29:22.950872 curator.go:318] Got consensus, starting up...
root.role.controlplane.launcher.curator I1121 13:29:22.950913 curator.go:343] Curator starting on prefix "/leader" with lease TTL of 10 seconds...
root.role.controlplane.launcher.curator I1121 13:29:22.950989 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator.listener I1121 13:29:22.951096 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator I1121 13:29:22.951550 curator.go:262] Detecting our own stale lock, attempting to remove...
root.role.controlplane.launcher.curator I1121 13:29:22.952324 curator.go:284] Cleanup successful
root.role.controlplane.launcher.curator I1121 13:29:22.952686 curator.go:160] Curator established lease, ID: 1828052370563101710
root.role.controlplane.launcher.curator I1121 13:29:22.953243 curator.go:206] Curator became leader.
root.role.controlplane.launcher.curator I1121 13:29:22.953271 curator.go:304] Election watcher: this node's curator is leader (lock key "/leader/195e8bf21239740e", rev 291683715)
root.role.controlplane.launcher.curator.listener I1121 13:29:23.263027 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:29:23.263715 listener.go:98] This curator is a leader.
root.role.controlplane.launcher.curator.listener I1121 13:29:23.270009 listener.go:142] Leader running until leadership lost.
root.role.controlplane.launcher.curator W1121 13:29:43.976224 curator.go:222] Session done, quitting.
root.role.controlplane.launcher.curator I1121 13:29:44.900802 curator.go:354] Curator election round done: curator session done
root.role.controlplane.launcher.curator I1121 13:29:44.900817 curator.go:355] Curator election restarting...
root.role.controlplane.launcher.curator I1121 13:29:44.900834 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator.listener I1121 13:29:45.333246 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:29:46.042301 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:29:47.629728 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator I1121 13:29:49.167399 curator.go:160] Curator established lease, ID: 1828052370563103391
root.role.controlplane.launcher.curator.listener I1121 13:29:49.699680 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator I1121 13:29:52.279658 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator.listener I1121 13:29:52.977318 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:29:52.978869 listener.go:114] This curator is a follower (leader is "metropolis-ff649311efc1a2d63f2ff3f0db260f40"), starting minimal implementation.
root.role.controlplane.launcher.curator.listener I1121 13:29:52.978918 listener.go:153] Follower running until leadership change.
root.role.controlplane.launcher.curator I1121 13:29:54.141846 curator.go:206] Curator became leader.
root.role.controlplane.launcher.curator W1121 13:29:54.141871 curator.go:222] Session done, quitting.
root.role.controlplane.launcher.curator I1121 13:29:54.142138 curator.go:304] Election watcher: this node's curator is leader (lock key "/leader/195e8bf212397a9f", rev 291684547)
root.role.controlplane.launcher.curator.listener I1121 13:29:54.684120 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:29:54.685610 listener.go:98] This curator is a leader.
root.role.controlplane.launcher.curator.listener I1121 13:29:54.685684 listener.go:142] Leader running until leadership lost.
root.role.controlplane.launcher.curator I1121 13:29:54.876717 curator.go:354] Curator election round done: curator session done
root.role.controlplane.launcher.curator I1121 13:29:54.876730 curator.go:355] Curator election restarting...
root.role.controlplane.launcher.curator I1121 13:29:54.876757 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator I1121 13:29:54.876770 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator.listener I1121 13:29:55.404365 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:29:56.016302 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:29:57.387653 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator I1121 13:29:57.467450 curator.go:160] Curator established lease, ID: 1828052370563103590
root.role.controlplane.launcher.curator.listener I1121 13:29:59.417749 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator I1121 13:30:01.548779 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator I1121 13:30:01.809584 curator.go:206] Curator became leader.
root.role.controlplane.launcher.curator W1121 13:30:01.809973 curator.go:222] Session done, quitting.
root.role.controlplane.launcher.curator I1121 13:30:01.812041 curator.go:304] Election watcher: this node's curator is leader (lock key "/leader/195e8bf212397b66", rev 291684707)
root.role.controlplane.launcher.curator.listener I1121 13:30:02.177906 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:30:02.190079 listener.go:98] This curator is a leader.
root.role.controlplane.launcher.curator.listener I1121 13:30:02.190149 listener.go:142] Leader running until leadership lost.
root.role.controlplane.launcher.curator I1121 13:30:05.199225 curator.go:354] Curator election round done: curator session done
root.role.controlplane.launcher.curator I1121 13:30:05.199258 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator I1121 13:30:05.199289 curator.go:355] Curator election restarting...
root.role.controlplane.launcher.curator I1121 13:30:05.199614 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator I1121 13:30:05.498863 curator.go:160] Curator established lease, ID: 1828052370563103772
root.role.controlplane.launcher.curator.listener I1121 13:30:05.504645 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator I1121 13:30:05.580718 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator.listener I1121 13:30:06.335290 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:30:06.340268 listener.go:114] This curator is a follower (leader is "metropolis-ff649311efc1a2d63f2ff3f0db260f40"), starting minimal implementation.
root.role.controlplane.launcher.curator.listener I1121 13:30:06.340320 listener.go:153] Follower running until leadership change.
root.role.controlplane.launcher.curator I1121 13:30:07.266672 curator.go:206] Curator became leader.
root.role.controlplane.launcher.curator I1121 13:30:07.267420 curator.go:304] Election watcher: this node's curator is leader (lock key "/leader/195e8bf212397c1c", rev 291684855)
root.role.controlplane.launcher.curator.listener I1121 13:30:07.636717 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:30:07.638744 listener.go:98] This curator is a leader.
root.role.controlplane.launcher.curator.listener I1121 13:30:07.638835 listener.go:142] Leader running until leadership lost.
root.role.controlplane.launcher.curator W1121 13:30:09.115183 curator.go:222] Session done, quitting.
root.role.controlplane.launcher.curator I1121 13:30:10.410131 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator I1121 13:30:10.410112 curator.go:354] Curator election round done: curator session done
root.role.controlplane.launcher.curator I1121 13:30:10.410173 curator.go:355] Curator election restarting...
root.role.controlplane.launcher.curator I1121 13:30:10.410262 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator.listener I1121 13:30:11.166251 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator I1121 13:30:12.160140 curator.go:160] Curator established lease, ID: 1828052370563104016
root.role.controlplane.launcher.curator.listener I1121 13:30:12.224301 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator I1121 13:30:12.996170 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator.listener I1121 13:30:13.715342 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:30:13.728632 listener.go:114] This curator is a follower (leader is "metropolis-ff649311efc1a2d63f2ff3f0db260f40"), starting minimal implementation.
root.role.controlplane.launcher.curator.listener I1121 13:30:13.728688 listener.go:153] Follower running until leadership change.
root.role.controlplane.launcher.curator I1121 13:30:13.932864 curator.go:206] Curator became leader.
root.role.controlplane.launcher.curator I1121 13:30:13.932896 curator.go:304] Election watcher: this node's curator is leader (lock key "/leader/195e8bf212397d10", rev 291685119)
root.role.controlplane.launcher.curator.listener I1121 13:30:14.561766 listener.go:59] Waiting for election status...
root.role.controlplane.launcher.curator.listener I1121 13:30:14.563159 listener.go:98] This curator is a leader.
root.role.controlplane.launcher.curator.listener I1121 13:30:14.563209 listener.go:142] Leader running until leadership lost.

Something's very broken here.
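
For readers without the code at hand: the pattern in this log ("lease TTL of 10 seconds", "Session done, quitting.", "Curator election restarting...") is what an etcd concurrency-based election loop produces when its session keeps dying. A rough sketch of one such round, using the real go.etcd.io/etcd/client/v3/concurrency API (electionRound and nodeID are illustrative names):

// Each round creates a session (lease) with a 10s TTL, campaigns on the
// "/leader" prefix, serves as leader, and restarts once the session dies.
func electionRound(ctx context.Context, cl *clientv3.Client, nodeID string) error {
	session, err := concurrency.NewSession(cl, concurrency.WithTTL(10))
	if err != nil {
		return err
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/leader")
	if err := election.Campaign(ctx, nodeID); err != nil {
		return err
	}
	// We are leader; serve until the lease expires or the context ends.
	select {
	case <-session.Done():
		return errors.New("curator session done")
	case <-ctx.Done():
		return ctx.Err()
	}
}

If the 10-second lease cannot be kept alive (etcd slow or restarting, for example), every round wins, expires and restarts, which is exactly the churn above.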

q3k commented 8 months ago

That particular flap has cleared up by now, so it might've been a red herring (perhaps etcd was restarting for some reason?). Now we're back to the previous 'lost leadership' loop.

q3k commented 8 months ago

https://review.monogon.dev/c/monogon/+/2341 should bring us one step closer to being able to debug this.
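
Judging by the log excerpt further down, the enhanced logging reports both the revision the lock key should be at and where it actually is. A sketch of how such a diagnostic can be produced, reusing the names from the sketch earlier in this thread (the Else branch fetches the lock key in the same transaction, so the report is atomic with the rejection):

// On rejection, read the lock key in the Else branch and report its actual
// creation revision, or note that the key is gone entirely.
resp, err := cl.Txn(ctx).
	If(clientv3.Compare(clientv3.CreateRevision(lockKey), "=", lockRev)).
	Then(ops...).
	Else(clientv3.OpGet(lockKey)).
	Commit()
if err != nil {
	return err
}
if !resp.Succeeded {
	actual := "no revision"
	if kvs := resp.Responses[0].GetResponseRange().Kvs; len(kvs) > 0 {
		actual = fmt.Sprintf("rev %d", kvs[0].CreateRevision)
	}
	return fmt.Errorf("lost leadership (key %s should've been at rev %d, is at %s)", lockKey, lockRev, actual)
}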

q3k commented 8 months ago

Looking at a state dump from this machine, the etcd keyspace looks fine at a glance: no leftover leadership keys, etc.

q3k commented 8 months ago

The node was rebooted once again, and this time it didn't get into the 'lost leadership' loop:

root.role.controlplane.launcher.curator I1121 14:22:40.147324 curator.go:311] Waiting for consensus...
root.role.controlplane.launcher.curator I1121 14:22:43.068568 curator.go:318] Got consensus, starting up...
root.role.controlplane.launcher.curator I1121 14:22:43.068620 curator.go:343] Curator starting on prefix "/leader" with lease TTL of 10 seconds...
root.role.controlplane.launcher.curator I1121 14:22:43.069849 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator I1121 14:22:43.072259 curator.go:262] Detecting our own stale lock, attempting to remove...
root.role.controlplane.launcher.curator I1121 14:22:43.072834 curator.go:284] Cleanup successful
root.role.controlplane.launcher.curator I1121 14:22:43.076091 curator.go:160] Curator established lease, ID: 1828052371382254860
root.role.controlplane.launcher.curator I1121 14:22:43.083734 curator.go:206] Curator became leader.
root.role.controlplane.launcher.curator I1121 14:22:43.083784 curator.go:304] Election watcher: this node's curator is leader (lock key "/leader/195e8bf2430cbd0c", rev 292144251)

All RPCs are now being served correctly, leadership-gated transactions are going through, and most nodes are reported as HEALTHY.

lorenz commented 6 months ago

After updating a consensus node we hit this again, this time with the enhanced logging introduced in https://review.monogon.dev/c/monogon/+/2341:

root.role.controlplane.launcher.curator.listener.rpc I0124 13:12:37.143673 impl_leader.go:119] Span 17ad4ab0d21c92a4: txnAsLeader(...): rejected (lost leadership (key /leader/195e8d381e8d0612 should've been at rev 1326801110, is at rev no revision?)

It looks like the key got lost. No unusual behavior is visible in the etcd/consensus logs.
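
For completeness, one way to check for that from the outside (assuming direct etcdctl v3 access to the consensus members, and that the election prefix is the "/leader" seen in the curator logs; the real keyspace may be namespaced differently):

# List current election lock keys under the election prefix; with a healthy
# leader, each campaigning node should have exactly one key here.
etcdctl get --prefix --keys-only /leader

# Check whether the leases backing those keys are still alive.
etcdctl lease list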

lorenz commented 6 months ago

root.role.controlplane.launcher.curator I0123 20:59:00.205738 curator.go:311] Waiting for consensus...
root.role.controlplane.launcher.curator I0123 20:59:03.395086 curator.go:318] Got consensus, starting up...
root.role.controlplane.launcher.curator I0123 20:59:03.395142 curator.go:343] Curator starting on prefix "/leader" with lease TTL of 10 seconds...
root.role.controlplane.launcher.curator I0123 20:59:03.395251 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator I0123 20:59:03.396302 curator.go:262] Detecting our own stale lock, attempting to remove...
root.role.controlplane.launcher.curator I0123 20:59:03.397229 curator.go:284] Cleanup successful
root.role.controlplane.launcher.curator I0123 20:59:03.397611 curator.go:160] Curator established lease, ID: 1828053770929243666
root.role.controlplane.launcher.curator I0123 20:59:03.398242 curator.go:206] Curator became leader.
root.role.controlplane.launcher.curator I0123 20:59:03.398259 curator.go:304] Election watcher: this node's curator is leader (lock key "/leader/195e8d381e8d0612", rev 1326801110)
root.role.controlplane.launcher.curator W0123 20:59:30.524476 curator.go:222] Session done, quitting.
root.role.controlplane.launcher.curator W0123 20:59:33.945168 curator.go:156] Failed to close session: etcdserver: requested lease not found
root.role.controlplane.launcher.curator I0123 20:59:33.945186 curator.go:354] Curator election round done: curator session done
root.role.controlplane.launcher.curator I0123 20:59:33.945195 curator.go:355] Curator election restarting...
root.role.controlplane.launcher.curator I0123 20:59:33.945379 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator I0123 20:59:36.678884 curator.go:160] Curator established lease, ID: 1828053770929245320
root.role.controlplane.launcher.curator I0123 20:59:36.895395 curator.go:306] Election watcher: this node's curator is a follower
root.role.controlplane.launcher.curator I0123 20:59:36.896690 curator.go:206] Curator became leader.
root.role.controlplane.launcher.curator I0123 20:59:36.896744 curator.go:304] Election watcher: this node's curator is leader (lock key "/leader/195e8d381e8d0c88", rev 1326802054)

It looks like we are checking against the wrong lease: the assertion still references the first lease, which has not been current for multiple hours.
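
A hypothetical illustration of that failure mode (illustrative names only; the point is that the key/revision assertion gets captured once and is never refreshed when a later election round wins under a new lease):

// Leadership credentials captured at the *first* election win:
type leadership struct {
	lockKey string // e.g. "/leader/195e8d381e8d0612"
	lockRev int64  // e.g. 1326801110
}

var current leadership // set once when the curator first becomes leader

// After the session dies and a new round wins with a new lock key
// ("/leader/195e8d381e8d0c88", rev 1326802054), anything still holding the
// old `current` value keeps asserting against the dead key. Every
// txnAsLeader compare then fails ("is at rev no revision"), producing the
// endless 'lost leadership' loop until the assertion tracks the lease that
// is actually current.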

leoluk commented 6 months ago

As evidenced by:

$ metroctl node describe --endpoints [redacted] | grep HEALTHY | wc -l 
358
$ kubectl get nodes | grep -w Ready | wc -l 
989

lorenz commented 5 months ago

A fix was merged in https://review.monogon.dev/c/monogon/+/2722; the deployment no longer shows the issue.