monogon-dev / monogon

The Monogon Monorepo. May contain traces of peanuts and a ✨pure Go Linux userland✨. Work in progress!
https://monogon.tech
Apache License 2.0
378 stars 8 forks source link

node: version status inaccurate #295

Closed lorenz closed 3 months ago

lorenz commented 3 months ago

Under some circumstances the node version reported in the node status is inaccurate (i.e. doesn't get updated to the current version). This occurs on approximately 1% of nodes, but it might be correlated with control plane downtime during the update.

lorenz commented 3 months ago

Possibly also caused by the control plane being updated very late, meaning the fix for the data race was not in place for a large portion of the update run. In that case this is already fixed.

lorenz commented 3 months ago

Looks like a logic bug in workerStatusPushLoop. The log from an affected node contains the following:

root.role.statuspush             I0327 16:30:26.141343 worker_statuspush.go:68] Got new NodeID: metropolis-628db1d67b1380bde8c6d5577109df92
root.role.statuspush             I0327 16:30:26.141354 worker_statuspush.go:74] Got curator connection.
root.role.statuspush             I0327 16:30:26.457418 worker_statuspush.go:60] Got external address: [redacted]
root.role.statuspush             I0327 16:30:26.458316 worker_statuspush.go:95] Submitting status: "external_address:\"[redacted]\" version:{release:{minor:1} git_information:{commit_hash:\"59f8d0a0\" commits_since_release:547 build_tree_state:BUILD_TREE_STATE_CLEAN}}"
supervisor                       E0327 16:30:33.710292 supervisor_processor.go:303] root.role.statuspush: UpdateNodeStatus failed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp [redacted]:7835: connect: connection timed out"
supervisor                       I0327 16:30:33.711644 supervisor_processor.go:509] rescheduling supervised node root.role.statuspush with backoff 688.407171ms
root.role.statuspush             I0327 16:30:34.400391 worker_statuspush.go:60] Got external address: [redacted]
root.role.statuspush             I0327 16:30:34.401991 worker_statuspush.go:68] Got new NodeID: metropolis-628db1d67b1380bde8c6d5577109df92
root.role.statuspush             I0327 16:30:34.401998 worker_statuspush.go:74] Got curator connection.
root.role.statuspush             I0327 16:30:34.402068 worker_statuspush.go:95] Submitting status: "external_address:\"[redacted]\" version:{release:{minor:1} git_information:{commit_hash:\"59f8d0a0\" commits_since_release:547 build_tree_state:BUILD_TREE_STATE_CLEAN}}"
supervisor                       E0327 16:30:34.402093 supervisor_processor.go:303] root.role.statuspush: UpdateNodeStatus failed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp [redacted]:7835: connect: connection timed out"
supervisor                       I0327 16:30:34.407908 supervisor_processor.go:509] rescheduling supervised node root.role.statuspush with backoff 748.815372ms
root.role.statuspush             I0327 16:30:35.158219 worker_statuspush.go:68] Got new NodeID: metropolis-628db1d67b1380bde8c6d5577109df92
root.role.statuspush             I0327 16:30:35.158228 worker_statuspush.go:74] Got curator connection.
root.role.statuspush             I0327 16:30:35.158276 worker_statuspush.go:60] Got external address: [redacted]
root.role.statuspush             I0327 16:30:35.158331 worker_statuspush.go:95] Submitting status: "external_address:\"[redacted]\" version:{release:{minor:1} git_information:{commit_hash:\"59f8d0a0\" commits_since_release:547 build_tree_state:BUILD_TREE_STATE_CLEAN}}"
supervisor                       E0327 16:30:35.158355 supervisor_processor.go:303] root.role.statuspush: UpdateNodeStatus failed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp [redacted]:7835: connect: connection timed out"

So it looks like what is happening is that an update is triggered, the UpdateNodeStatus RPC fails because the control plane is unavailable and then the runnable just exits, dropping the status update on the floor as after workerStatusPushLoop waits for a new status change to happen instead of continuing to try to send the current status.

/cc @q3k

jscissr commented 3 months ago

To me, this does not look like a bug in workerStatusPush. If the update RPC fails, the entire runnable exits, including all value watchers. After restarting, it then attempts the update RPC again once it has got all the initial values. I also see this in the logs you pasted.

q3k commented 3 months ago

The bug seems to be that the runnable doesn't get restarted because one of it's children is stuck:

q3k@sizeableunit ~ $ metroctl node logs metropolis-628db1d67b1380bde8c6d5577109df92 --endpoints [redacted] --dn supervisor --backlog 10
=== Logs from metropolis-628db1d67b1380bde8c6d5577109df92 ([redacted]):
supervisor                       W0327 17:31:53.895200 supervisor_processor.go:437] Not restarting root.role.statuspush: children not ready to be restarted: [root.role.statuspush.pipe-local-control-plane]
supervisor                       W0327 17:31:54.049033 supervisor_processor.go:437] Not restarting root.role.statuspush: children not ready to be restarted: [root.role.statuspush.pipe-local-control-plane]
supervisor                       W0327 17:31:54.049137 supervisor_processor.go:437] Not restarting root.role.statuspush: children not ready to be restarted: [root.role.statuspush.pipe-local-control-plane]
supervisor                       W0327 17:31:54.049263 supervisor_processor.go:437] Not restarting root.role.statuspush: children not ready to be restarted: [root.role.statuspush.pipe-local-control-plane]
supervisor                       W0327 17:31:54.954099 supervisor_processor.go:437] Not restarting root.role.statuspush: children not ready to be restarted: [root.role.statuspush.pipe-local-control-plane]
supervisor                       W0327 17:31:54.954145 supervisor_processor.go:437] Not restarting root.role.statuspush: children not ready to be restarted: [root.role.statuspush.pipe-local-control-plane]
supervisor                       W0327 17:31:54.954238 supervisor_processor.go:437] Not restarting root.role.statuspush: children not ready to be restarted: [root.role.statuspush.pipe-local-control-plane]
supervisor                       W0327 17:31:55.314450 supervisor_processor.go:437] Not restarting root.role.statuspush: children not ready to be restarted: [root.role.statuspush.pipe-local-control-plane]
supervisor                       W0327 17:31:55.314531 supervisor_processor.go:437] Not restarting root.role.statuspush: children not ready to be restarted: [root.role.statuspush.pipe-local-control-plane]
supervisor                       W0327 17:31:55.314545 supervisor_processor.go:437] Not restarting root.role.statuspush: children not ready to be restarted: [root.role.statuspush.pipe-local-control-plane]
q3k commented 3 months ago

This is likely the cause: https://review.monogon.dev/c/monogon/+/2901

lorenz commented 3 months ago

Most likely fixed, not observed on a rollout with the above CL.