monogon-dev / monogon

The Monogon Monorepo. May contain traces of peanuts and a ✨pure Go Linux userland✨. Work in progress!
https://monogon.tech
Apache License 2.0
378 stars 8 forks source link

Curator breaks under high rates of node changes #260

Closed lorenz closed 4 months ago

lorenz commented 11 months ago
root.role.controlplane.launcher.curator.listener.rpc I0803 17:36:43.887279 impl_leader_curator.go:115] Span 1777f01f7b7cec9e: etcd watch failed (initial fetch): node ID mismatch (etcd key: "metropolis-f051eaf799695c2bc855d5a6e156b2d1", value: "metropolis-b4ec87d6e5d318e6cd667c9558565962")
q3k commented 10 months ago

Havent' been able to figure out a code path that could've caused this. We're quite defensive when saving nodes, and use a single node save function that ensures we save with the correct key for the node.

Obviously, the bug exists, but at this point I'm not sure where to look. Next time this occurs we should get a full disk copy so that we can do some forensics.

jscissr commented 4 months ago

A bug which could cause this was fixed in d6a88022d0c8ffdd2b938f161b8825148762b099.

q3k commented 4 months ago

Yeah, this was likely the bug:

  1. https://github.com/monogon-dev/monogon/blob/main/metropolis/node/core/curator/state_node.go#L223 Here, we instantiate an etcdPrefix as a package-level variable.
  2. https://github.com/monogon-dev/monogon/blob/main/metropolis/node/core/curator/state_node.go#L397 Here, we use it concurrently from nodeLoad/nodeSave calls.

This means that a race between node{Load,Save} calls (either same function, or different calls) could've lead to a data race and thus etcd corruption.

q3k commented 4 months ago

Optimistically closing this. Thanks, @jscissr !