monogon-dev / monogon

The Monogon Monorepo. May contain traces of peanuts and a ✨pure Go Linux userland✨. Work in progress!
https://monogon.tech
Apache License 2.0

hostsfile: logic error causing lost cluster directory #228

Closed by lorenz 1 year ago

lorenz commented 1 year ago

The main update loop has a changed variable, which is set to true if either a local address change or a cluster change happened. The problem is that if there is no curator, or the curator has not been contacted yet, the nodes variable in the runnable does not contain any non-local nodes. Thus, if the local address is updated, the hostsfile service writes a cluster directory containing only the local node to disk, rendering the node unbootable without intervention.
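
To make the flag problem concrete, here is a minimal sketch of the decision logic, not the actual hostsfile code. The names update, buggyWritesDirectory, guardedWritesDirectory and clusterSeen are all hypothetical, and the guard shown is only one possible approach, not necessarily the fix that was applied.

```go
package main

import "fmt"

// update is a hypothetical event observed by the update loop.
type update struct {
	localChanged   bool // the local address changed
	clusterChanged bool // the curator delivered new cluster data
}

// buggyWritesDirectory models the behaviour described above: a single
// shared flag means any change causes the cluster directory to be written.
func buggyWritesDirectory(u update) bool {
	changed := u.localChanged || u.clusterChanged
	return changed
}

// guardedWritesDirectory is one possible guard (an assumption): only write
// the cluster directory when cluster data itself changed and the curator
// has been heard from at least once.
func guardedWritesDirectory(u update, clusterSeen bool) bool {
	return u.clusterChanged && clusterSeen
}

func main() {
	// Local address changes before the curator was ever contacted.
	u := update{localChanged: true}
	fmt.Println("buggy writes directory:", buggyWritesDirectory(u))
	fmt.Println("guarded writes directory:", guardedWritesDirectory(u, false))
}
```

With a local-only change and no curator contact, the shared flag triggers a directory write while the guarded variant does not.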

q3k commented 1 year ago

So this happens in the following scenario, right?

  1. hostsfile is running, has both local node and cluster data
  2. hostsfile writes node/cluster data into /etc/hosts and the cluster directory
  3. hostsfile restarts (or whole node restarts)
  4. hostsfile starts up and only has local node
  5. hostsfile writes node data into /etc/hosts and the cluster directory which contains only the local node
  6. effectively we lose the cluster directory and /etc/hosts data, and likely cause the node to never be able to connect to the cluster

I.e. the bug is that hostsfile might overwrite a perfectly valid /etc/hosts and cluster directory on startup without taking into account whatever was already there, and that may leave the entire node unable to connect to the cluster ever again?

lorenz commented 1 year ago

Yes, that's my understanding.