hostsfile: logic error causing lost cluster directory

lorenz commented 1 year ago

The main update loop has a changed variable, which is set to true when either the local address or the cluster data changes. The problem is that if there is no curator, or the curator has not been contacted yet, the nodes variable in the runnable contains no non-local nodes. If the local address is then updated, the hostsfile service writes a cluster directory containing only the local node to disk, rendering the node unbootable without intervention.
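
To make the failure mode concrete, here is a minimal sketch in Go of one possible guard. It is not the actual hostsfile code: the state type, the clusterKnown flag, and every function name below are hypothetical and exist only for illustration. The idea is to persist the cluster directory only after at least one update has arrived from the curator, so that a local address change on its own can never replace a complete directory with a local-only one.

```go
package main

import (
	"fmt"
	"sort"
)

// state is a hypothetical stand-in for what the hostsfile runnable tracks.
type state struct {
	local        string            // ID of the local node
	nodes        map[string]string // node ID -> address, including the local node
	clusterKnown bool              // true once the curator has delivered any update
}

// applyLocalAddress records a new local address and reports a change.
func (s *state) applyLocalAddress(addr string) bool {
	if s.nodes[s.local] == addr {
		return false
	}
	s.nodes[s.local] = addr
	return true
}

// applyClusterUpdate merges node data received from the curator.
func (s *state) applyClusterUpdate(update map[string]string) bool {
	changed := !s.clusterKnown
	s.clusterKnown = true
	for id, addr := range update {
		if s.nodes[id] != addr {
			s.nodes[id] = addr
			changed = true
		}
	}
	return changed
}

// shouldWriteDirectory is the guard: a change alone is not sufficient,
// the cluster view must also have been populated by the curator.
func (s *state) shouldWriteDirectory(changed bool) bool {
	return changed && s.clusterKnown
}

// directory returns the node IDs that would be persisted, sorted.
func (s *state) directory() []string {
	ids := make([]string, 0, len(s.nodes))
	for id := range s.nodes {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	s := &state{local: "node-a", nodes: map[string]string{}}

	// Startup: only the local address is known, so nothing is persisted.
	changed := s.applyLocalAddress("10.0.0.1")
	fmt.Println("write after local update:", s.shouldWriteDirectory(changed))

	// After the curator responds, the full directory may be written.
	changed = s.applyClusterUpdate(map[string]string{"node-b": "10.0.0.2"})
	fmt.Println("write after cluster update:", s.shouldWriteDirectory(changed), s.directory())
}
```

Whether /etc/hosts should be gated the same way, or whether data already on disk should be merged in instead, is a separate design decision that this sketch does not address.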

q3k commented 1 year ago

So this happens in the following scenario, right?

1. hostsfile is running, has both local node and cluster data
2. hostsfile writes node/cluster data into /etc/hosts and the cluster directory
3. hostsfile restarts (or the whole node restarts)
4. hostsfile starts up and only has the local node
5. hostsfile writes node data into /etc/hosts and the cluster directory, which now contains only the local node
6. effectively we lose the cluster directory and /etc/hosts data, and likely leave the node unable to ever connect to the cluster

I.e. the bug is that hostsfile might overwrite a perfectly valid /etc/hosts and cluster directory (CD) on startup without taking into account whatever was already there, and that might leave the entire node unable to connect to the cluster ever again?

lorenz commented 1 year ago

Yes, that's my understanding.

monogon-dev/monogon issue #228