It's something I've been thinking about a bit these past few days actually. Sadly, it's also not something we can do too much about.
I see a few things that we should do to improve things a bit though:
The ICMP check should help with false positives, and the last item will make it easier for someone to implement a STONITH-type mechanism around it. Basically, you'd have a small daemon running on a management system, monitoring lifecycle events coming from your cluster. When a server is marked as defective and auto-healing is triggered, that daemon connects to the BMC or PDU and cuts power to the dead server.
It's essentially the only way to handle this, as even a partly disconnected server will not be able to kill off the running containers/VMs without either causing immediate writes (if storage is somehow still available) or hanging until storage is available again and then causing writes at that point.
The only way to prevent any concurrent writes, and to limit the damage to data that wasn't yet written, is to cut power to the machine.
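To sketch that last item concretely: everything below is a hypothetical illustration, not an existing tool. The lifecycle action name, the BMC naming scheme, and the credentials are assumptions, and it presumes incus monitor's JSON output emits one event per line (jq and ipmitool required):

#!/bin/sh
# Hypothetical STONITH-style watcher running on a management system.
incus monitor --type=lifecycle --format=json | while read -r event; do
    action=$(printf '%s' "$event" | jq -r '.metadata.action')
    member=$(printf '%s' "$event" | jq -r '.metadata.name')
    # "cluster-member-healed" is a placeholder; use whatever lifecycle
    # action Incus actually emits when auto-healing evacuates a member.
    if [ "$action" = "cluster-member-healed" ]; then
        bmc="${member}-bmc.mgmt.example.com"   # site-specific BMC lookup
        ipmitool -I lanplus -H "$bmc" -U admin -P "$IPMI_PASSWORD" chassis power off
    fi
done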
Another possible problem is IP conflicts when containers/VMs use static IP addressing and they keep running (invisibly) on the original Incus member after automatic evacuation has taken place.
As for filesystem corruption: it may be useful for users to know this can be prevented by using local storage and instance backups instead (backups should always be made regardless), although this of course prevents the use of automatic evacuation entirely. It's a trade-off, and the choice will have to be evaluated per cluster environment.
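For what it's worth, a simple periodic export already covers the backup half of that trade-off; the instance name and target path below are just examples:

# Export a full backup of the instance to a tarball (run from any member):
incus export debian0 /backups/debian0-$(date +%F).tar.gz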
Required information
incus info
Issue description
Filesystem corruption occurs in the following case: if an Incus cluster member becomes unreachable (e.g., due to a network partition), automatic evacuation may happen (if enabled). However, once that cluster member becomes connected to the other members again, filesystem corruption can occur on any container that was running on it. While those containers have indeed been migrated thanks to the automatic evacuation, they are also still running (invisibly) on the member that became unreachable. That's one of the causes of the filesystem corruption.
Some filesystem corruption may be inevitable, simply because the containers were running and then became disconnected from the network; that's acceptable. However, further filesystem corruption can occur because the containers actually keep running (invisibly) on the Incus member. Further details below.
Steps to reproduce
This is an example I ran in QEMU/KVM.
The context: an Incus cluster with 3 nodes and also a Ceph cluster on the same 3 nodes.
The cluster members are called:
incus-n1, incus-n2, incus-n3
Steps to trigger filesystem corruption:
1. Automatic evacuation is enabled: incus config set cluster.healing_threshold 30
2. Disconnect incus-n1 from the network.
3. Wait until automatic evacuation happens.
4. Make some changes on the automatically migrated container, such as writing something to /root/.bashrc. In my case, after doing that, I ran sync to flush the I/O buffers to disk.
5. Connect incus-n1 to the network again.
6. Even though incus-n1 agrees about the cluster's state (that debian0 now really runs on incus-n2), there is still an invisible container running on incus-n1: the previous debian0 container from before the automatic migration took place.
7. In the Incus cluster, stop the real debian0 container, the one that runs on incus-n2. Issuing that stop command indeed stops the container on incus-n2, as expected, and afterwards Incus reports the container's state as STOPPED. However, on incus-n1 that container is still running (visible via ps auxfww, as shown in the sketch after this list), even though Incus doesn't report it. I don't expect Incus to see the container, since it has already been migrated. But this situation can cause filesystem corruption if we are using distributed storage like Ceph. (If you're lucky, it might not cause corruption, but it seems more likely that it will.)
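For reference, this is roughly how the leftover process can be spotted on incus-n1; the "lxc monitor" pattern reflects how LXC names its monitor process and may look slightly different on other setups:

# On incus-n1, after evacuation, look for the container's monitor process.
# The [l] trick stops grep from matching itself.
ps auxfww | grep '[l]xc monitor'
# Any surviving line mentioning debian0 is the "invisible" container.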
In the output of ps auxfww we can see that the container is indeed still running; this is what I referred to earlier as the invisible container. Unfortunately, this can cause filesystem corruption, since we use Ceph: the Ceph RBD was still mounted on incus-n1, and the invisible container wrote data to it later on, when incus-n1 got reconnected to the network.
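One hedged way to confirm the stale mapping from the Ceph side is rbd status, which lists the image's watchers; the pool and volume names here are assumptions (Incus typically names container volumes container_<name>):

# A watcher entry with incus-n1's address means the image is still
# mapped/open on that node:
rbd status incus/container_debian0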
Running e2fsck reveals the container filesystem corruption. First, I made sure to reboot the Incus member on which the invisible container was running, and I made sure the container wasn't running anywhere. Then I mapped the Ceph RBD and ran e2fsck, roughly as sketched below.
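The mapping and check would look roughly like this; the pool/volume names and the /dev/rbd0 device path are assumptions (rbd map prints the actual device), and the fsck output from the original report is not reproduced here:

# Run only while the container is stopped everywhere:
rbd map incus/container_debian0
e2fsck -fn /dev/rbd0   # -f: force check, -n: report only, change nothing
rbd unmap /dev/rbd0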
Although I think this is worth reporting, I'm not sure what Incus can do about it. Note that the above is only a problem if we are using remote storage; if the container had been using local storage, there wouldn't have been any filesystem corruption. For Incus, that means it only has to handle this case when it's part of a cluster and the container is using remote storage like Ceph.
What could the incus-n1 node have done? That's the node we disconnected from the network to simulate a real-world problem. To make matters even more complicated, situations can arise where one or more Incus nodes have lost contact with each other but the Ceph nodes haven't, for whatever reason (for example, because the Ceph nodes are located elsewhere instead of running on the same Incus nodes). From the standpoint of container/VM availability, it would be desirable if containers/VMs kept running. But that might mean that Incus can't ever handle this situation in a way that prevents filesystem corruption if automatic evacuation is enabled.
So I guess this boils down to the CAP theorem and the CP vs. AP choice. I'd like to avoid filesystem corruption and hence favor consistency.
If Incus wants to handle this, it might mean that the incus-n1 node has to decide that it is the problematic node (even though it can't know for sure) and hope that the other nodes are still in quorum. Then, the incus-n1 node could decide to forcefully stop the container, thereby preventing future writes in case incus-n1 re-establishes contact with the other nodes. This would prevent the above-mentioned filesystem corruption.
To emphasize: some filesystem corruption is understandable, since containers that are running suddenly become disconnected from their remote storage provider (Ceph). However, further container filesystem corruption occurs once the member is connected to the network again, because the container keeps running (invisibly) on that Incus member after the container has been migrated by automatic evacuation.
Just thinking out loud here. I don't have a solution, per se.
One question is how to detect the situation. The other question is how to handle it.
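One hypothetical shape for both halves is a self-fencing watchdog on each member: detect isolation by probing the peers, and handle it with a hard reset so no further writes can reach Ceph. The peer list, timings, and reset action below are all assumptions, and the hard reset is deliberately crude, since anything gentler risks flushing buffered writes:

#!/bin/sh
# Hypothetical self-fencing watchdog, run on every cluster member.
# Requires kernel.sysrq to permit the 'b' (reboot) trigger.
PEERS="incus-n2 incus-n3"   # the other members, from this reproduction
THRESHOLD=30                # seconds of total isolation before fencing
down=0
while sleep 5; do
    reachable=0
    for peer in $PEERS; do
        ping -c 1 -W 1 "$peer" >/dev/null 2>&1 && reachable=1
    done
    if [ "$reachable" -eq 1 ]; then
        down=0
    else
        down=$((down + 5))
        if [ "$down" -ge "$THRESHOLD" ]; then
            # We appear to be the partitioned node: reset immediately,
            # without syncing, so no stale instance can write to Ceph
            # once the partition heals.
            echo b > /proc/sysrq-trigger
        fi
    fi
done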
Information to attach
- Any relevant kernel output (dmesg)
- Container log (incus info NAME --show-log)
- Container configuration (incus config show NAME --expanded)
- Output of the daemon with --debug (alternative to running incus monitor --pretty while reproducing the issue)