Open ErikLundJensen opened 1 week ago
Please provide some logs so we can understand how the partition disappears in your case.
All we have is a screenshot from the console, as these logs never reach our centralized log server. When /var is unavailable, a lot breaks.
We also saw I/O errors (timeouts) in the console but did not capture them.
So this is quite expected, and it has nothing to do with mounting (at least until there are enough logs to prove the opposite).
The partition is mounted, but as it's a network disk, any operation can fail if the network is unreliable. Talos works without issues on, for example, AWS EBS volumes, so the network volume should be made reliable enough first.
But why did the partition not show up again after network connectivity was re-established?
I don't know. There are zero logs showing partitions being unmounted (and they shouldn't be).
You can grab the kernel logs with talosctl dmesg and inspect them yourself to see if the partition is unmounted in any way.
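For anyone who wants to narrow down a large kernel log, here is a minimal Go sketch that filters the output of talosctl dmesg read from stdin. The function name filterKernelLog and the keyword list are assumptions chosen for illustration, not part of Talos, and the keywords may need adjusting to match what the kernel actually prints on your system.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// filterKernelLog returns the lines that contain at least one of the
// keywords, compared case-insensitively. Hypothetical helper, not a Talos API.
func filterKernelLog(lines []string, keywords []string) []string {
	var matches []string
	for _, line := range lines {
		lower := strings.ToLower(line)
		for _, kw := range keywords {
			if strings.Contains(lower, strings.ToLower(kw)) {
				matches = append(matches, line)
				break // one matching keyword is enough for this line
			}
		}
	}
	return matches
}

func main() {
	// Read the kernel log from stdin, line by line.
	var lines []string
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading input:", err)
		os.Exit(1)
	}

	// Assumed keywords of interest; adjust as needed.
	keywords := []string{"xfs", "i/o error", "unmount", "timeout"}
	for _, line := range filterKernelLog(lines, keywords) {
		fmt.Println(line)
	}
}
```

It can be used by piping the output of talosctl dmesg into the program, for example with go run.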
I'll try to see if I can recreate it in a lab environment and then get the logs.
Bug Report
When the Talos OS disk is provided from network storage and that storage is unavailable for more than 5 seconds, partitions disappear.
For example, the /var partition disappeared on the node. The partition was available again after a reboot.
Description
It could be related to the hardcoded timeout of 5 seconds in the mount.go :
It is not clear if other timeouts can cause the partition to disappear as well. If the mount function runs in a reconciliation loop then it is probably the right place to fix the issue.
An alternative could be to look into the general configuration of the XFS filesystem to handle errors using the max_retries, retry_timeout_seconds, and action XFS mount options.
Logs
Disk I/O timeouts are seen in logs.
Environment