siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Partitions (backed by network storage) disappear if network is unavailable for more than 5 seconds #9706

Open ErikLundJensen opened 1 week ago

ErikLundJensen commented 1 week ago

Bug Report

When the Talos OS disk is provided from network storage and that storage is unavailable for more than 5 seconds, partitions disappear.

For example the /var partition disappeared at the node. The partition was available again after reboot.

Description

It could be related to the hardcoded 5-second timeout in mount.go:

func (p *Point) retry(f func() error, isUnmount bool, printerOptions PrinterOptions) error {
    return retry.Constant(5*time.Second, retry.WithUnits(50*time.Millisecond)).Retry(func() error {
        if err := f(); err != nil {
            switch err {
            case unix.EBUSY:
                return retry.ExpectedError(err)
            case unix.ENOENT, unix.ENXIO:
                // if udevd triggers BLKRRPART ioctl, partition device entry might disappear temporarily
                return retry.ExpectedError(err)

It is not clear if other timeouts can cause the partition to disappear as well. If the mount function runs in a reconciliation loop then it is probably the right place to fix the issue.
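To make the retry window concrete, here is a minimal stdlib-only sketch of a constant-interval retry with an overall time budget. It is not the Talos implementation and does not use the retry package from the snippet: retryConstant, failFor, errTransient and the scaled-down durations are all hypothetical names and values chosen for illustration (a 200ms budget stands in for the 5-second one).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Scaled-down, hypothetical durations so the sketch runs quickly.
const (
	budget      = 200 * time.Millisecond // stands in for the 5s timeout
	step        = 10 * time.Millisecond  // stands in for the 50ms unit
	shortOutage = 30 * time.Millisecond
	longOutage  = time.Second
)

// errTransient marks an error as worth retrying (a stand-in for the
// retry.ExpectedError wrapper in the quoted snippet).
var errTransient = errors.New("transient")

// retryConstant calls f every interval until it succeeds, returns a
// non-transient error, or the total budget would be exceeded.
func retryConstant(timeout, interval time.Duration, f func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := f()
		if err == nil || !errors.Is(err, errTransient) {
			return err
		}
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("gave up after %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// failFor returns an operation that fails transiently until the
// simulated outage is over.
func failFor(outage time.Duration) func() error {
	start := time.Now()
	return func() error {
		if time.Since(start) < outage {
			return fmt.Errorf("device not ready: %w", errTransient)
		}
		return nil
	}
}

func main() {
	fmt.Println("short outage:", retryConstant(budget, step, failFor(shortOutage)))
	fmt.Println("long outage: ", retryConstant(budget, step, failFor(longOutage)))
}
```

In this sketch the budget only bounds the single operation being retried: an outage longer than the budget makes that one call fail, and nothing retries it afterwards unless the caller runs it again.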

An alternative could be to look into the general error-handling configuration of the XFS filesystem, using the max_retries, retry_timeout_seconds and action XFS settings.
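For reference, to my knowledge the kernel exposes max_retries and retry_timeout_seconds as per-filesystem sysfs files under /sys/fs/xfs/<dev>/error/metadata/<class>/ rather than as mount-time flags. The sketch below only reads those two values. The device name sda6 and the helpers knobPath and readKnob are assumptions for illustration, and whether this path is reachable on a Talos node has not been checked here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// knobPath builds the path of one XFS error-handling file.
// root is normally "/sys/fs/xfs", dev is the block device name, and
// class is an error class directory such as "default" or "EIO".
func knobPath(root, dev, class, knob string) string {
	return filepath.Join(root, dev, "error", "metadata", class, knob)
}

// readKnob returns the trimmed contents of one error-handling file.
func readKnob(root, dev, class, knob string) (string, error) {
	b, err := os.ReadFile(knobPath(root, dev, class, knob))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	dev := "sda6" // hypothetical device name; pass the real one as an argument
	if len(os.Args) > 1 {
		dev = os.Args[1]
	}
	for _, knob := range []string{"max_retries", "retry_timeout_seconds"} {
		v, err := readKnob("/sys/fs/xfs", dev, "EIO", knob)
		if err != nil {
			fmt.Printf("%s: unavailable (%v)\n", knob, err)
			continue
		}
		fmt.Printf("%s = %s\n", knob, v)
	}
}
```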

Logs

Disk I/O timeouts are seen in logs.

Environment

smira commented 1 week ago

Please provide some logs so we can understand how the partition disappears in your case.

ErikLundJensen commented 1 week ago

Here is a screenshot from the console, as these logs never reach our centralized log server. When /var is unavailable, a lot breaks. [screenshot: no-var-folder]

We did see IO errors (timeouts) in the console as well but did not capture that.

smira commented 1 week ago

So this is quite expected, and it has nothing to do with mounting (at least until there are enough logs to prove otherwise).

The partition is mounted, but as it's a network disk, any operation would be broken if the network is unreliable. Talos works without issues e.g. on AWS/EBS volumes, so the network volume should be made reliable enough first.

ErikLundJensen commented 1 week ago

But why did the partition not show up again after the network connectivity was re-established?

smira commented 1 week ago

I don't know. There are zero logs about partitions being unmounted (and they shouldn't be).

You can grab kernel logs with talosctl dmesg and inspect them yourself to see if the partition is unmounted in any way.
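One way to narrow down a long kernel log is a small filter that reads it from standard input and keeps only lines that look storage-related. This is a rough sketch: the keyword list is an assumption about what might be relevant, not a list of messages Talos or the kernel is known to emit in this case.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// keywords are illustrative, lower-case substrings to look for; extend
// or replace them to match what actually shows up in your logs.
var keywords = []string{"i/o error", "xfs", "timeout", "timed out", "umount", "unmount"}

// interesting reports whether a log line contains any keyword,
// ignoring case.
func interesting(line string) bool {
	lower := strings.ToLower(line)
	for _, k := range keywords {
		if strings.Contains(lower, k) {
			return true
		}
	}
	return false
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if interesting(sc.Text()) {
			fmt.Println(sc.Text())
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
}
```

Saved as filter.go (a hypothetical file name), it could be used by piping the output of talosctl dmesg, with whatever node flags you normally pass, into go run filter.go.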

ErikLundJensen commented 1 week ago

I'll try to see if I can recreate it in a lab environment and then get the logs.