siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0
6.42k stars, 512 forks

NFS client fails to disconnect - node becomes unresponsive #8376

Open Daxcor69 opened 6 months ago

Daxcor69 commented 6 months ago

Bug Report

Description

Problem: Nodes show iowait of 25-45% with context switches in the 30K range. Customers report painfully slow asset loading.

Symptoms: When a StatefulSet backed by a volume from an external NFS server is deleted, the pod remains in the Terminating state. This is NOT using the NFS provisioner (PV/PVC). The only way to remove the pod is kubectl delete pods podname-0 --force --grace-period=0, after which the pod does get removed.

spec:
  volumes:
    - name: data
      nfs:
        server: nfs1.storage.server.com
        path: /home/pete

During a node reboot these "stuck" processes are listed as "unable to terminate", but the node does eventually reboot. After the reboot, the iowait and context switching go away.

Theory: Even though the pod is removed from Kubernetes, the Linux process on the node is never fully terminated. It stays blocked as if it were still waiting on data from the NFS mount, like a really, really big file that never finishes loading. The more of these "zombie" processes there are, the higher the iowait on the node becomes.
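One way to check this theory is to look for processes in uninterruptible sleep (state D), which is where tasks blocked on an unresponsive hard NFS mount typically sit and which is counted toward iowait. Below is a minimal sketch, not an official Talos tool. It assumes Python is available somewhere that can see the node's /proc (for example a privileged debug pod with hostPID enabled), and the function names are made up for illustration.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited (or was unreadable) during the scan
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling `uninterruptible_processes()` a few times while a pod is stuck in Terminating shows whether the same PIDs stay in state D. A process that is briefly in D during normal I/O is harmless; ones that never leave it are the candidates for the "stuck" processes described above.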

Environment

smira commented 6 months ago

I wouldn't recommend using NFS today, as it was designed for a totally different use case.

That said, issues are not expected as long as the NFS server stays responsive. Once the NFS server becomes unresponsive, things go wrong with NFS clients, which can be partially mitigated with NFS mount options.
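Before changing mount options, it helps to see what the node is actually using. The sketch below parses /proc/mounts-style text and reports, for each NFS mount, the negotiated version and the hard/soft, timeo and retrans settings. It is only an illustration: the function name and the sample line are hypothetical, and it assumes the options appear in the usual comma-separated form.

```python
def nfs_mounts(mounts_text):
    """Parse /proc/mounts-style text and return NFS entries with key options."""
    results = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: source, target, fstype, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for item in fields[3].split(","):
            key, _, value = item.partition("=")
            opts[key] = value
        # Without "soft" or "softerr", NFS mounts are hard: I/O retries
        # indefinitely while the server is unreachable.
        soft = "soft" in opts or "softerr" in opts
        results.append({
            "source": fields[0],
            "target": fields[1],
            "vers": opts.get("vers"),
            "mode": "soft" if soft else "hard",
            "timeo": opts.get("timeo"),
            "retrans": opts.get("retrans"),
        })
    return results


# Hypothetical sample line, shaped like an entry for the volume in this issue.
SAMPLE = (
    "nfs1.storage.server.com:/home/pete /mnt/data nfs "
    "rw,relatime,vers=3,hard,proto=tcp,timeo=600,retrans=2 0 0\n"
    "tmpfs /run tmpfs rw,nosuid 0 0\n"
)
```

On a real node the input would be the contents of /proc/mounts, for example `nfs_mounts(open("/proc/mounts").read())`. As far as I know, the in-line `nfs:` volume type only takes `server`, `path` and `readOnly`, so changing these options generally means switching to a PersistentVolume with `mountOptions`.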

I'm not quite sure what in this issue can be attributed to Talos Linux, or to anything missing in Talos Linux itself. NFS is implemented in the kernel, so there's not much we can do on the OS side beyond the things you can configure yourself.

Daxcor69 commented 6 months ago

When I asked about this in Discord, I got the following message: "there's a problem with NFSv4 due to missing statsd if I remember correctly". Does this mean NFSv4 is not supported in Talos?

So I am just trying to sort it out. I know NFS is not ideal, I get that. Prior to migrating to Talos, NFSv4 worked without the issue I am having now. So is this an issue with NFSv3?

smira commented 6 months ago

NFSv4 user-space daemons are not enabled, but I believe it won't mount simply with v4.