On an OpenShift 4.13 cluster, the cifsd process causes CPU hangs

berendiwema commented 4 months ago

I'm not sure whether cifsd is part of this driver or is supplied by the host OS. I wasn't able to locate a cifsd binary on the file system of the affected hosts, and reading the source code did not make it clear to me whether cifsd is part of this driver.

I hope someone is familiar with issues like this and knows a way to mitigate it.

What happened: On several nodes within an OpenShift 4.13 cluster we see hanging CPUs (kernel soft lockups) that appear to be caused by cifsd.

What you expected to happen: cifsd does not cause hanging CPUs.

How to reproduce it: Difficult. It looks like network issues cause the share to hang or a lock to time out, but we haven't been able to pinpoint it.

Anything else we need to know?: System logs show:

[920285.608500] watchdog: BUG: soft lockup - CPU#1 stuck for 3703s! [.NET ThreadPool:2647689]
[920289.653493] watchdog: BUG: soft lockup - CPU#15 stuck for 3707s! [cifsd:17906]
[920301.643461] watchdog: BUG: soft lockup - CPU#12 stuck for 3595s! [cifsd:18471]
[920305.624468] watchdog: BUG: soft lockup - CPU#6 stuck for 2295s! [cifsd:18190]
[920313.608432] watchdog: BUG: soft lockup - CPU#1 stuck for 3729s! [.NET ThreadPool:2647689]
[920317.653421] watchdog: BUG: soft lockup - CPU#15 stuck for 3733s! [cifsd:17906]
[920322.866740] systemd[1]: Failed to start Journal Service.
[920329.643386] watchdog: BUG: soft lockup - CPU#12 stuck for 3621s! [cifsd:18471]
[920331.397393] rcu: INFO: rcu_preempt self-detected stall on CPU
[920331.402183] rcu:     15-....: (4019573 ticks this GP) idle=93d/1/0x4000000000000000 softirq=100792278/100814899 fqs=928269 
[920332.922398] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... 15-... } 4019285 jiffies s: 4880525 root: 0x8002/.
[920332.924796] rcu: blocking rcu_node structures (internal RCU debug):
[920333.624382] watchdog: BUG: soft lockup - CPU#6 stuck for 2321s! [cifsd:18190]
[920341.608365] watchdog: BUG: soft lockup - CPU#1 stuck for 3755s! [.NET ThreadPool:2647689]
[920357.643318] watchdog: BUG: soft lockup - CPU#12 stuck for 3647s! [cifsd:18471]
[920357.653318] watchdog: BUG: soft lockup - CPU#15 stuck for 3770s! [cifsd:17906]
[920361.624311] watchdog: BUG: soft lockup - CPU#6 stuck for 2347s! [cifsd:18190]
[920369.608292] watchdog: BUG: soft lockup - CPU#1 stuck for 3781s! [.NET ThreadPool:2647689]
[920385.643249] watchdog: BUG: soft lockup - CPU#12 stuck for 3673s! [cifsd:18471]
[920385.653254] watchdog: BUG: soft lockup - CPU#15 stuck for 3796s! [cifsd:17906]
[920389.624241] watchdog: BUG: soft lockup - CPU#6 stuck for 2373s! [cifsd:18190]
[920397.608226] watchdog: BUG: soft lockup - CPU#1 stuck for 3808s! [.NET ThreadPool:2647689]
[920398.458234] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... 15-... } 4084821 jiffies s: 4880525 root: 0x8002/.
[920398.460628] rcu: blocking rcu_node structures (internal RCU debug):
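
To make the log above easier to read, here is a minimal sketch that groups the soft lockup lines by CPU and task and keeps the longest reported stuck time for each. The function name `summarize_lockups` and the regular expression are illustrative assumptions based only on the line format shown above; they are not part of any tool.

```python
import re
from collections import defaultdict

# Matches kernel watchdog lines in the format seen in this issue, e.g.:
# [920289.653493] watchdog: BUG: soft lockup - CPU#15 stuck for 3707s! [cifsd:17906]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<comm>.+):(?P<pid>\d+)\]"
)


def summarize_lockups(log_text):
    """Return {(cpu, comm, pid): longest reported stuck time in seconds}."""
    worst = defaultdict(int)
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            key = (int(m["cpu"]), m["comm"], int(m["pid"]))
            worst[key] = max(worst[key], int(m["secs"]))
    return dict(worst)


# Three lines copied from the log above.
sample = """\
[920285.608500] watchdog: BUG: soft lockup - CPU#1 stuck for 3703s! [.NET ThreadPool:2647689]
[920289.653493] watchdog: BUG: soft lockup - CPU#15 stuck for 3707s! [cifsd:17906]
[920317.653421] watchdog: BUG: soft lockup - CPU#15 stuck for 3733s! [cifsd:17906]
"""

for (cpu, comm, pid), secs in sorted(summarize_lockups(sample).items()):
    print(f"CPU#{cpu:<3} {comm} (pid {pid}): stuck up to {secs}s")
```

Run against the full log, this should show the same four tasks repeating: three cifsd threads (CPUs 6, 12, 15) and one .NET ThreadPool thread (CPU 1).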

Environment:

CSI Driver version: registry.k8s.io/sig-storage/smbplugin:v1.14.0
Kubernetes version (use kubectl version): Kubernetes Version: v1.26.13+8f85140
OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 413.92.202402131523-0 (Plow)
Kernel (e.g. uname -a): 5.14.0-284.52.1.el9_2.x86_64

andyzhangx commented 4 months ago

cifsd is NOT part of this driver; it's supplied by the host.
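
As far as I know, on Linux cifsd is normally the name of a kernel thread started by the in-kernel CIFS/SMB client (the cifs module), which would explain why no cifsd binary can be found on disk. One way to check this on an affected node is to look at the flags field of /proc/<pid>/stat: kernel threads have the PF_KTHREAD bit (0x00200000) set. The sketch below parses a stat line and tests that bit; the helper names and the two sample lines are made up for illustration and are not real output from the affected cluster.

```python
PF_KTHREAD = 0x00200000  # per-task flag the Linux kernel sets on kernel threads


def parse_stat(stat_line):
    """Return (comm, ppid, flags) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    head, tail = stat_line.rsplit(")", 1)
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # After comm: fields[0] is state, fields[1] is ppid, fields[6] is flags.
    return comm, int(fields[1]), int(fields[6])


def is_kernel_thread(stat_line):
    """True if the flags field has the PF_KTHREAD bit set."""
    _, _, flags = parse_stat(stat_line)
    return bool(flags & PF_KTHREAD)


# Hypothetical stat lines (truncated after the flags field), for illustration only.
kthread_line = "17906 (cifsd) S 2 0 0 0 -1 2129984"
user_line = "2647689 (.NET ThreadPool) S 1 2647000 2647000 0 -1 4194304"

print(parse_stat(kthread_line)[0], is_kernel_thread(kthread_line))
print(parse_stat(user_line)[0], is_kernel_thread(user_line))
```

On a real node, the input would be the contents of /proc/17906/stat (or whichever PID the watchdog reports), read with `open(...).read()`. A parent PID of 2 (kthreadd) is another hint that the task is a kernel thread.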

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

berendiwema commented 1 week ago

/close

kubernetes-csi / csi-driver-smb

On an OpenShift 4.13 cluster, the cifsd process causes CPU hangs #772