kubernetes-csi / csi-driver-smb

This driver allows Kubernetes to access SMB Server on both Linux and Windows nodes.
Apache License 2.0

On an OpenShift 4.13 cluster, the cifsd process causes CPU hangs #772

Closed berendiwema closed 1 week ago

berendiwema commented 4 months ago

I'm not sure whether cifsd is part of this driver or is supplied by the host OS. I wasn't able to locate a cifsd binary on the file system of the affected hosts, and reading the source code did not make it clear to me whether cifsd is part of this driver.

I hope someone is familiar with issues like this and knows a way to mitigate them.

What happened: On several nodes in an OpenShift 4.13 cluster, we see CPUs hanging (kernel soft lockups) that appear to be caused by cifsd.

What you expected to happen: cifsd does not cause hanging CPUs.

How to reproduce it: Difficult. It looks like network issues cause the share to hang or a lock to time out, but we haven't been able to pinpoint the trigger.

Anything else we need to know?: System logs show:

[920285.608500] watchdog: BUG: soft lockup - CPU#1 stuck for 3703s! [.NET ThreadPool:2647689]
[920289.653493] watchdog: BUG: soft lockup - CPU#15 stuck for 3707s! [cifsd:17906]
[920301.643461] watchdog: BUG: soft lockup - CPU#12 stuck for 3595s! [cifsd:18471]
[920305.624468] watchdog: BUG: soft lockup - CPU#6 stuck for 2295s! [cifsd:18190]
[920313.608432] watchdog: BUG: soft lockup - CPU#1 stuck for 3729s! [.NET ThreadPool:2647689]
[920317.653421] watchdog: BUG: soft lockup - CPU#15 stuck for 3733s! [cifsd:17906]
[920322.866740] systemd[1]: Failed to start Journal Service.
[920329.643386] watchdog: BUG: soft lockup - CPU#12 stuck for 3621s! [cifsd:18471]
[920331.397393] rcu: INFO: rcu_preempt self-detected stall on CPU
[920331.402183] rcu:     15-....: (4019573 ticks this GP) idle=93d/1/0x4000000000000000 softirq=100792278/100814899 fqs=928269 
[920332.922398] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... 15-... } 4019285 jiffies s: 4880525 root: 0x8002/.
[920332.924796] rcu: blocking rcu_node structures (internal RCU debug):
[920333.624382] watchdog: BUG: soft lockup - CPU#6 stuck for 2321s! [cifsd:18190]
[920341.608365] watchdog: BUG: soft lockup - CPU#1 stuck for 3755s! [.NET ThreadPool:2647689]
[920357.643318] watchdog: BUG: soft lockup - CPU#12 stuck for 3647s! [cifsd:18471]
[920357.653318] watchdog: BUG: soft lockup - CPU#15 stuck for 3770s! [cifsd:17906]
[920361.624311] watchdog: BUG: soft lockup - CPU#6 stuck for 2347s! [cifsd:18190]
[920369.608292] watchdog: BUG: soft lockup - CPU#1 stuck for 3781s! [.NET ThreadPool:2647689]
[920385.643249] watchdog: BUG: soft lockup - CPU#12 stuck for 3673s! [cifsd:18471]
[920385.653254] watchdog: BUG: soft lockup - CPU#15 stuck for 3796s! [cifsd:17906]
[920389.624241] watchdog: BUG: soft lockup - CPU#6 stuck for 2373s! [cifsd:18190]
[920397.608226] watchdog: BUG: soft lockup - CPU#1 stuck for 3808s! [.NET ThreadPool:2647689]
[920398.458234] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... 15-... } 4084821 jiffies s: 4880525 root: 0x8002/.
[920398.460628] rcu: blocking rcu_node structures (internal RCU debug):
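To make the log above easier to read, here is a minimal sketch that groups the soft-lockup lines by CPU and task and keeps the longest reported stall for each. The function name `summarize_soft_lockups` and the `SAMPLE` constant are hypothetical helpers for illustration; the regular expression only assumes the line format shown in the log above.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [920289.653493] watchdog: BUG: soft lockup - CPU#15 stuck for 3707s! [cifsd:17906]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>.+):(?P<pid>\d+)\]"
)

# A few lines copied from the log above, used as sample input.
SAMPLE = """\
[920285.608500] watchdog: BUG: soft lockup - CPU#1 stuck for 3703s! [.NET ThreadPool:2647689]
[920289.653493] watchdog: BUG: soft lockup - CPU#15 stuck for 3707s! [cifsd:17906]
[920313.608432] watchdog: BUG: soft lockup - CPU#1 stuck for 3729s! [.NET ThreadPool:2647689]
[920322.866740] systemd[1]: Failed to start Journal Service.
"""


def summarize_soft_lockups(log_text):
    """Return {(cpu, task, pid): longest stall in seconds} for soft-lockup lines."""
    worst = defaultdict(int)
    for line in log_text.splitlines():
        m = SOFT_LOCKUP.search(line)
        if m is None:
            continue  # skip rcu, systemd, and other unrelated lines
        key = (int(m["cpu"]), m["task"], int(m["pid"]))
        worst[key] = max(worst[key], int(m["secs"]))
    return dict(worst)


if __name__ == "__main__":
    for (cpu, task, pid), secs in sorted(summarize_soft_lockups(SAMPLE).items()):
        print(f"CPU#{cpu} {task} (pid {pid}): stuck for up to {secs}s")
```

Feeding it the full output of `dmesg` should show at a glance which CPUs are pinned by cifsd and which by other tasks.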

Environment:

andyzhangx commented 4 months ago

cifsd is NOT part of this driver; it's supplied by the host.
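For context: on Linux, cifsd is normally a kernel thread started by the in-kernel CIFS/SMB client (the `cifs` module) for each server connection, which would explain why no binary with that name exists on the node's file system. Below is a minimal sketch for checking whether a given PID is a kernel thread. It assumes the standard Linux `/proc/<pid>/stat` layout and the kernel's `PF_KTHREAD` flag value; the helper names and the sample stat lines in the comments are hypothetical and not part of this driver.

```python
PF_KTHREAD = 0x00200000  # kernel-thread flag from the kernel's include/linux/sched.h


def is_kernel_thread_stat(stat_line):
    """Return True if a /proc/<pid>/stat line describes a kernel thread."""
    # The comm field is wrapped in parentheses and may contain spaces,
    # so everything after the last ')' holds the remaining fields.
    _, _, rest = stat_line.rpartition(")")
    fields = rest.split()
    # Field order after comm: state, ppid, pgrp, session, tty_nr, tpgid, flags
    flags = int(fields[6])
    return bool(flags & PF_KTHREAD)


def is_kernel_thread(pid):
    """Read /proc/<pid>/stat on the local node and check the kernel-thread flag."""
    with open(f"/proc/{pid}/stat") as f:
        return is_kernel_thread_stat(f.read())


if __name__ == "__main__":
    # Made-up stat lines for illustration only; real lines have many more fields.
    print(is_kernel_thread_stat("17906 (cifsd) S 2 0 0 0 -1 2129984 0 0 0 0"))
    print(is_kernel_thread_stat("2647689 (.NET ThreadPool) S 1 1 1 0 -1 4194304 0 0"))
```

Run on an affected node (for example from a debug shell), `is_kernel_thread(17906)` would be expected to return `True` for a cifsd thread, although I have not verified this on OpenShift 4.13.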

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 week ago

/lifecycle rotten

berendiwema commented 1 week ago

/close