Closed dasavick closed 2 years ago
Hi @PinkLolicorn! Thanks for the detailed report on this! As it turns out the initial failure on the client is a known issue https://github.com/hashicorp/nomad/issues/11477, which I've already got a fix in-flight for in https://github.com/hashicorp/nomad/pull/12113. That'll ship in the upcoming Nomad 1.3.0 (and get backported to 1.2.x and 1.1.x).
It looks like on the server you're seeing https://github.com/hashicorp/nomad/issues/11758 which is going to be fixed in https://github.com/hashicorp/nomad/pull/12114.
I'm less sure about the error:
server is locked
because that's bubbling up from the Hetzner plugin you're using.
You might be able to recover from the looping volumewatcher errors by ensuring the plugin is running on the old node.
@PinkLolicorn I'm going to close this as a duplicate because the underlying issue is known in another issue and staged to be fixed, but I'm happy to keep chatting about how to fix up the volumewatcher error if possible.
Because that's bubbling up from the Hetzner plugin you're using.
Seems that a moment later the error was gone. I thought it might have been the VM that was turned off, but it looks like the action failed on two separate VMs. My best guess is that this was the detach action on the Hetzner side not yet being cleared after the previous attempt.
You might be able to recover from the looping volumewatcher errors by ensuring the plugin is running on the old node.
Unfortunately this does not seem to work. The two nodes that were turned off show the task as "running", and manually checking the containers reveals nothing obviously wrong:
root@hashi-fsn-client-1:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5f58c3d65be2 hetznercloud/hcloud-csi-driver:1.6.0 "/bin/hcloud-csi-dri…" 15 hours ago Up 15 hours plugin-044bf034-ae3e-26f5-7ea0-0b868dc1e60d
root@hashi-fsn-client-1:~# docker logs --tail 1000 plugin-044bf034-ae3e-26f5-7ea0-0b868dc1e60d
level=info ts=2022-02-24T00:02:41.496054884Z msg="fetched server" server-name=hashi-fsn-client-1
root@hashi-fsn-client-3:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
351e8422c506 hetznercloud/hcloud-csi-driver:1.6.0 "/bin/hcloud-csi-dri…" 15 hours ago Up 15 hours plugin-c06f6743-ea63-ce1b-13e7-6f53b53a1d3e
root@hashi-fsn-client-3:~# docker logs --tail 1000 plugin-c06f6743-ea63-ce1b-13e7-6f53b53a1d3e
level=info ts=2022-02-24T00:31:05.252738453Z msg="fetched server" server-name=hashi-fsn-client-3
The cluster leader server is still logging the same errors:
Feb 24 15:13:11 hashi-fsn-server-1 nomad[1790]: 2022-02-24T15:13:11.808Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=MY_VOLUME
Feb 24 15:17:11 hashi-fsn-server-1 nomad[1790]: error=
Feb 24 15:17:11 hashi-fsn-server-1 nomad[1790]: | 2 errors occurred:
Feb 24 15:17:11 hashi-fsn-server-1 nomad[1790]: | * could not detach from node: node detach volume: CSI.NodeDetachVolume: context deadline exceeded
Feb 24 15:17:11 hashi-fsn-server-1 nomad[1790]: | * could not detach from node: node detach volume: CSI.NodeDetachVolume: context deadline exceeded
Feb 24 15:17:11 hashi-fsn-server-1 nomad[1790]: |
Feb 24 15:17:11 hashi-fsn-server-1 nomad[1790]:
The UI is still showing 2/4 controllers/nodes.
This is a bit more confusing: 4 instances, but 2 per client with the same ID? There is only a single CSI plugin container running on any of the clients.
root@hashi-fsn-client-4:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
142bbb0863eb hetznercloud/hcloud-csi-driver:1.6.0 "/bin/hcloud-csi-dri…" 16 hours ago Up 16 hours plugin-49261906-b8a4-8c4a-663a-7d69511570ab
root@hashi-fsn-client-2:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
843c9d8cb5ea mysql:5-debian "docker-entrypoint.s…" 15 hours ago Up 15 hours 33060/tcp, ADDR:PORT->3306/tcp, ADDR:PORT->3306/udp mysql-babf54f5-6ab0-5c14-525e-ff57e063b405
99af28c5d142 redis:6 "docker-entrypoint.s…" 15 hours ago Up 15 hours ADDR:PORT->6379/tcp, ADDR:PORT->6379/udp redis-7768653d-6f82-c43e-2cf8-ba1c145b2046
9188a27b4fbf hetznercloud/hcloud-csi-driver:1.6.0 "/bin/hcloud-csi-dri…" 16 hours ago Up 16 hours plugin-f8d7ff25-627d-edbc-f923-7b99c8df0bb3
And the job itself:
As it turns out the initial failure on the client is a known issue https://github.com/hashicorp/nomad/issues/11477, which I've already got a fix in-flight for in https://github.com/hashicorp/nomad/pull/12113. That'll ship in the upcoming Nomad 1.3.0 (and get backported to 1.2.x and 1.1.x).
That's great! Any schedule for the 1.3.0 release and backports?
Seems that a moment later the error was gone. I thought it might have been the VM that was turned off, but it looks like the action failed on two separate VMs. My best guess is that this was the detach action on the Hetzner side not yet being cleared after the previous attempt.
Ok, the plugin allocation logs you have suggest that Nomad is sending the detach RPCs but they're failing. If you're not seeing the "server is locked" error anymore, are you seeing logs for the detach RPCs?
This is a bit more confusing: 4 instances, but 2 per client with the same ID? There is only a single CSI plugin container running on any of the clients.
Your plugin is of type monolith, which means it serves both the controller and node RPCs. These are treated as separate plugin instances because Nomad (or k8s, for that matter) has to decide which plugin instance to talk to differently for each role. That has a slightly goofy side effect in how it gets modeled in the web UI.
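To illustrate the two roles a monolith plugin combines: the same driver could instead be registered as two separate jobs, one per role. This is a purely hypothetical sketch (job and group names here are mine, not from this thread; the `csi_plugin` syntax is the real Nomad job-spec stanza used elsewhere in this issue):

```hcl
# Hypothetical controller half: a small service job, typically 1-2 instances.
job "csi-hetzner-controller" {
  type = "service"

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:1.6.0"
        privileged = true
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "controller" # fingerprint only the controller RPCs
        mount_dir = "/csi"
      }
    }
  }
}

# Hypothetical node half: a system job so every client runs a node instance.
job "csi-hetzner-node" {
  type = "system"

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:1.6.0"
        privileged = true
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "node" # fingerprint only the node RPCs
        mount_dir = "/csi"
      }
    }
  }
}
```

With `type = "monolith"`, each running container counts as both a controller and a node instance, which is why the UI can show four instances for two containers.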
Any schedule for the 1.3.0 release and backports?
I can't give dates for our releases, but I can say it's the currently active line of work across the whole team. In the meantime, feel free to check out the main branch and give a build a try!
My previous bugged state was "wiped" due to other experiments on the cluster, but after a quick dive into Hetzner's CSI plugin, I was able to figure out the meaning of the status:
ErrorCodeLocked ErrorCode = "locked" // Item is locked (Another action is running)
Bringing back part of the log (my knowledge about the architecture is limited), I think we can see:
1) attach action failing with "volume is attached" (client-1, then client-2, both previously dead; targeting client-2)
2) detach action failing with "server is locked" (client-2; targeting self, previously dead?)
3) detach action failing with "server is locked" (client-4; targeting client-1, previously dead)
4) other actions that I'm not sure what to think of, because I might have detached the volume manually at this time
level=info ts=2022-02-23T23:17:13.563580702Z msg="fetched server" server-name=hashi-fsn-client-4
level=info ts=2022-02-23T23:36:47.371589294Z component=api-volume-service msg="attaching volume" volume-id=MY_VOLUME_ID server-id=client-1
level=info ts=2022-02-23T23:36:47.559393123Z component=api-volume-service msg="volume is already attached to another server" volume-id=MY_VOLUME_ID server-id=client-2
level=error ts=2022-02-23T23:36:47.603606394Z component=grpc-server msg="handler failed" err="rpc error: code = FailedPrecondition desc = failed to publish volume: volume is attached"
level=info ts=2022-02-23T23:38:47.650302102Z component=api-volume-service msg="attaching volume" volume-id=MY_VOLUME_ID server-id=client-3
level=info ts=2022-02-23T23:38:47.884207741Z component=api-volume-service msg="volume is already attached to another server" volume-id=MY_VOLUME_ID server-id=client-2
level=error ts=2022-02-23T23:38:47.925305127Z component=grpc-server msg="handler failed" err="rpc error: code = FailedPrecondition desc = failed to publish volume: volume is attached"
level=info ts=2022-02-23T23:39:20.971902903Z component=api-volume-service msg="detaching volume from server" volume-id=MY_VOLUME_ID server-id=client-2
level=info ts=2022-02-23T23:39:21.07116093Z component=api-volume-service msg="failed to detach volume" volume-id=MY_VOLUME_ID err="cannot perform operation because server is locked (locked)"
level=error ts=2022-02-23T23:39:21.07122513Z component=grpc-server msg="handler failed" err="rpc error: code = Unavailable desc = failed to unpublish volume: server is locked"
level=info ts=2022-02-23T23:42:47.973406664Z component=api-volume-service msg="attaching volume" volume-id=MY_VOLUME_ID server-id=client-4
level=info ts=2022-02-23T23:47:18.516977647Z component=api-volume-service msg="detaching volume from server" volume-id=MY_VOLUME_ID server-id=client-1
level=info ts=2022-02-23T23:47:18.629824577Z component=api-volume-service msg="failed to detach volume" volume-id=MY_VOLUME_ID err="cannot perform operation because server is locked (locked)"
level=error ts=2022-02-23T23:47:18.629904217Z component=grpc-server msg="handler failed" err="rpc error: code = Unavailable desc = failed to unpublish volume: server is locked"
level=info ts=2022-02-23T23:47:21.627957436Z component=api-volume-service msg="detaching volume from server" volume-id=MY_VOLUME_ID server-id=client-1
level=info ts=2022-02-23T23:47:21.669399184Z component=api-volume-service msg="volume not attached to a server" volume-id=MY_VOLUME_ID
level=info ts=2022-02-23T23:48:44.41762282Z component=api-volume-service msg="attaching volume" volume-id=MY_VOLUME_ID server-id=client-4
level=info ts=2022-02-23T23:48:44.64604037Z component=api-volume-service msg="volume is already attached to another server" volume-id=MY_VOLUME_ID server-id=client-2
level=error ts=2022-02-23T23:48:44.67891215Z component=grpc-server msg="handler failed" err="rpc error: code = FailedPrecondition desc = failed to publish volume: volume is attached"
level=info ts=2022-02-23T23:48:45.417872798Z component=api-volume-service msg="detaching volume from server" volume-id=MY_VOLUME_ID server-id=client-2
???
level=info ts=2022-02-23T23:51:48.163935969Z component=api-volume-service msg="attaching volume" volume-id=MY_VOLUME_ID server-id=client-2
level=info ts=2022-02-23T23:51:48.392417837Z component=api-volume-service msg="volume is already attached to another server" volume-id=MY_VOLUME_ID server-id=client-1
level=error ts=2022-02-23T23:51:48.426822191Z component=grpc-server msg="handler failed" err="rpc error: code = FailedPrecondition desc = failed to publish volume: volume is attached"
So, to my understanding, it just means that any action in the scope of a specific server that was not yet cleared internally can yield this. A bit weird, but it would make sense: at first client-1/2 targeted an attach to client-2, so with 1 & 2 temporarily locked, seeing two "server is locked" messages for detach actions targeting them makes sense.
I wonder whether, with better handling in the CSI plugin or more delay between the attach and detach cycles, the cluster would have recovered correctly from the failure, at least partially. That also makes me wonder why this is even possible: we already got the answer "volume is attached", so the next action should be free to go. I think I'm going to try to reproduce this without Nomad and open a separate issue for hetznercloud/csi-driver later.
I wonder whether, with better handling in the CSI plugin or more delay between the attach and detach cycles, the cluster would have recovered correctly from the failure, at least partially. That also makes me wonder why this is even possible: we already got the answer "volume is attached", so the next action should be free to go. I think I'm going to try to reproduce this without Nomad and open a separate issue for hetznercloud/csi-driver later.
The concurrency spec for CSI is a little "fuzzy" in that the CO (Nomad) is supposed to avoid sending multiple in-flight requests for a given volume, and the plugin is supposed to handle that case gracefully anyway. I suspect the authors of the plugin are relying on the k8s control-loop architecture to just go ahead and eventually re-sync, which we didn't have in place in Nomad (but the Nomad 1.3.0 patches get us to an equivalent behavior).
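As a hypothetical sketch of the CO-side serialization the spec asks for (this is illustrative only, not Nomad's actual implementation; all names here are made up), the idea is simply "at most one in-flight request per volume ID, reject rather than race":

```go
package main

import (
	"fmt"
	"sync"
)

// volumeLocks tracks which volume IDs currently have an in-flight
// request. A second caller for the same volume is rejected outright
// instead of being sent concurrently to the storage provider, which
// is roughly what avoids the "server is locked" race seen above.
type volumeLocks struct {
	mu    sync.Mutex
	inUse map[string]bool
}

func newVolumeLocks() *volumeLocks {
	return &volumeLocks{inUse: make(map[string]bool)}
}

// tryAcquire reports whether the caller now owns the volume's slot.
func (v *volumeLocks) tryAcquire(volID string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.inUse[volID] {
		return false
	}
	v.inUse[volID] = true
	return true
}

// release frees the slot once the attach/detach RPC has completed.
func (v *volumeLocks) release(volID string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	delete(v.inUse, volID)
}

func main() {
	locks := newVolumeLocks()
	fmt.Println(locks.tryAcquire("vol-1")) // first claim wins
	fmt.Println(locks.tryAcquire("vol-1")) // concurrent claim is rejected, not queued
	locks.release("vol-1")
	fmt.Println(locks.tryAcquire("vol-1")) // free again after release
}
```

A control-loop architecture like k8s's tolerates skipping this, because a rejected or raced request just gets retried on the next sync; without that loop, the rejection has to be explicit.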
I'd definitely be interested to pop in to any issue you open up on the driver repo, so feel free to ping me there!
In the meantime, feel free to check out the main branch and give a build a try!
Just tried this, but I have trouble getting my CSI job to run on the dev version. Were there any changes to the environment variables recently? No matter if I set CSI_ENDPOINT or not, the plugin does not pick it up correctly. It worked just fine before with the same job file.
level=error ts=2022-03-10T14:30:21.717956732Z msg="endpoint must start with unix://"
// Excerpted from the hcloud-csi-driver source; import block added here
// so the snippet is self-contained.
import (
	"errors"
	"fmt"
	"net"
	"os"
	"strings"
)

// CreateListener creates and binds the unix socket in the location specified by the CSI_ENDPOINT environment variable.
func CreateListener() (net.Listener, error) {
	endpoint := os.Getenv("CSI_ENDPOINT")
	if endpoint == "" {
		return nil, errors.New("you need to specify an endpoint via the CSI_ENDPOINT env var")
	}
	if !strings.HasPrefix(endpoint, "unix://") {
		return nil, errors.New("endpoint must start with unix://")
	}
	endpoint = endpoint[7:] // strip unix://
	if err := os.Remove(endpoint); err != nil && !os.IsNotExist(err) {
		return nil, errors.New(fmt.Sprintf("failed to remove socket file at %s: %s", endpoint, err))
	}
	return net.Listen("unix", endpoint)
}
job "system-csi-hetzner" {
  region      = "eu-central"
  datacenters = ["fsn"]
  namespace   = "system"
  type        = "system"

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:1.6.0"
        privileged = true
      }

      env {
        CSI_ENDPOINT   = "unix:///csi/csi.sock"
        HCLOUD_TOKEN   = "xyz"
        ENABLE_METRICS = false
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "monolith"
        mount_dir = "/csi"
      }

      resources {
        cpu        = 50
        memory     = 16
        memory_max = 256
      }
    }
  }
}
Also, is there any complete instruction on installing the required dependencies for the release or dev-ui builds? I was unable to find any, and I'm currently using the binary from go build.
Just tried this, but I have trouble getting my CSI job to run on the dev version. Were there any changes to the environment variables recently? No matter if I set CSI_ENDPOINT or not, the plugin does not pick it up correctly. It worked just fine before with the same job file.
Yeah, Nomad is setting the CSI_ENDPOINT variable now as of https://github.com/hashicorp/nomad/pull/12050. So it looks like the Hetzner plugin is expecting that to have the unix:// prefix, but I haven't seen it fail on other plugins (I've tested recently on AWS EBS, AWS EFS, GCP, and Democratic NFS). I feel like we probably shouldn't be setting the environment variable if it's been set by the user, though, so I'll fix that for Nomad 1.3.0 and backport it.
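A hypothetical, more tolerant variant of the plugin's check (this is not the actual plugin or Nomad code, just a sketch of how the prefix mismatch could be absorbed) would strip an optional unix:// scheme instead of rejecting its absence:

```go
package main

import (
	"fmt"
	"strings"
)

// parseEndpoint accepts both "unix:///csi/csi.sock" and a bare
// "/csi/csi.sock", returning the filesystem path to bind the socket at.
func parseEndpoint(endpoint string) (string, error) {
	if endpoint == "" {
		return "", fmt.Errorf("CSI_ENDPOINT is not set")
	}
	// Strip an optional unix:// scheme rather than requiring it.
	path := strings.TrimPrefix(endpoint, "unix://")
	if !strings.HasPrefix(path, "/") {
		return "", fmt.Errorf("endpoint %q is not an absolute unix socket path", endpoint)
	}
	return path, nil
}

func main() {
	for _, ep := range []string{"unix:///csi/csi.sock", "/csi/csi.sock"} {
		path, err := parseEndpoint(ep)
		fmt.Println(path, err)
	}
}
```

Both spellings resolve to the same socket path, so a plugin written this way would work whether or not the orchestrator injects the scheme.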
Also, is there any complete instruction on installing the required dependencies for the release or dev-ui builds? I was unable to find any, and I'm currently using the binary from the go build.
There's a contributing guide, but mostly what you'll want to do is run make bootstrap to bootstrap all the dependencies, and then at that point you can go build (or make dev to get a reasonable set of build flags).
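Assuming a working Go toolchain is already installed, that from-source workflow amounts to roughly the following (a sketch of the steps named above, not a complete guide):

```shell
# Build Nomad from source, per the steps mentioned above.
git clone https://github.com/hashicorp/nomad.git
cd nomad
make bootstrap   # install build-time dependencies
make dev         # or: go build
```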
I've opened https://github.com/hashicorp/nomad/pull/12257 with a patch for that, but I need to do some testing which I probably won't get to until tomorrow.
~In my tests in https://github.com/hashicorp/nomad/pull/12257 it looks like the CSI_ENDPOINT env var isn't being set at all, which is pretty weird. So I still have some work to do to figure out what's going on there.~ Or I should make sure I'm testing against a current build instead of Nomad 1.2.6 😊
Endpoint env var is fixed in #12257. I ended up going with unix:///... by default, as a quick survey of the plugins showed that was expected by more than not.
Hey @PinkLolicorn, did you manage to get it up & running using Nomad 1.3.0-rc1?
I'm also using Hetzner, but today I again noticed that a few jobs went down. I looked into the allocation and noticed this error:
failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: could not claim volume MY-VOLUME-REDACTED: rpc error: controller publish: controller attach volume: CSI.ControllerAttachVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = failed to publish volume: Get "https://api.hetzner.cloud/v1/volumes/VOLUME-ID-REDACTED": unexpected EOF
It seems, though, as if it might be a Hetzner CSI related problem 🤔
Hey @rwojsznis, I actually tested this version yesterday with no success. However, I did not notice this exact error. Here is my setup:
nomad run csi-hetzner.nomad
job "system-csi-hetzner" {
  region      = "eu-central"
  datacenters = ["fsn"]
  type        = "system"

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:1.6.0"
        privileged = true
      }

      env {
        CSI_ENDPOINT   = "unix:///csi/csi.sock"
        HCLOUD_TOKEN   = "xyz"
        ENABLE_METRICS = false
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "monolith"
        mount_dir = "/csi"
      }

      resources {
        cpu        = 50
        memory     = 16
        memory_max = 256
      }
    }
  }
}
nomad volume register myvol.nomad
id          = "myvol"
name        = "myvol"
external_id = 1234
type        = "csi"
plugin_id   = "csi.hetzner.cloud"

capacity_min = "10G"
capacity_max = "20G"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
}
My volumes do mount, but I cannot get Nomad to reschedule them when a client goes down, and it is even worse than before, because a manual detach from the Hetzner UI does not help it recover.
The only solution is to stop the job, deregister the volume, register it again, and then start the job. At least clients do not go into an unhealthy state forever now.
It basically repeats forever that max claims are reached and that it cannot unmount the volume:
nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Running nomad system gc makes the UI look OK, but does not help in reality:
* Constraint "CSI volume XYZ has exhausted its available writer claims and is claimed by a garbage collected allocation b3c9ce46-ca63-d318-efaa-a5ba6f6538fc; waiting for claim to be released": 2 nodes excluded by filter
As stated before, even if the volume is unmounted from the Hetzner UI, Nomad does not recover from this state. I also spotted some funky-looking logs from the hetzner-csi-plugin:
level=info ts=2022-05-09T23:26:41.438790658Z msg="fetched server" server-name=hashi-fsn-client-1
level=info ts=2022-05-09T23:28:23.355776305Z component=api-volume-service msg="detaching volume from server" volume-id=17183505 server-id=TARGET_SERVER
level=info ts=2022-05-09T23:28:23.404601321Z component=api-volume-service msg="volume not attached to a server" volume-id=17183505
level=info ts=2022-05-09T23:29:19.749042631Z component=api-volume-service msg="attaching volume" volume-id=17183505 server-id=CURRENT_SERVER
This looks confusing because:
1) detaching is done with the server-id of the allocation's target client, not with the old allocation's node
2) it prints some failure; I'm not sure if this is from detaching, but if so, I wonder if the previous log means it tries to detach the volume only if it is attached to the specified server-id
3) attaching the volume has a server-id which is the old allocation's client
@tgross Nomad should move the volume to the new client after the job has been rescheduled due to a failure, right? 😆
My volumes do mount, but I cannot get nomad to reschedule them in case where client goes out, and it is even worse than before, because manual detach from the hetzner UI does not help to recover.
I have pretty much the same setup (volume & CSI plugin) and pretty much the same issue: the volume doesn't get rescheduled once the client goes down (1.3.0-rc1). I even have a systemd service that tries to do a clean Nomad shutdown (drain the node) before reboot/shutdown, but that seems to have no effect 🤔
Once I have a little bit more time I will try to provide more feedback, but it's good to hear (?) that I'm not the only one still having this problem 😅
Hey folks! The old node must be live and have a running plugin on it in order for Nomad to cleanly unmount the volume. You should drain a node before shutting it down, so that the tasks are rescheduled and the plugins are left in place until all the other tasks have been rescheduled. If you don't, you need to do a volume detach manually.
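A sketch of that operational sequence with the Nomad CLI (the node and volume IDs are placeholders; verify the flags against your Nomad version's CLI docs):

```shell
# 1. Drain the node before shutting it down; -ignore-system leaves
#    system jobs (such as a monolith/node CSI plugin) running until
#    the other allocations have been rescheduled.
nomad node drain -enable -ignore-system -yes <node-id>

# 2. Shut the node down only after the drain completes.

# 3. If a node was lost without a drain, release the claim by hand:
nomad volume detach <volume-id> <node-id>
```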
The old node must be live and have a running plugin on it in order for Nomad to cleanly unmount the volume.
@tgross OK, but shouldn't Nomad at least recover after the client is back online? This is not the case in my current setup. I noticed some errors like the one below, but I'm unsure of their exact context; I think these were logged when the failed node was turned on again:
missing external node ID: failed to find storage provider info for client "765bedd9-2897-1089-42c8-5de6f84b2110", node plugin "csi.hetzner.cloud" is not running or has not fingerprinted on this client
Although they seem unimportant, because they're probably from the time before the CSI container started. I just feel like bringing the client back shouldn't prevent everything from going back to normal without doing a manual deregister.
but shouldn't nomad at least recover after the client is back online?
I just feel like bringing the client back shouldn't prevent everything from going back to normal without doing deregister manually.
It shouldn't so long as the plugin is alive and healthy. You can't crash-out the client without also making sure the node plugin is live and expect Nomad can clean this situation up; Nomad can't do anything with the volumes without a plugin in place. That's a constraint of the CSI specification.
It doesn't look like the plugin is healthy, given the log entry you're showing. I'm happy to revisit this issue, but there's not a lot I can do with one-liner log snippets. Can you provide specific steps to reproduce, plus the logs for the client, server, and plugins? Allocation status for the plugins would also help.
It shouldn't so long as the plugin is alive and healthy. You can't crash-out the client without also making sure the node plugin is live and expect Nomad can clean this situation up; Nomad can't do anything with the volumes without a plugin in place. That's a constraint of the CSI specification.
@tgross in my current cluster state all clients are online and the plugins are too, and the volume is unmounted in the Hetzner UI, but the allocation still can't take place.
I noticed that the volume is at 1 allocation, but none is shown, which would make sense considering the displayed error. Apparently I got it previously but did not expect it to be as important as it appears now (noted in a previous message):
Constraint "CSI volume XYZ has exhausted its available writer claims and is claimed by a garbage collected allocation b3c9ce46-ca63-d318-efaa-a5ba6f6538fc; waiting for claim to be released": 2 nodes excluded by filter
I think this state was caused by running nomad system gc manually whilst one client (with the job previously using this volume) was turned off.
(Screenshots: Plugin, Storage, Job)
I tried this:
It appears to be stuck for now, with the other previously seen error.
nomad system gc
Apparently this made it work? I think the difference is that the state is failed and not lost?
After turning the client off (looks ok?):
After turning the client on (marked as failed by the client?):
After waiting some time, the job:
failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: could not claim volume test-mysql: rpc error: rpc error: controller publish: failed to find storage provider info for client "281c57a2-bcfd-97aa-22ab-d92a88c1af76", node plugin "csi.hetzner.cloud" is not running or has not fingerprinted on this client
After running nomad system gc:
Hi @PinkLolicorn! Yeah, I think the "lost" state is probably the problem here. An allocation gets marked as lost once the client has missed a heartbeat (unless group.max_client_disconnect is set, for Nomad 1.3.0+), and for a lost client there's no route to the plugin, so we can't run a disconnect at the time we try to reschedule.
When the client reconnects, we should see it marked as terminal, and that should allow for the "volume watcher" control loop to clean it up without having to wait for GC. I've got an open issue (in a private repo, I'll try to get it moved over) to have the volume watcher detect events from allocations and not just volumes, but that's going to be a bit of a lift to complete. In any case, I would expect that once the allocation moves from lost to terminal (failed or complete), we'd be able to free the claim. I'll see if I can reproduce that specific behavior in a test rig.
As you noticed, though, the GC job does clean things up once it executes. The window for that is controlled by CSIVolumeClaimGCInterval, which as you saw is 5 min. That's not user-configurable the way job_gc_threshold is, but not for any particular reason other than that we never got around to it. I've opened https://github.com/hashicorp/nomad/issues/12952 to expose that.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Nomad version
Operating system and Environment details
Ubuntu 20.04.3 LTS running on CPX11 VMs from Hetzner
Issue
Jobs using a CSI volume do not recover from a client failure without human intervention, and require the volume to be manually detached from the client using the provider's web UI. The client's storage controller/node health does not return to normal after the instance is back online. The process of moving the job to another node works with draining (excluding system jobs), but not on failure.
Reproduction steps
hetznercloud/hcloud-csi-driver:1.6.0
as system job
Additional steps
*Failed allocations are gone after some time.
Expected Result
Actual Result
**After the night these all returned back to running, but the CSI controllers/nodes are still not considered healthy.
Job file (CSI plugin)
Job file (MySQL using CSI)
Job failure message
CSI plugin logs
Nomad Server logs
Nomad server logs (next day)
Seems like the volumes watcher is still trying to detach the volume, even though the volume currently shows only one running allocation, on the correct node. The log shows two errors, and that would correspond to the two previously turned-off clients.
Related
https://discuss.hashicorp.com/t/csi-plugin-health-after-reboot/24723 - similar issue with 2/3 healthy after reboot and "context deadline exceeded" messages (using kubernetes-csi/csi-driver-nfs/)