Constraint "CSI volume has exhausted its available writer claims": 1 nodes excluded by filter

Nomad version

Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)

Operating system and Environment details

Ubuntu 20.04 LTS

Issue

Cannot re-plan jobs due to CSI volumes being claimed. I have seen many variations about this issue. I don't know how to debug it. I use ceph-csi plugin to deploy system job on my two Nomad nodes. This result in two controllers and two ceph-csi nodes. I then create a few volumes using nomad volume create command. I then create a job with three tasks that use three volumes. Sometimes, after a while the job may fail, and I stop it. After that when I try to replan the exact same job I get that error.

What confuses me is the warning. It differs every time I run job plan. First I saw

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-data has exhausted its available writer claims": 2 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-data has exhausted its available writer claims": 2 nodes excluded by filter

Then, runnig job plan again a few seconds after, I got

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-datalog has exhausted its available writer claims": 2 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-datalog has exhausted its available writer claims": 2 nodes excluded by filter

Then again,

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-data has exhausted its available writer claims": 1 nodes excluded by filter
    * Constraint "CSI volume zookeeper1-datalog has exhausted its available writer claims": 1 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-datalog has exhausted its available writer claims": 2 nodes excluded by filter

I have three groups: zookeeper1, zookeeper2, and zookeeper3, each using two volumes (data and datalog). I will just assume from this log that all volumes are non-reclaimable.

This is the output of nomad volume status.

Container Storage Interface
ID                           Name                Plugin ID  Schedulable  Access Mode
zookeeper1-data     zookeeper1-data     ceph-csi   true         single-node-writer
zookeeper1-datalog  zookeeper1-datalog  ceph-csi   true         single-node-writer
zookeeper2-data     zookeeper2-data     ceph-csi   true         single-node-writer
zookeeper2-datalog  zookeeper2-datalog  ceph-csi   true         single-node-writer
zookeeper3-data     zookeeper3-data     ceph-csi   true         <none>
zookeeper3-datalog  zookeeper3-datalog  ceph-csi   true         <none>

It says that they are schedulable. This is the output of nomad volume status zookeeper1-datalog:

ID                   = zookeeper1-datalog
Name                 = zookeeper1-datalog
External ID          = 0001-0024-72f28a72-0434-4045-be3a-b5165287253f-0000000000000003-72ec315b-e9f5-11eb-8af7-0242ac110002
Plugin ID            = ceph-csi
Provider             = cephfs.nomad.example.com
Version              = v3.3.1
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 2
Nodes Expected       = 2
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

It says there, there are no allocations placed.

Reproduction steps

This is unfortunately flaky. But most likely happen due to job failing and then stopped and then replanned. This persists even after I purge the job with nomad job stop -purge. No, doing nomad system gc, nomad system reconcile summary, or restarting Nomad does not work.

Expected Result

Should be able to reclaim the volume again without having to detach or deregister -force and register again. I created the volumes using nomad volume create so those volumes have their external IDs all generated. There are 6 volumes and 2 nodes, I don't want to type detach 12 times everytime this happens (this happens so frequently).

Actual Result

See error logs above.

Job file (if appropriate)

I have three groups (zookeeper1, zookeeper2, zookeeper3) each having volume stanza like this (each with their own volumes, this one is for zookeeper2):

    volume "data" {
      type = "csi"
      read_only = false
      source = "zookeeper2-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "zookeeper2-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

All groups have count = 1.

...perhaps in the short term we could have a way to "force" detach volumes? I'm racking up quite a number of them :(

@dcarbone - check out the nomad volume deregister -force and nomad volume create flow, it's what solves my problems.

FWIW, a very reliable way of triggering this problem is to try to update a job in-place. Update a template stanza, or one of the service.tags values. When I try to do this with Terraform, there is around a 1/3 chance that it'll cause at least one of the volumes to become "stuck".

Thanks @dcarbone. This is particularly interesting because the client-side hook doesn't run at all when we do an in-place update of the allocation (which it's a different kind of bug on its own, and why I opened https://github.com/hashicorp/nomad/issues/11786). If the allocation isn't being rescheduled I wouldn't expect the CSI claims to change, so that helps narrow down where the behavior is going wrong. Or adds a new place where it's going wrong.

Ok, I've been away for a couple days but I wanted to give a status update on today's investigation.

I discovered today that our Controller RPC for CreateVolume doesn't seem to be working correctly with democratic-csi, or at least not with NFS. The directory on the NFS server isn't being created as I'd expect so the mount fails. This was the cause of the alloc failure in https://github.com/hashicorp/nomad/issues/10927#issuecomment-1013467062. I'll need to debug that as well and see whether it's at our side or the plugin's side (probably ours, tbh). But that's not strictly related to the issue here; it just happened to be what caused the initial alloc failure.
I tried to reproduce in-place updates causing the issue as @dcarbone reported, but I haven't been able to reproduce that for anything that's truly an in-place update (ex. service name update) and doesn't cause a reschedule.
There's a method in the state store code CSIVolumeDenormalizeTxn where we "denormalize" allocations for a volume by querying their current state. We backfill claims there for backwards compatibility with pre 1.1.0 state, but I noticed we're not taking advantage of that moment to ensure that any nil allocations are in the PastClaims list as well.
I was able to get an interesting reproduction where a single-node-writer job was stopped + purged, and then a new job claimed the same volume in multi-node-multi-writer mode (with 2 allocs). This resulted in the following:

$ nomad volume status csi-volume-nfs0
ID                   = csi-volume-nfs0
Name                 = csi-volume-nfs0
External ID          = csi-volume-nfs0
Plugin ID            = org.democratic-csi.nfs
Provider             = org.democratic-csi.nfs
Version              = 1.4.3
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID                                    Node ID                               Task Group  Version  Desired  Status    Created    Modified
399b8755-a1c5-2c2b-8d69-3e635947f154  ef15552c-6a06-bf35-4e6b-4cb37b624e08  web         7        stop     complete  8m47s ago  6m23s ago
93a9b426-ce6b-224c-297d-f9f0e9dc02bf  1d682e06-3776-55cb-38b6-8d53a83cb336  web         0        run      running   2m48s ago  2m37s ago
4e4fdd6e-1202-31f9-bff4-6e377672f106  b05f0de6-7cbd-dc6f-3335-b83ad15b1696  web         0        run      running   2m48s ago  2m37s ago

Note that the access and attachment mode are stuck on the old mode, and we see the old allocation floating around. Which means the stopped allocation was being held as a current write claim.

I then drained one of the nodes running an allocation, intentionally neglecting to use -ignore-system. As we'd expect (ref https://github.com/hashicorp/nomad/issues/11614), the unmounting step failed because there was no node plugin available.

The job status was very interesting after that:

$ nomad job status httpd
...

Placement Failure
Task Group "web":
  * Class "vagrant": 2 nodes excluded by filter
  * Constraint "CSI volume csi-volume-nfs0 has exhausted its available writer claims": 2 nodes excluded by filter

Latest Deployment
ID          = f698e0f7
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
web         2        2       2        0          2022-01-19T16:49:45Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
4e4fdd6e  b05f0de6  web         0        stop     complete  2h28m ago  19s ago
93a9b426  1d682e06  web         0        run      running   2h28m ago  2h28m ago

The scheduler should only be worried about multi-node-multi-writer claims, for which we have no limits. That led me to this block of code in the scheduler at feasible.go#L326:

if !vol.WriteFreeClaims() {
    // Check the blocking allocations to see if they belong to this job
    for id := range vol.WriteAllocs {
        a, err := c.ctx.State().AllocByID(ws, id)
        if err != nil || a == nil ||
            a.Namespace != c.namespace || a.JobID != c.jobID {
            return false, fmt.Sprintf(
                FilterConstraintCSIVolumeInUseTemplate, vol.ID)
        }
    }
}

The WriteFreeClaims method assumes that we only ever have one mode set, which isn't the case with our stuck claim. But then the check to make sure that this job is not blocking it's own claims is hitting that a == nil condition and failing.

The raft state for these claims looks like this:

{
  "AllocationID": "399b8755-a1c5-2c2b-8d69-3e635947f154",
  "NodeID": "ef15552c-6a06-bf35-4e6b-4cb37b624e08",
  "ExternalNodeID": "",
  "Mode": 1,
  "AccessMode": "single-node-writer",
  "AttachmentMode": "file-system",
  "State": 0
}
{
  "AllocationID": "4e4fdd6e-1202-31f9-bff4-6e377672f106",
  "NodeID": "b05f0de6-7cbd-dc6f-3335-b83ad15b1696",
  "ExternalNodeID": "",
  "Mode": 1,
  "AccessMode": "multi-node-multi-writer",
  "AttachmentMode": "file-system",
  "State": 0
}
{
  "AllocationID": "93a9b426-ce6b-224c-297d-f9f0e9dc02bf",
  "NodeID": "1d682e06-3776-55cb-38b6-8d53a83cb336",
  "ExternalNodeID": "",
  "Mode": 1,
  "AccessMode": "multi-node-multi-writer",
  "AttachmentMode": "file-system",
  "State": 0
}

My next steps are:

Write up a test that passes this volume state through both the GC and the volume reaper to verify that it's not cleaning up the stale claim as we'd expect.
Test a patch to CSIVolumeDenormalizeTxn to ensure nil allocations are in the PastClaims list.
Test a patch to the feasibility checker to ensure nil allocations that don't have corresponding live PastClaims aren't blocking write claims.
I've also got a patch going to do client-driven node unpublish and another moving claims from the client side to the plan applier, but that's probably not going to be ready till next week.
I'll keep updating this issue frequently so folks know where we're at with this. Thanks again for your patience!

Very appreciative of the thorough writeup keeping us in the loop, thank you!

Repeated messages are frowned upon on GH, but I must thank you too. For both looking into this and keeping us in the loop. I and probably most, really appreciate it. Thanks again

Excellent, thank you so much @tgross for the updates, it's extremely helpful to see the the progress on this.

I've built out a test rig that recreates the state I captured, and I've written a couple of tests that pass it through the volumewatcher and the GC job. And fortunately these tests fail with exactly the same behavior I witnessed yesterday.

I've opened https://github.com/hashicorp/nomad/pull/11890 with a draft approach. If a volume has any allocations that are nil, create a "past claim" on it when we read the volume. This approach recognizes that we may never be able to fix all causes of desync between volumes and claims, because a node could potentially just up and disappear on us. This way trying to catch events from the clients is a matter of optimization and not correctness.

In manual testing of that patch, I discovered that we were not resetting the volumewatcher's ACL token on leadership transitions. That means that even once we fix the claim state, the unpublish workflow will eventually start failing unrecoverably with errors like:

    2022-01-20T19:13:22.661Z [DEBUG] core.sched: forced volume claim GC
    2022-01-20T19:13:22.662Z [DEBUG] core.sched: CSI volume claim GC scanning before cutoff index: index=18446744073709551615 csi_volume_claim_gc_threshold=5m0s
    2022-01-20T19:13:22.664Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=csi-volume-nfs0
  error=
  | 1 error occurred:
  |     * Permission denied
  |

Or, if you detach manually the error is particularly confusing because you might think it's talking about your ACL token:

    2022-01-20T19:12:57.559Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=csi-volume-nfs0
  error=
  | 1 error occurred:
  |     * Permission denied
  |

Either case leaves us in a state where volume claims have been released but the volumes are still physically mounted:

sad mount outputs

``` nomad-client2:~$ mount | grep nfs 192.168.56.69:/var/nfs/general/v/csi-volume-nfs0 on /var/nomad/data/client/csi/node/org.democratic-csi.nfs/staging/csi-volume-nfs0/rw-file-system-single-node-writer type nfs4 (rw,noatime,vers=4.2,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.56.4,local_lock=none,addr=192.168.56.69) 192.168.56.69:/var/nfs/general/v/csi-volume-nfs0 on /var/nomad/data/client/csi/node/org.democratic-csi.nfs/per-alloc/a1e552e3-e89d-9c8e-5e2a-0fe91a2041ec/csi-volume-nfs0/rw-file-system-single-node-writer type nfs4 (rw,noatime,vers=4.2,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.56.4,local_lock=none,addr=192.168.56.69) 192.168.56.69:/var/nfs/general/v/csi-volume-nfs0 on /var/nomad/data/client/csi/node/org.democratic-csi.nfs/per-alloc/a90a1a1e-328c-e312-42b3-6766536306ea/csi-volume-nfs0/rw-file-system-single-node-writer type nfs4 (rw,noatime,vers=4.2,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.56.4,local_lock=none,addr=192.168.56.69) ```

I've opened https://github.com/hashicorp/nomad/pull/11891 with a fix for that.

Next steps for https://github.com/hashicorp/nomad/pull/11890:

Do some more end-to-end testing to verify that it covers us in all the cases we can recreate.
Make the CSIVolumeDenormalize method private and force all queries of volumes to hit it at the state store level, so that we ever can't miss doing so and end up playing wack-a-mole with this bug all over the code base.

I mentioned yesterday the issue I saw with CreateVolume and the democratic-csi plugin. I've opened https://github.com/hashicorp/nomad/issues/11893 to investigate that separately.

Also, yesterday I mentioned:

I've also got a patch going to do client-driven node unpublish and another moving claims from the client side to the plan applier, but that's probably not going to be ready till next week.

I made a brief attempt at moving claims into the plan applier, but it turned out to be really awful, so that doesn't seem like the right approach after all. But the client-driven node unpublish looks feasible. I have a work-in-progress patch for that at https://github.com/hashicorp/nomad/pull/11892 but it doesn't quite work yet because the unpublish workflow doesn't tolerate it yet.

I'll be away tomorrow but will be picking all this up again on Monday.

I've spent much of today testing out the patch in https://github.com/hashicorp/nomad/pull/11890. In my testing I have yet to encounter this issue again once I've applied that patch (and https://github.com/hashicorp/nomad/pull/11891) to the cluster.

However, I'm still seeing volumes left attached unexpectedly as I described in https://github.com/hashicorp/nomad/issues/10927#issuecomment-1017920830. The nomad volume detach command doesn't seem to be working either, so my next step is to figure out if this patch breaks the unpublishing workflow unexpected or whether that's a separate problem.

I still haven't figured out what's going on with the access mode when releasing claims. It should be set to <none> so that the next claim can update it, but it's not. At this point I'm reasonably confident it's not tied to this problem though, so I'm going to split that out to a new issue: https://github.com/hashicorp/nomad/issues/11921

An interesting test case I encountered was purging a job and then immediately re-running the job. I was able to produce a case where the evaluation blocked because the allocations that were claiming writes had not yet unpublished the volume. But by the time the blocked eval was evaluated, the volume was free for scheduling and everything worked as expected without intervention. It looked like this:

$ nomad job stop -purge httpd
==> 2022-01-24T13:32:36-05:00: Monitoring evaluation "fbf2c487"
    2022-01-24T13:32:36-05:00: Evaluation triggered by job "httpd"
==> 2022-01-24T13:32:37-05:00: Monitoring evaluation "fbf2c487"
    2022-01-24T13:32:37-05:00: Evaluation within deployment: "cbe85da5"
    2022-01-24T13:32:37-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-01-24T13:32:37-05:00: Evaluation "fbf2c487" finished with status "complete"
==> 2022-01-24T13:32:37-05:00: Monitoring deployment "cbe85da5"
  ✓ Deployment "cbe85da5" successful

    2022-01-24T13:32:37-05:00
    ID          = cbe85da5
    Job ID      = httpd
    Job Version = 2
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    web         2        2       2        0          2022-01-24T18:33:01Z

$ nomad job run ./jobs/csi/httpd.nomad
==> 2022-01-24T13:32:40-05:00: Monitoring evaluation "faae38d5"
    2022-01-24T13:32:40-05:00: Evaluation triggered by job "httpd"
==> 2022-01-24T13:32:41-05:00: Monitoring evaluation "faae38d5"
    2022-01-24T13:32:41-05:00: Evaluation within deployment: "5fcc4c00"
    2022-01-24T13:32:41-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-01-24T13:32:41-05:00: Evaluation "faae38d5" finished with status "complete" but failed to place all allocations:
    2022-01-24T13:32:41-05:00: Task Group "web" (failed to place 2 allocations):
      * Class "vagrant": 3 nodes excluded by filter
      * Constraint "CSI volume csi-volume-nfs0 has exhausted its available writer claims": 3 nodes excluded by filter
    2022-01-24T13:32:41-05:00: Evaluation "aa76c24a" waiting for additional capacity to place remainder
==> 2022-01-24T13:32:41-05:00: Monitoring deployment "5fcc4c00"
  ✓ Deployment "5fcc4c00" successful

    2022-01-24T13:33:04-05:00
    ID          = 5fcc4c00
    Job ID      = httpd
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    web         2        2       2        0          2022-01-24T18:42:52Z

This is actually working as intended, but the error message is confusing here. So I've added a commit onto https://github.com/hashicorp/nomad/pull/11890 that will specifically call out "hey this is a GC'd allocation and it should resolve itself momentarily".

Quick follow-up on this:

However, I'm still seeing volumes left attached unexpectedly as I described in #10927 (comment). The nomad volume detach command doesn't seem to be working either, so my next step is to figure out if this patch breaks the unpublishing workflow unexpected or whether that's a separate problem.

There's some easy ways to end up with stray mounts if you node drain without -ignore-system, but (a) we're going to fix that separately, and (b) once I restored the node plugin I was always able to nomad node detach in that case.

But I was able to eventually reproduce this by repeatedly purging a job and immediately re-running it with the access mode changed. The goal was to make it so that publishing was happening at roughly the same time we were trying to detach the volume. This ended up with a client that had the mount for a previous version of the job, but no alloc:

$ mount | grep nfs
192.168.56.69:/var/nfs/general/v/csi-volume-nfs0 on /var/nomad/data/client/csi/node/org.democratic-csi.nfs/staging/csi-volume-nfs0/rw-file-system-single-node-writer type nfs4 (rw,noatime,vers=4.2,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.56.2,local_lock=none,addr=192.168.56.69)
192.168.56.69:/var/nfs/general/v/csi-volume-nfs0 on /var/nomad/data/client/csi/node/org.democratic-csi.nfs/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-single-node-writer type nfs4 (rw,noatime,vers=4.2,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.56.2,local_lock=none,addr=192.168.56.69)

Here are the logs for the NodePublishVolume and NodeUnpublishVolume workflows on the node plugin:

NodePublishVolume

```json {"level":"info","message":"new request - driver: ControllerNfsClientDriver method: NodeStageVolume call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.42.0\"]},\"flags\":0},\"request\":{\"publish_context\":{},\"secrets\":\"redacted\",\"volume_context\":{\"share\":\"/var/nfs/general/v/csi-volume-nfs0\",\"provisioner_driver\":\"nfs-client\",\"node_attach_driver\":\"nfs\",\"server\":\"192.168.56.69\"},\"volume_id\":\"csi-volume-nfs0\",\"staging_target_path\":\"/csi/staging/csi-volume-nfs0/rw-file-system-single-node-writer\",\"volume_capability\":{\"access_mode\":{\"mode\":\"SINGLE_NODE_WRITER\"},\"mount\":{\"mount_flags\":[\"noatime\"],\"fs_type\":\"\",\"volume_mount_group\":\"\"},\"access_type\":\"mount\"}}}","service":"democratic-csi"} {"level":"debug","message":"operation lock keys: [\"volume_id_csi-volume-nfs0\"]","service":"democratic-csi"} {"level":"verbose","message":"validating capabilities: [{\"access_mode\":{\"mode\":\"SINGLE_NODE_WRITER\"},\"mount\":{\"mount_flags\":[\"noatime\",\"defaults\"],\"fs_type\":\"\",\"volume_mount_group\":\"\"},\"access_type\":\"mount\"}]","service":"democratic-csi"} executing filesystem command: stat /csi/staging/csi-volume-nfs0/rw-file-system-single-node-writer executing mount command: findmnt --source 192.168.56.69:/var/nfs/general/v/csi-volume-nfs0 --mountpoint /csi/staging/csi-volume-nfs0/rw-file-system-single-node-writer --output source,target,fstype,label,options -b -J --nofsroot executing mount command: mount -t nfs -o noatime,defaults 192.168.56.69:/var/nfs/general/v/csi-volume-nfs0 /csi/staging/csi-volume-nfs0/rw-file-system-single-node-writer {"level":"info","message":"new response - driver: ControllerNfsClientDriver method: NodeStageVolume response: {}","service":"democratic-csi"} {"level":"info","message":"new request - driver: ControllerNfsClientDriver method: NodePublishVolume call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.42.0\"]},\"flags\":0},\"request\":{\"publish_context\":{},\"secrets\":\"redacted\",\"volume_context\":{\"node_attach_driver\":\"nfs\",\"server\":\"192.168.56.69\",\"share\":\"/var/nfs/general/v/csi-volume-nfs0\",\"provisioner_driver\":\"nfs-client\"},\"volume_id\":\"csi-volume-nfs0\",\"staging_target_path\":\"/csi/staging/csi-volume-nfs0/rw-file-system-single-node-writer\",\"target_path\":\"/csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-single-node-writer\",\"volume_capability\":{\"access_mode\":{\"mode\":\"SINGLE_NODE_WRITER\"},\"mount\":{\"mount_flags\":[\"noatime\"],\"fs_type\":\"\",\"volume_mount_group\":\"\"},\"access_type\":\"mount\"},\"readonly\":false}}","service":"democratic-csi"} {"level":"debug","message":"operation lock keys: [\"volume_id_csi-volume-nfs0\"]","service":"democratic-csi"} executing filesystem command: stat /csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-single-node-writer failed to execute filesystem command: stat /csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-single-node-writer, response: {"code":1,"stdout":"","stderr":"stat: cannot stat '/csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-single-node-writer': No such file or directory\n","timeout":false} executing filesystem command: mkdir -p -m 0750 /csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-single-node-writer executing mount command: findmnt --mountpoint /csi/staging/csi-volume-nfs0/rw-file-system-single-node-writer --output source,target,fstype,label,options -b -J --nofsroot executing mount command: findmnt --mountpoint /csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-single-node-writer --output source,target,fstype,label,options -b -J --nofsroot executing mount command: mount --bind -o noatime,defaults /csi/staging/csi-volume-nfs0/rw-file-system-single-node-writer /csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-single-node-writer {"level":"info","message":"new response - driver: ControllerNfsClientDriver method: NodePublishVolume response: {}","service":"democratic-csi"} ```

NodeUnpublishVolume

```json {"level":"info","message":"new request - driver: ControllerNfsClientDriver method: NodeUnpublishVolume call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.42.0\"]},\"flags\":0},\"request\":{\"volume_id\":\"csi-volume-nfs0\",\"target_path\":\"/csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-multi-node-multi-writer\"}}","service":"democratic-csi"} {"level":"debug","message":"operation lock keys: [\"volume_id_csi-volume-nfs0\"]","service":"democratic-csi"} executing mount command: findmnt --mountpoint /csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-multi-node-multi-writer --output source,target,fstype,label,options -b -J --nofsroot executing filesystem command: stat /csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-multi-node-multi-writer failed to execute filesystem command: stat /csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-multi-node-multi-writer, response: {"code":1,"stdout":"","stderr":"stat: cannot stat '/csi/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-multi-node-multi-writer': No such file or directory\n","timeout":false} {"level":"info","message":"new response - driver: ControllerNfsClientDriver method: NodeUnpublishVolume response: {}","service":"democratic-csi"} {"level":"info","message":"new request - driver: ControllerNfsClientDriver method: NodeUnstageVolume call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.42.0\"]},\"flags\":0},\"request\":{\"volume_id\":\"csi-volume-nfs0\",\"staging_target_path\":\"/csi/staging/csi-volume-nfs0/rw-file-system-multi-node-multi-writer\"}}","service":"democratic-csi"} {"level":"debug","message":"operation lock keys: [\"volume_id_csi-volume-nfs0\"]","service":"democratic-csi"} executing mount command: findmnt --mountpoint /csi/staging/csi-volume-nfs0/rw-file-system-multi-node-multi-writer/block_device --output source,target,fstype,label,options -b -J --nofsroot executing mount command: findmnt --mountpoint /csi/staging/csi-volume-nfs0/rw-file-system-multi-node-multi-writer --output source,target,fstype,label,options -b -J --nofsroot executing mount command: findmnt --mountpoint /csi/staging/csi-volume-nfs0/rw-file-system-multi-node-multi-writer --output source,target,fstype,label,options -b -J --nofsroot executing filesystem command: stat /csi/staging/csi-volume-nfs0/rw-file-system-multi-node-multi-writer failed to execute filesystem command: stat /csi/staging/csi-volume-nfs0/rw-file-system-multi-node-multi-writer, response: {"code":1,"stdout":"","stderr":"stat: cannot stat '/csi/staging/csi-volume-nfs0/rw-file-system-multi-node-multi-writer': No such file or directory\n","timeout":false} {"level":"info","message":"new response - driver: ControllerNfsClientDriver method: NodeUnstageVolume response: {}","service":"democratic-csi"} ```

What we can see here is that the volume was mounted at rw-file-system-single-node-writer but when we're unpublishing we're doing so at rw-file-system-multi-node-multi-writer. So we'll never be able to unmount this because we're trying to unmount it at the wrong location. This is also why trying to detach it via nomad node detach results in a silent failure. We see the log for the HTTP request and then nothing else. We just aren't logging this code path.

So there's a race where a change in the access mode can land on the client during unpublish. There's a few obvious places that bug can be happening, so I'll pick that up again tomorrow.

Ok, the race described above turns out to be pretty straightforward. In the client's (*csiHook) Postrun() method, we make an unpublish RPC that includes a claim in the CSIVolumeClaimStateUnpublishing state and using the mode from the client. But then in the (*CSIVolume) Unpublish RPC handler, we query the volume from the state store (because we only get an ID from the client). And when we make the client RPC for the node unpublish step, we use the current volume's view of the mode (ref nodeUnpublishVolumeImpl). If the volume's mode has been changed before the old allocations can have their claims released, then we end up making a CSI RPC that will never succeed.

Why does this code path get the mode from the volume and not the claim? Because the claim written by the GC job in (*CoreScheduler) csiVolumeClaimGC doesn't have a mode. Instead it just writes a claim in the unpublishing state to ensure the volumewatcher detects a "past claim" change and reaps all the claims on the volumes.

With the patch in https://github.com/hashicorp/nomad/pull/11890 we should be able to safely ignore the case of the GC claim because all the past claims will be correct, so it should be safe to update nodeUnpublishVolumeImpl to always use the claims mode (maybe falling back to the volume's mode if unset?).
We'll also fix most cases of this via https://github.com/hashicorp/nomad/pull/11892 because we'll drive the node unpublish step from the client which has the "real" view of what mode the volume was actually mounted with.
I'll take a quick pass at auditing other uses of vol.AccessMode to ensure we're not making this mistake elsewhere.

At this point I'm feeling pretty confident https://github.com/hashicorp/nomad/pull/11890 fixes the primary problem with the state store. I've just marked that as ready for review.
I've pulled out https://github.com/hashicorp/nomad/pull/11932 for the race condition fix
That will unblock the last bit of code I need to finish up https://github.com/hashicorp/nomad/pull/11892
I've also pulled out https://github.com/hashicorp/nomad/pull/11931 which includes that patch but refactors the "past claims" logic so that we can be sure the core GC job and the volumewatcher are using the same code paths as other consumers of the state store.

I've done some testing of https://github.com/hashicorp/nomad/pull/11892 and that's in good shape as well, and I've walked some team mates through the rest of the PRs so that we can get them reviewed and merged one-by-one. Unfortunately I've run into a non-CSI issue around the stop_after_client_disconnect flag https://github.com/hashicorp/nomad/issues/11943 which is going to impact a lot of CSI use cases (we created that flag specifically for stateful workloads). But we should be able to ship an incremental patch release with a bunch of the work done so far. Stay tuned for updates on when that may land as the PRs get merged.

The 4 patches for this issue (which also covers https://github.com/hashicorp/nomad/issues/10052 and https://github.com/hashicorp/nomad/issues/10833 and https://github.com/hashicorp/nomad/issues/8734) have been merged to main. We're working up a plan for a patch release shortly, at which point I'll close out this issue. Note this won't yet fix the issues in #8609 but it does make that an automatically recoverable situation. I'm going to be focusing heavily on knocking out the rest of the major CSI issues over the next month or two after we've shipped this.

Thanks again for your patience, folks.

We thank you dude! I can finally recommend nomad to people without a BUT, i've always wanted be able to do that.

This is awesome work @tgross! Highly appreciate the transparency and the depth at which you kept the community updated, this thread should become the gold standard for Organization <--> Community interactions!

Nomad 1.2.5 has shipped with the patches described here. We still have a few important CSI issues to close out before we can call CSI GA-ready, but it should be safe to close this issue now. Please let us know if you run into the issue with the current versions of Nomad. Thanks!

@tgross I recently upgrade to v1.2.6 and I'm still hitting the issue.

After a few days of the job running, attempting to deploy ends up with the dreaded "CSI volume has exhausted its available writer claims and is claimed by a garbage collected allocation". Followed by a nomad volume deregister --force ritual.

@m1keil the problem isn't entirely eliminated until 1.3.0 (which will get backported). https://github.com/hashicorp/nomad/pull/12113 should cover the remaining cases.

@tgross I have still managed to reproduce with v1.3.0-beta.1 I think this could be related to a firewall issue I had. The node plugins couldn't reach the storage backend anymore and so the mount was still around but the task allocation was long gone. Whats the expected behavior here? Maybe the volume allocation should be released after a long timeout has expired?

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
registry    1       0         0        0       0         0     0

Placement Failure
Task Group "registry":
  * Constraint "CSI volume ceph-registry has exhausted its available writer claims and is claimed by a garbage collected allocation 45c884a0-e629-b180-eb44-c3c7a5d5e7b7; waiting for claim to be released": 3 nodes excluded by filter

Latest Deployment
ID          = 687ebec8
Status      = running
Description = Deployment is running

# nomad volume status ceph-registry
ID                   = ceph-registry
Name                 = ceph-registry
External ID          = 0001-0024-22b9524c-4a2d-4a56-9f3d-221ee05905a2-0000000000000001-af535f2b-9f9a-11ec-bf1c-0242ac110003
Plugin ID            = ceph-csi
Provider             = cephfs.csi.ceph.com
Version              = v3.6.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 4
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

The node plugins couldn't reach the storage backend anymore and so the mount was still around but the task allocation was long gone. Whats the expected behavior here?

A bit more detail would be helpful. Which task allocation do you mean? The plugin or the allocation that mounted the volume? I would not have expected the allocation that mounted the volume to be actually gone because we block for a very long time retrying the unmount unless the client node was also shut down as well.

Did the node plugin for that node get restored to service? If so, the expectation is that the volumewatcher loop will eventually be able to detach it. Or is the client host itself gone too?

Maybe the volume allocation should be released after a long timeout has expired?

Unfortunately many storage providers will return errors if we do this. For example, imagine an AWS EBS volume attached to an EC2 instance where the Nomad client is shut down. We can never physically detach the volume until the client is restored, so all attempts to attach it elsewhere will fail unrecoverably. That's why we provide a nomad volume detach and nomad volume delete command, so that the operator can signal that the volume was detached out-of-band.

@tgross I mean the allocation that mounted the volume. It was gone because of a node restart and the rescheduled task remained in the stuck state since thursday, so approx. 4 days.

During the whole time the plugin state was shown as healthy in the ui, so nomad was not detecting that there was no actual connection to the backend. I guess ceph-csi just handles all Probe and GetCapabilities calls locally.

The network issue was eventually fixed, but the volume remained stuck.

That's why we provide a nomad volume detach and nomad volume delete command, so that the operator can signal that the volume was detached out-of-band.

I haven't yet had much success with detach, because for me it normally just blocks and doesn't give out any information. For a case such as the one I described it is also hard to know which node the volume was attached to previously after the allocation has been gced. Would it even work if the node has been restarted?

But yes, if such a case can't be handled automatically it would be good if a volume could be unconditionally detached by the operator even if the node has been restarted or is gone altogether. In my case I needed to deregister/reregister the volume again.

Also the deregister/reregister dance is impossible if the volume has been created with terraform, as the volume will be deleted automatically if the resource is removed.

I mean the allocation that mounted the volume. It was gone because of a node restart and the rescheduled task remained in the stuck state since thursday, so approx. 4 days.

During the whole time the plugin state was shown as healthy in the ui, so nomad was not detecting that there was no actual connection to the backend. I guess ceph-csi just handles all Probe and GetCapabilities calls locally.

Ok, so that would also have restarted the Node plugin tasks as well. When the node restarted, did the Node plugin tasks come back? The UI is maybe a little misleading here -- it shows the plugin as healthy if we can schedule volumes with it, not when 100% of plugins are healthy (otherwise we'd block volume operations during plugin deployments).

As for being stuck for 4 days... the evaluation for the rescheduled task hits a retry limit fairly quickly, so I suspect it's not "stuck" so much as "gave up." Once we've got the claim taken care of, re-running the job (or using nomad job eval) should force a retry. But let's figure out the claim problem first. 😀

I haven't yet had much success with detach, because for me it normally just blocks and doesn't give out any information.

It blocks because it's a synchronous operation; it has to make RPC calls to the Node plugin and to the Controller plugin. You should see logs on the Node plugin (first) and then the Controller plugin (second). If both the Node plugin and Controller plugin are live, that should work and you should be able to look at logs in both plugins (and likely the Nomad leader as well) to see what's going on there.

For a case such as the one I described it is also hard to know which node the volume was attached to previously after the allocation has been gced. Would it even work if the node has been restarted?

So long as it restarted and has a running Node plugin, yes that should still work just fine. But fair point about knowing which node the volume was attached to previously.

(hit submit too soon 😀 )

But yes, if such a case can't be handled automatically it would be good if a volume could be unconditionally detached by the operator even if the node has been restarted or is gone altogether.

It's not possible for us to unconditionally detach but in theory we could unconditionally free a claim. That's what deregister -force is designed to do, but yeah I realize that's not going to work out so well with Terraform.

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

hashicorp / nomad