hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

nomad volume deregister doesn't work without -force #8949

Closed hongkongkiwi closed 3 years ago

hongkongkiwi commented 3 years ago

Nomad version

Nomad v0.12.5 (514b0d667b57068badb43795103fb7dd3a9fbea7)

Operating system and Environment details

Linux builder0 4.15.0-112-generic 113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Issue

Running the nomad volume deregister command always fails with the error:

Error deregistering volume: Unexpected response code: 500 (rpc error: rpc error: volume in use: builder0)

I have stopped the job using the volume, and I have also successfully detached it with the nomad volume detach command without error.

But I still cannot deregister the volume, even though the job using it has stopped and the detach command succeeded. The only thing that works is nomad volume deregister -force.
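For reference, this is roughly the sequence I am running (the node ID below is a placeholder; substitute the client node the volume was attached to):

# Stop the only job that claims the volume
nomad job stop builder

# Detach the volume from the client node it was mounted on -- this succeeds
nomad volume detach builder0 <node_id>

# Deregistering still fails with "volume in use"
nomad volume deregister builder0

# Only forcing the deregistration works
nomad volume deregister -force builder0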

Volume Details

{
  "access_mode": "single-node-writer",
  "attachment_mode": "block-device",
  "external_id": "<my_volume_id>",
  "id": "builder0",
  "mount_options": {
    "fs_type": "ext4",
    "mount_flags": [
      "rw"
    ]
  },
  "name": "DO Vol for builder0",
  "plugin_id": "digitalocean",
  "type": "csi"
}
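For completeness, the volume was registered from a spec file and can be inspected afterwards; the file name below is only an example:

# Register the CSI volume from its spec file
nomad volume register builder0-volume.hcl

# Show the volume, including its allocation claims
nomad volume status builder0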

Job Spec (for digitalocean plugin)

job "csi_digitalocean" {
  region = "global"
  datacenters = ["dc1"]
  type = "system"
  group "monolith" {
    constraint {
      operator  = "distinct_hosts"
      value     = "true"
    }
    constraint {
      attribute = "${attr.cpu.arch}"
      operator = "="
      value = "amd64"
    }
    constraint {
      attribute = "${attr.kernel.name}"
      operator = "="
      value     = "linux"
    }
    # Only run this on digitalocean ocean droplets
    # e.g. droplets with a droplet_id
    constraint {
      attribute = "${meta.droplet_id}"
      operator = "is_set"
    }
    # Use nomad_storage_drivers list to control which servers these are applied to
    constraint {
      attribute = "${meta.nomad_storage_drivers}"
      operator = "is_set"
    }
    constraint {
      attribute = "${meta.nomad_storage_drivers}"
      operator = "set_contains"
      value = "digitalocean"
    }
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    task "plugin" {
      driver = "docker"
      config {
        image = "digitalocean/do-csi-plugin:v1.3.0"
        privileged = true
        args = [
          "--endpoint=unix:///var/run/csi.sock",
          "--token=<do_token>",
          "--url=https://api.digitalocean.com/"
        ]
      }
      csi_plugin {
        id        = "digitalocean"
        type      = "monolith"
        mount_dir = "/var/run"
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
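For reference, the plugin job is deployed and checked roughly like this (the job file name is only an example):

# Deploy the monolith CSI plugin job
nomad job run csi_digitalocean.nomad

# Confirm the plugin registered and reports healthy controllers and nodes
nomad plugin status digitalocean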

Job Spec (for job using volume)

job "builder" {
  group "builder" {
    volume "data_mount" {
      type      = "csi"
      source    = "builder0"
      read_only = false
    }
    task "builder" {
      volume_mount {
        volume      = "data_mount"
        destination = "/data"
        read_only   = false
      }
---- end snip ----
hongkongkiwi commented 3 years ago

It seems similar to #8100; however, that issue shows as fixed in v0.12.2, while my issue persists on v0.12.5, so I don't think this is a duplicate.

tgross commented 3 years ago

@hongkongkiwi does the volume have allocation claims? nomad volume status :id should list the allocations.

hongkongkiwi commented 3 years ago

Yes, it has allocation claims. These claims do not disappear when the only job using the volume is stopped. In fact, there is no way to remove the claims, whether by detaching, stopping the job, or any other method. They stay there and block deregistration until -force is used.

tgross commented 3 years ago

Sorry to hear that, @hongkongkiwi. At this point I thought we'd fixed all the ways in which we could get these un-reclaimable volumes... obviously I've missed something though.

It might help me diagnose if you were to run nomad system gc and then give me logs either from the leader (if the GC executes against the volume) or one of the servers where the GC eval runs (if the GC doesn't execute against the volume).
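Roughly something like the following should capture it (the -server-id value is a placeholder; point it at the leader or whichever server handles the GC eval):

# Stream DEBUG-level logs from a server while the GC runs
nomad monitor -log-level=DEBUG -server-id=<server_id>

# In another terminal, force a garbage collection pass
nomad system gc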

chebelom commented 3 years ago

Hi @tgross, we are in the same situation. We noticed that sometimes Nomad cannot place a job, complaining that the available claims for the CSI volume are exhausted. Looking at the status of the volume, we noticed a zombie allocation that Nomad knows nothing about (that could be due to a nomad system gc we performed while trying to understand the situation). To resolve the situation we have to force the deregistration of the volume and register it again; at that point Nomad can successfully restart our jobs.
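The workaround looks roughly like this (the spec file name is just an example for the volume definition we keep on disk):

# Force-remove the stuck volume registration
nomad volume deregister -force zookeeper-data-1

# Register the volume again from its spec file
nomad volume register zookeeper-data-1.hcl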

We think this happens whenever an allocation that uses a CSI volume fails, cannot stop cleanly, or the underlying node dies. Below you can find some more information.

job status placement error

Constraint CSI volume zookeeper-data-1 has exhausted its available writer claims filtered 1 node

nomad volume status zookeeper-data-1

ID                   = zookeeper-data-1
Name                 = zookeeper-data-1
External ID          = <redacted>
Plugin ID            = hcloud-volumes
Provider             = csi.hetzner.cloud
Version              = 1.5.1
Schedulable          = true
Controllers Healthy  = 9
Controllers Expected = 10
Nodes Healthy        = 9
Nodes Expected       = 10
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID        Node ID   Task Group   Version  Desired  Status  Created    Modified
8cafab90  0eec7d7b  zookeeper-1  1        stop     failed  6d19h ago  2h12m ago

nomad alloc status 8cafab90

No allocation(s) with prefix or id "8cafab90" found

nomad server(s) monitor

2020-10-19T14:04:27.096+0200 [DEBUG] core.sched: forced job GC
2020-10-19T14:04:27.096+0200 [DEBUG] core.sched: forced eval GC
2020-10-19T14:04:27.097+0200 [DEBUG] core.sched: forced deployment GC
2020-10-19T14:04:27.097+0200 [DEBUG] core.sched: forced plugin GC
2020-10-19T14:04:27.097+0200 [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=18446744073709551615 csi_plugin_gc_threshold=1h0m0s
2020-10-19T14:04:27.115+0200 [ERROR] core.sched: failed to GC plugin: plugin_id=hcloud-volumes error="rpc error: plugin in use"
tgross commented 3 years ago

Thanks for providing those logs, @chebelom; they helped me realize we were detecting the "plugin in use" error message incorrectly during plugin GC, which prevents the volume GC from running when you run nomad system gc. That will be fixed in https://github.com/hashicorp/nomad/pull/9141. (That being said, the volume GC should also run every 5m on its own.)

Still working on the fix for the nil-alloc scenario.

tgross commented 3 years ago

The WIP PR is https://github.com/hashicorp/nomad/pull/9239, and I'm looking to ship this in the 1.0 final.

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.