hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Constraint "CSI volume has exhausted its available writer claims": 1 nodes excluded by filter #10927

Closed gregory112 closed 2 years ago

gregory112 commented 3 years ago

Nomad version

Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)

Operating system and Environment details

Ubuntu 20.04 LTS

Issue

Cannot re-plan jobs because the CSI volumes are still being claimed. I have seen many variations of this issue and I don't know how to debug it. I use the ceph-csi plugin, deployed as a system job on my two Nomad nodes, which results in two controllers and two ceph-csi nodes. I then create a few volumes using the nomad volume create command, and a job with three tasks that use three volumes. Sometimes the job fails after a while and I stop it. After that, when I try to re-plan the exact same job, I get that error.

What confuses me is the warning. It differs every time I run job plan. First I saw

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-data has exhausted its available writer claims": 2 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-data has exhausted its available writer claims": 2 nodes excluded by filter

Then, running job plan again a few seconds later, I got

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-datalog has exhausted its available writer claims": 2 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-datalog has exhausted its available writer claims": 2 nodes excluded by filter

Then again,

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-data has exhausted its available writer claims": 1 nodes excluded by filter
    * Constraint "CSI volume zookeeper1-datalog has exhausted its available writer claims": 1 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-datalog has exhausted its available writer claims": 2 nodes excluded by filter

I have three groups: zookeeper1, zookeeper2, and zookeeper3, each using two volumes (data and datalog). I will just assume from this log that all volumes are non-reclaimable.

This is the output of nomad volume status.

Container Storage Interface
ID                  Name                Plugin ID  Schedulable  Access Mode
zookeeper1-data     zookeeper1-data     ceph-csi   true         single-node-writer
zookeeper1-datalog  zookeeper1-datalog  ceph-csi   true         single-node-writer
zookeeper2-data     zookeeper2-data     ceph-csi   true         single-node-writer
zookeeper2-datalog  zookeeper2-datalog  ceph-csi   true         single-node-writer
zookeeper3-data     zookeeper3-data     ceph-csi   true         <none>
zookeeper3-datalog  zookeeper3-datalog  ceph-csi   true         <none>

It says that they are schedulable. This is the output of nomad volume status zookeeper1-datalog:

ID                   = zookeeper1-datalog
Name                 = zookeeper1-datalog
External ID          = 0001-0024-72f28a72-0434-4045-be3a-b5165287253f-0000000000000003-72ec315b-e9f5-11eb-8af7-0242ac110002
Plugin ID            = ceph-csi
Provider             = cephfs.nomad.example.com
Version              = v3.3.1
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 2
Nodes Expected       = 2
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

It says there are no allocations placed.

Reproduction steps

This is unfortunately flaky, but it most likely happens after a job fails, is stopped, and is then re-planned. The problem persists even after I purge the job with nomad job stop -purge. Running nomad system gc, nomad system reconcile summaries, or restarting Nomad does not help.

Expected Result

I should be able to reclaim the volume again without having to detach it, or deregister -force and register it again. I created the volumes using nomad volume create, so their external IDs are all generated. There are 6 volumes and 2 nodes; I don't want to type detach 12 times every time this happens (and it happens frequently).
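
For reference, the manual cleanup I end up doing is roughly the loop below. This is only a sketch: the volume IDs match the job above, but the node IDs are placeholders for whatever nomad node status reports in your cluster.

# Hypothetical cleanup loop; node-id-1 and node-id-2 are placeholders.
for vol in zookeeper1-data zookeeper1-datalog \
           zookeeper2-data zookeeper2-datalog \
           zookeeper3-data zookeeper3-datalog; do
  for node in node-id-1 node-id-2; do
    nomad volume detach "$vol" "$node" || true
  done
done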

Actual Result

See error logs above.

Job file (if appropriate)

I have three groups (zookeeper1, zookeeper2, zookeeper3), each with volume stanzas like this (each group references its own volumes; this one is for zookeeper2):

    volume "data" {
      type = "csi"
      read_only = false
      source = "zookeeper2-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "zookeeper2-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

All groups have count = 1.

bfqrst commented 3 years ago

[Nomad 1.1.2]

I have similar issues using the AWS EBS plugin, and I wholeheartedly share the sentiment that it's really hard to reproduce or debug, which is why I've been reluctant to open an issue. In my case the Nomad workers are spot instances which come and go. Non-CSI jobs get rescheduled fine, but those with CSI volumes attached tend to go red on the first reschedule and then eventually succeed. It's almost like the CSI mechanism needs more time to do its thing before the job gets restarted...

gregory112 commented 3 years ago

I have been investigating this for a while. A manual solution that works is to deregister the volume with -force, or at least to detach it with nomad volume detach. Deregistering means you need to register the volume again, and if you used nomad volume create as I did, you will have to look up the external ID and re-register the volume with that same external ID. So nomad volume detach is probably the most straightforward approach. I've learned that what it actually does is unmount the volume's mount point on the particular node, and this sometimes hangs. When it hangs, I see an umount process on the node hanging indefinitely. Network connectivity is never fully reliable, so I think cases like this need attention.

I don't know what Nomad does with unused volumes after the job is stopped. Does Nomad instruct the plugin to unmount the volumes from the nodes? And what happens if that fails? Maybe a timeout, or at least more log messages, should be added to nomad alloc status.

umount often hangs, especially when mounting network file systems like this, and when umount hangs, nomad volume detach hangs too. The fix in my case is to kill the umount process on the node and retry with umount -f, but I don't know whether Nomad can do this, as I think it may be managed by the CSI plugin. Still, given the variation across the many different CSI plugins, I think cases like this should be handled by Nomad somehow.
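
For the record, the manual recovery I do on an affected node looks roughly like this. It is only a sketch: the PID and mount path are placeholders you have to look up on the node, and whether umount -f is safe depends on your file system.

# Find and kill the hung umount, force the unmount, then let Nomad release the claim.
ps aux | grep '[u]mount'                 # locate the hung umount and note its PID
sudo kill -9 <pid>                       # placeholder PID from the previous step
sudo umount -f <csi-staging-mount-path>  # placeholder path of the stuck CSI mount
nomad volume detach zookeeper1-datalog <node-id>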

konsti commented 3 years ago

We have the very same issue with Nomad 1.1.3 and AWS EBS CSI Plugin v1.2.0. We need to revert to host_volumes as this issue threatens our cluster health when important stateful services like Prometheus can't restart correctly.

The volume is currently unmounted in AWS and shows state = "Available", so it's clearly an internal Nomad state issue.

gregory112 commented 3 years ago

Well, this is bad, as I don't think the issue is solved by the Nomad garbage collector. When a volume hangs on one Nomad node, Nomad will just allocate the job to another node, and when it hangs there, it will allocate the job to yet another node. Imagine having 100 Nomad nodes with volumes stuck on some of them.

JanMa commented 3 years ago

I am seeing the same issues using the gcp-compute-persistent-disk-csi-driver plugin and Nomad v1.1.1. Interestingly, the Read Volume API endpoint shows the correct number of writers, but the List Volumes endpoint shows too many CurrentWriters. I guess this is what's stopping the reassignment of the volume.
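
For anyone who wants to verify the same discrepancy, this is roughly how I compared the two endpoints. The jq field names (CurrentWriters, WriteAllocs) are what I saw in the responses on my version; treat them as assumptions and check your own API output.

# List Volumes: per-volume stubs with claim counters
curl -s "http://127.0.0.1:4646/v1/volumes?type=csi" | jq '.[] | {ID, CurrentWriters}'

# Read Volume: the full record with the actual write allocations
curl -s "http://127.0.0.1:4646/v1/volume/csi/<volume-id>" | jq '{ID, WriteAllocs: ((.WriteAllocs // {}) | keys)}'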

Thunderbottom commented 3 years ago

We've faced this issue as well. It seems like Nomad fails to realise that there are no allocations present for the volume.

From what we could make out, both AWS and the CSI plugin report that the volume is available to mount:

$ nomad volume status -verbose -namespace=workspace workspace
ID                   = workspace
Name                 = workspace
External ID          = vol-xxxxxxxxxxxxx
Plugin ID            = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v0.10.1
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 2
Nodes Expected       = 2
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = workspace

And the AWS console reports the volume is "Available".

In the Nomad UI, however, the "Storage" tab reports multiple #Allocs for the aforementioned volume, while clicking on the volume shows neither read nor write mounts. This supports my hunch that Nomad somehow isn't aware that the volume has been detached by the driver and is no longer in use by any allocation. Here's a snippet of the CSI controller logs for that volume:

I0816 06:26:49.482851       1 controller.go:329] ControllerUnpublishVolume: called with args {VolumeId:vol-XXXXXXXXXX NodeId:i-YYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0816 06:26:56.194278       1 controller.go:346] ControllerUnpublishVolume: volume vol-XXXXXXXXXX detached from node i-YYYYYYYYYYY

One thing I have noticed is that on a newly registered CSI volume, the metadata shows up as:

Access Mode          = <none>
Attachment Mode      = <none>

While on the volume that is "Schedulable" (not really, as far as Nomad is concerned) it shows up as:

Access Mode          = single-node-writer
Attachment Mode      = file-system

Could be pointing to something, perhaps?

The only solution that has worked for us so far is to -force deregister the volume and re-register it. Would love to have this issue solved. Let me know if I can help with any more information.

lgfa29 commented 3 years ago

Thank you all for the detailed information (keep them coming if you have more!).

It seems like this issue happens sporadically when allocations that use a volume are restarted. I will try to create a high churn environment and see if I can reproduce it.

lgfa29 commented 3 years ago

Hi everyone, just a quick update. I've been running a periodic job for a couple of days now, and so far I haven't seen any issues.

In hindsight I probably should've used a service job, since they are likely more common and they have somewhat different scheduling logic compared to batch jobs.

For those who have seen this issue, do you have any kind of update block set? This could affect how new allocations are created and I want to make sure I have a good reproduction environment.

Thanks!

JanMa commented 3 years ago

Hey @lgfa29, you can try out the following job. It's a basic Prometheus deployment without any special update block. I regularly have issues when trying to update this job, for example to add more resources.

job "prometheus" {
  datacenters = ["dc1"]
  type = "service"

  group "monitoring" {
    count = 2

    constraint {
      operator = "distinct_hosts"
      value = "true"
    }

    volume "data" {
      type            = "csi"
      source          = "prometheus-disk"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      per_alloc       = true
    }

    network {
      port "http" {
        static = 9090
      }
    }

    service {
      name = "prometheus2"
      tags = ["prometheus2"]
      task = "prometheus"
      port = "http"

      check {
        type = "http"
        port = "http"
        path = "/-/ready"
        interval = "10s"
        timeout = "5s"
      }
    }

    task "prometheus" {
      driver = "docker"
      user = "root"

      volume_mount {
        volume      = "data"
        destination = "/prometheus"
      }

      resources {
        memory = 1024
        cpu = 1024
      }

      template {
        data = <<EOT
---
global:
  scrape_interval: 10s
  external_labels:
    __replica__: "{{ env "NOMAD_ALLOC_ID" }}"
scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 10s
    consul_sd_configs:
      - server: "{{ env "attr.unique.network.ip-address" }}:8500"
        services:
          - prometheus2
    relabel_configs:
      - source_labels: ["__meta_consul_node"]
        regex: "(.*)"
        target_label: "node"
        replacement: "$1"
      - source_labels: ["__meta_consul_service_id"]
        regex: "(.*)"
        target_label: "instance"
        replacement: "$1"
      - source_labels: ["__meta_consul_dc"]
        regex: "(.*)"
        target_label: "datacenter"
        replacement: "$1"
  # Nomad metrics
  - job_name: "nomad_metrics"
    consul_sd_configs:
      - server: "{{ env "attr.unique.network.ip-address" }}:8500"
        services: ["nomad-client", "nomad"]
    relabel_configs:
      - source_labels: ["__meta_consul_tags"]
        regex: "(.*)http(.*)"
        action: keep
      - source_labels: ["__meta_consul_service"]
        regex: "(.*)"
        target_label: "job"
        replacement: "$1"
      - source_labels: ["__meta_consul_node"]
        regex: "(.*)"
        target_label: "node"
        replacement: "$1"
      - source_labels: ["__meta_consul_service_id"]
        regex: "(.*)"
        target_label: "instance"
        replacement: "$1"
      - source_labels: ["__meta_consul_dc"]
        regex: "(.*)"
        target_label: "datacenter"
        replacement: "$1"
    scrape_interval: 5s
    metrics_path: /v1/metrics
    params:
      format: ["prometheus"]
  # Consul metrics
  - job_name: "consul_metrics"
    consul_sd_configs:
      - server: "{{ env "attr.unique.network.ip-address" }}:8500"
        services: ["consul-agent"]
    relabel_configs:
      - source_labels: ["__meta_consul_tags"]
        regex: "(.*)http(.*)"
        action: keep
      - source_labels: ["__meta_consul_service"]
        regex: "(.*)"
        target_label: "job"
        replacement: "$1"
      - source_labels: ["__meta_consul_node"]
        regex: "(.*)"
        target_label: "node"
        replacement: "$1"
      - source_labels: ["__meta_consul_service_id"]
        regex: "(.*)"
        target_label: "instance"
        replacement: "$1"
      - source_labels: ["__meta_consul_dc"]
        regex: "(.*)"
        target_label: "datacenter"
        replacement: "$1"
    scrape_interval: 5s
    metrics_path: /v1/agent/metrics
    params:
      format: ["prometheus"]
EOT
        destination = "local/prometheus.yml"
      }
      config {
        image = "quay.io/prometheus/prometheus"
        ports = ["http"]
        args = [
          "--config.file=${NOMAD_TASK_DIR}/prometheus.yml",
          "--log.level=info",
          "--storage.tsdb.retention.time=1d",
          "--storage.tsdb.path=/prometheus",
          "--web.console.libraries=/usr/share/prometheus/console_libraries",
          "--web.console.templates=/usr/share/prometheus/consoles"
        ]
      }
    }
  }
}

gregory112 commented 3 years ago

For those who have seen this issue, do you have any kind of update block set? This could affect how new allocations are created and I want to make sure I have a good reproduction environment.

Mine has.

  update {
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "15m"
    progress_deadline = "20m"
  }

You can also try making the job fail, for example through an image pull failure or a runtime failure that pushes the deployment past its deadline. That is what really makes the error show up frequently. I also have a service block set up to provide health checks.

Yes, please try with a service job. What made the error show up in my case was task failure, and stopping and re-planning the job. Re-planning without stopping can sometimes trigger the issue too, for example when updating the image version used by the job.

Also, I have two master (server) nodes. I don't know if this contributes, but does anyone else seeing this also run two master nodes, or maybe more? Could it be a race condition between the masters? Split brain, perhaps?

Also, here's mine

variables {
  mycluster_zookeeper_image = "zookeeper:3.6"
}

job "mycluster-zookeeper" {
  region = "global"
  datacenters = ["dc1"]
  type = "service"

  update {
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "15m"
    progress_deadline = "20m"
  }

  group "zookeeper1" {
    count = 1

    restart {
      interval = "10m"
      attempts = 2
      delay    = "10s"
      mode     = "fail"
    }

    network {
      mode = "cni/weave"
    }

    service {
      name = "mycluster-zookeeper1-peer"
      port = "2888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper1-leader"
      port = "3888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper1-client"
      port = "2181"
      address_mode = "alloc"
    }

    volume "data" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper1-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper1-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "zookeeper" {
      driver = "docker"

      volume_mount {
        volume = "data"
        destination = "/data"
        read_only = false
      }
      volume_mount {
        volume = "datalog"
        destination = "/datalog"
        read_only = false
      }

      config {
        image = var.mycluster_zookeeper_image
        image_pull_timeout = "10m"
      }

      template {
        destination = "local/zookeeper.env"
        data = <<EOH
ZOO_SERVERS=server.1:0.0.0.0:2888:3888;2181 server.2:{{ with service "mycluster-zookeeper2-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.3:{{ with service "mycluster-zookeeper3-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181
EOH
        env = true
      }

      env {
        ZOO_MY_ID = "1"
        ZOO_CFG_EXTRA = "minSessionTimeout=10000"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }
  }

  group "zookeeper2" {
    count = 1

    restart {
      interval = "10m"
      attempts = 2
      delay    = "10s"
      mode     = "fail"
    }

    network {
      mode = "cni/weave"
    }

    service {
      name = "mycluster-zookeeper2-peer"
      port = "2888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper2-leader"
      port = "3888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper2-client"
      port = "2181"
      address_mode = "alloc"
    }

    volume "data" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper2-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper2-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "zookeeper" {
      driver = "docker"

      volume_mount {
        volume = "data"
        destination = "/data"
        read_only = false
      }
      volume_mount {
        volume = "datalog"
        destination = "/datalog"
        read_only = false
      }

      config {
        image = var.mycluster_zookeeper_image
        image_pull_timeout = "10m"
      }

      template {
        destination = "local/zookeeper.env"
        data = <<EOH
ZOO_SERVERS=server.1:{{ with service "mycluster-zookeeper1-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.2:0.0.0.0:2888:3888;2181 server.3:{{ with service "mycluster-zookeeper3-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181
EOH
        env = true
      }

      env {
        ZOO_MY_ID = "2"
        ZOO_CFG_EXTRA = "minSessionTimeout=10000"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }
  }

  group "zookeeper3" {
    count = 1

    restart {
      interval = "10m"
      attempts = 2
      delay    = "10s"
      mode     = "fail"
    }

    network {
      mode = "cni/weave"
    }

    service {
      name = "mycluster-zookeeper3-peer"
      port = "2888"
      address_mode = "alloc"

      #connect {
      #  sidecar_service {
      #    disable_default_tcp_check = true
      #
      #    proxy {
      #      upstreams {
      #        destination_name = "mycluster-zookeeper1-peer"
      #        local_bind_port = "12888"
      #      }
      #      upstreams {
      #        destination_name = "mycluster-zookeeper2-peer"
      #        local_bind_port = "22888"
      #      }
      #    }
      #  }
      #}
    }

    service {
      name = "mycluster-zookeeper3-leader"
      port = "3888"
      address_mode = "alloc"

      #connect {
      #  sidecar_service {
      #    disable_default_tcp_check = true
      #
      #    proxy {
      #      upstreams {
      #        destination_name = "mycluster-zookeeper1-leader"
      #        local_bind_port = "13888"
      #      }
      #      upstreams {
      #        destination_name = "mycluster-zookeeper2-leader"
      #        local_bind_port = "23888"
      #      }
      #    }
      #  }
      #}
    }

    service {
      name = "mycluster-zookeeper3-client"
      port = "2181"
      address_mode = "alloc"

      #connect {
      #  sidecar_service {
      #    disable_default_tcp_check = true
      #    #proxy {
      #    #  upstreams {
      #    #    destination_name = "mycluster-zookeeper1-client"
      #    #    local_bind_port = "12181"
      #    #  }
      #    #  upstreams {
      #    #    destination_name = "mycluster-zookeeper2-client"
      #    #    local_bind_port = "22181"
      #    #  }
      #    #  upstreams {
      #    #    destination_name = "mycluster-zookeeper3-client"
      #    #    local_bind_port = "32181"
      #    #  }
      #    #}
      #  }
      #}
    }

    volume "data" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper3-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper3-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "zookeeper" {
      driver = "docker"

      volume_mount {
        volume = "data"
        destination = "/data"
        read_only = false
      }
      volume_mount {
        volume = "datalog"
        destination = "/datalog"
        read_only = false
      }

      config {
        image = var.mycluster_zookeeper_image
        image_pull_timeout = "10m"
      }

      template {
        destination = "local/zookeeper.env"
        data = <<EOH
ZOO_SERVERS=server.1:{{ with service "mycluster-zookeeper1-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.2:{{ with service "mycluster-zookeeper2-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.3:0.0.0.0:2888:3888;2181
EOH
        env = true
      }

      env {
        ZOO_MY_ID = "3"
        ZOO_CFG_EXTRA = "minSessionTimeout=10000"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }
  }
}

And here's the HCL file I used to create all the volumes with nomad volume create:

id = "mycluster-zookeeper1-data"
name = "zookeeper1-data"
type = "csi"
plugin_id = "ceph-csi"
capacity_min = "1G"
capacity_max = "10G"

capability {
  access_mode = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
  mount_flags = ["noatime"]
}

secrets {
  adminID = "redacted"
  adminKey = "redacted"
}

parameters {
  clusterID = "redacted"
  fsName = "myfsname"
  mounter = "kernel"
}
lgfa29 commented 3 years ago

Thanks for the sample jobs @JanMa and @gregory112.

It seems like update is not part of the problem, so I created this job that has a 10% chance of failing. I will leave it running and see if the issue is triggered.

job "random-fail" {
  datacenters = ["dc1"]
  type        = "service"

  group "random-fail" {
    volume "ebs-vol" {
      type            = "csi"
      read_only       = false
      source          = "ebs-vol"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "random-fail" {
      driver = "docker"

      config {
        image   = "alpine:3.14"
        command = "/bin/ash"
        args    = ["/local/script.sh"]
      }

      template {
        data = <<EOF
#!/usr/bin/env bash

while true;
do
  echo "Rolling the dice..."

  n=$(($RANDOM % 10))
  echo "Got ${n}!"

  if  [[ 0 -eq ${n} ]];
  then
    echo "Bye :wave:"
    exit 1;
  fi

  echo "'Til the next round."
  sleep 10;
done
EOF

        destination = "local/script.sh"
      }

      volume_mount {
        volume      = "ebs-vol"
        destination = "/volume"
        read_only   = false
      }
    }
  }
}
raven-oscar commented 3 years ago

For some reason Nomad thinks the volume is still in use when it is not. nomad volume deregister returns "Error deregistering volume: Unexpected response code: 500 (rpc error: volume in use: nessus)". nomad volume deregister -force followed by nomad system gc, and then registering the volume again, seems to help.
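
Spelled out, that sequence looks roughly like this (the volume ID is from my setup, and the spec file is a placeholder for whatever you originally registered the volume with):

nomad volume deregister -force nessus
nomad system gc
nomad volume register nessus-volume.hcl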

lgfa29 commented 3 years ago

I think I was able to reproduce this with a self-updating job:

job "update" {
  datacenters = ["dc1"]
  type        = "service"

  group "update" {
    volume "ebs-vol" {
      type            = "csi"
      read_only       = false
      source          = "ebs-vol2"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "update" {
      driver = "docker"

      config {
        image   = "alpine:3.14"
        command = "/bin/sh"
        args    = ["/local/script.sh"]
      }

      template {
        data = <<EOF
#!/usr/bin/env bash

apk add curl jq

while true;
do
  sleep 10
  jobspec=$(curl http://172.17.0.1:4646/v1/job/${NOMAD_JOB_ID})
  cpu=$(echo $jobspec | jq '.TaskGroups[0].Tasks[0].Resources.CPU')
  if [ $cpu -eq 500 ]
  then
    cpu=600
  else
    cpu=500
  fi
  new_jobspec=$(echo $jobspec | jq ".TaskGroups[0].Tasks[0].Resources.CPU = ${cpu}")
  echo $new_jobspec | jq '{"Job":.}' | curl -H "Content-Type: application/json" -X POST --data @- http://172.17.0.1:4646/v1/job/${NOMAD_JOB_ID}
done
EOF

        destination = "local/script.sh"
      }

      volume_mount {
        volume      = "ebs-vol"
        destination = "/volume"
        read_only   = false
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

This job will usually fail first, and then work.


It's not exactly the same error message, but maybe it's related:

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: CSI.ControllerAttachVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = Could not attach volume "vol-0fa6e88e4c3b55d29" to node "i-02396ab34ed318944": attachment of disk "vol-0fa6e88e4c3b55d29" failed, expected device to be attached but was detaching

Looking at the volume details page, I see the previous allocation still listed under Write Allocations for a while, even after it's marked as Complete. I don't know if this is related, since another alloc is able to start, but it may be a pointer in the right direction.

NathanFlurry commented 3 years ago

This nasty workaround seems to be working for DigitalOcean. If your task restarts frequently, it will spam your cluster with jobs, so be careful with that.

your-job.nomad.tpl:

# ...
        task "reregister_volume" {
            lifecycle {
                hook = "poststop"
                sidecar = false
            }

            driver = "docker"

            config {
                image = "alpine:3.14"
                entrypoint = [
                    "/bin/sh",
                    "-eufc",
                    <<-EOF
                    apk add curl

                    curl --fail --data '@-' -X POST \
                        "http://$${attr.unique.network.ip-address}:4646/v1/job/reregister-volume/dispatch" <<-EndOfData
                        {
                            "Meta": {
                                "csi_id": "${csi_id_prefix}[$${NOMAD_ALLOC_INDEX}]",
                                "csi_id_uri_component": "${csi_id_prefix}%5B$${NOMAD_ALLOC_INDEX}%5D",
                                "volume_name": "${volume_name_prefix}$${NOMAD_ALLOC_INDEX}",
                                "volume_plugin_id": "${volume_plugin_id}"
                            }
                        }
                        EndOfData
                    EOF
                ]
            }
        }
# ...

register-volume.nomad.tpl:

job "reregister-volume" {
    type = "batch"

    parameterized {
        payload = "forbidden"
        meta_required = ["csi_id", "csi_id_uri_component", "volume_name", "volume_plugin_id"]
    }

    group "reregister" {
        task "reregister" {
            driver = "docker"

            config {
                image = "alpine:3.14"
                entrypoint = [
                    "/bin/sh",
                    "-eufc",
                    <<-EOF
                    sleep 5 # Wait for job to stop

                    apk add jq curl

                    echo "CSI_ID=$${NOMAD_META_CSI_ID}"
                    echo "CSI_ID_URI_COMPONENT=$${NOMAD_META_CSI_ID_URI_COMPONENT}"
                    echo "VOLUME_NAME=$${NOMAD_META_VOLUME_NAME}"
                    echo "VOLUME_PLUGIN_ID=$${NOMAD_META_VOLUME_PLUGIN_ID}"

                    n=0
                    until [ "$n" -ge 15 ]; do
                        echo "> Checking if volume exists (attempt $n)"
                        curl --fail -X GET \
                            "http://$${attr.unique.network.ip-address}:4646/v1/volumes?type=csi" \
                            | jq -e '. | map(.ID == "$${NOMAD_META_CSI_ID}") | any | not' && break
                        n=$((n+1))

                        sleep 1
                        echo
                        echo '> Force detaching volume'
                        curl --fail -X DELETE \
                            "http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}?force=true" \
                            || echo '  Detaching failed'
                    done
                    if [ "$n" -ge 15 ]; then
                        echo '  Deregister failed too many times, giving up'
                        exit 0
                    else
                        echo '  Deregister complete'
                    fi

                    echo
                    echo '> Fetching external volume ID'
                    VOLUME_JSON=$(
                        curl --fail -X GET \
                            -H "Authorization: Bearer ${digitalocean_token_persistent}" \
                            "https://api.digitalocean.com/v2/volumes?name=$${NOMAD_META_VOLUME_NAME}" \
                            | jq '.volumes[0]'
                    )
                    VOLUME_ID=$(
                        echo "$VOLUME_JSON" | jq -r '.id'
                    )
                    VOLUME_REGION=$(
                        echo "$VOLUME_JSON" | jq -r '.region.slug'
                    )
                    VOLUME_DROPLET_ID=$(
                        echo "$VOLUME_JSON" | jq -r '.droplet_ids[0] // empty'
                    )
                    echo "VOLUME_ID=$VOLUME_ID"
                    echo "VOLUME_REGION=$VOLUME_ID"
                    echo "VOLUME_DROPLET_ID=$VOLUME_DROPLET_ID"

                    if [ ! -z "$VOLUME_DROPLET_ID" ]; then
                        echo
                        echo '> Detaching volume on DigitalOcean'
                        curl --fail -X POST \
                            -H "Authorization: Bearer ${digitalocean_token_persistent}" \
                            -d "{\"type\": \"detach\", \"droplet_id\": \"$VOLUME_DROPLET_ID\", \"region\": \"$VOLUME_REGION\"}" \
                            "https://api.digitalocean.com/v2/volumes/$VOLUME_ID/actions"
                    fi

                    echo
                    echo '> Re-registering volume'
                    curl --fail --data '@-' -X PUT \
                        "http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}" <<-EndOfVolume
                        {
                            "Volumes": [
                                {
                                    "ID": "$${NOMAD_META_CSI_ID}",
                                    "Name": "$${NOMAD_META_VOLUME_NAME}",
                                    "ExternalID": "$VOLUME_ID",
                                    "PluginID": "$${NOMAD_META_VOLUME_PLUGIN_ID}",
                                    "RequestedCapabilities": [{
                                        "AccessMode": "single-node-writer",
                                        "AttachmentMode": "file-system"
                                    }]
                                }
                            ]
                        }
                        EndOfVolume

                    echo
                    echo '> Reading volume'
                    curl --fail -X GET \
                        "http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}"

                    echo
                    echo 'Finished'
                    EOF
                ]
            }
        }
    }
}
latifrons commented 3 years ago

Same issue on my cluster with the EFS plugin (docker: amazon/aws-efs-csi-driver:v1.3.3). It usually happens after a restart of some node: no Write Allocations, no Read Allocations, yet mounting fails because the CSI volume has exhausted its available writer claims.

One thing worth mentioning: right after force-deregistering the volume, my Node Health recovered from (4/5) to (5/5). The EFS plugin could not be deployed to that node before deregistering.

aroundthfur commented 3 years ago

[Nomad v1.1.3]

We have the exact same issue with the Linstor CSI driver (https://linbit.com/linstor/). Our use case is to fail over a job when a server or a DC dies. We are still in the implementation phase, so we were intentionally killing servers where Nomad jobs were running in order to reproduce the fail-over scenarios.

It works a few times and Nomad does a proper job, but after a random number of reschedules it fails with the same error message and the same behavior described in this issue. In the Nomad UI volume list the volume has an allocation set, but when we go into the detailed view there are no allocs present. Basically the same behavior that @Thunderbottom reported.

MagicRB commented 3 years ago

Yep, I'm running a really basic setup with just one server/client and one client, and this happens to me too. I need to -force deregister the volume and then re-register it to fix it.

latifrons commented 3 years ago

Just a kind reminder for those who are using AWS EFS: make sure you are using the mount options suggested by AWS: https://docs.aws.amazon.com/efs/latest/ug/efs-mount-helper.html

Especially the noresvport flag.

Otherwise it will cause infinite I/O wait (100% wa in top) once the TCP connection breaks.

We didn't observe any more problems after correcting the mount options.
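
For reference, those flags can be expressed in the Nomad volume spec roughly like this. It's only a sketch based on the NFS options AWS documents for its mount helper; double-check the current AWS docs and whether the EFS CSI plugin honours every flag.

mount_options {
  mount_flags = [
    "nfsvers=4.1",
    "rsize=1048576",
    "wsize=1048576",
    "hard",
    "timeo=600",
    "retrans=2",
    "noresvport"
  ]
}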

Same issue on my cluster with EFS plugin (docker: amazon/aws-efs-csi-driver:v1.3.3) It usually happens after a restart of some node. No Write Allocations, No Read Allocations, however failed to mount because CSI volume has exhausted its available writer claims.

One thing needs to be mentioned is that, right after force deregistering the volume, my Node Health recovered from (4/5) to (5/5). EFS plugin cannot be deployed to that node before deregistering.

tomiles commented 3 years ago

Same issue on our v1.1.4 cluster using the Ceph RBD and CephFS CSI plugins.

gregory112 commented 3 years ago

This may not be a network issue: even though I deployed Ceph and Nomad on the same node, mounting via localhost, the issue still persists. nomad volume detach works fine without any problem, since network issues are now out of the way. I also used a single Nomad server, but the issue still exists. The weird thing is that when I have three instances (zoo1, zoo2, zoo3) all asking for their own volumes, only some volumes (not all) have issues. Could this be a concurrency problem? A race condition, perhaps?

crestonbunch commented 3 years ago

I've noticed my AWS CSI plugins (both the node and controllers) are frequently restarting maybe 3-4 times a day. I don't know why yet. I suspect this sometimes leaves the volumes in a bad state and causes this error.

keslerm commented 3 years ago

I'm hitting the same thing using SeaweedFS. Ultimately I think a network issue was causing the volumes to disconnect, but Nomad never released them internally until I deregistered/re-registered them.

gregory112 commented 2 years ago

Any update on this? @lgfa29

vgeddes commented 2 years ago

Same issue on our v1.1.5 cluster using AWS EBS volumes. We have to manually force deregister and register the affected volumes. Happens very frequently.

Unfortunately it means we cannot automate Nomad deployments in a CI/CD stack, as a human being needs to be present to fix any stuck volumes.

lgfa29 commented 2 years ago

Any update on this? @lgfa29

Hi @gregory112, no updates yet. I will let you know once we do have any progress to share here.

JanMa commented 2 years ago

I have been digging through the Nomad source code and found a possible workaround that works well for my use case. If the volume gets registered as a multi-node-multi-writer volume instead of a single-node-writer volume, the parts of the source code that I think are buggy are simply skipped. And because I set the per_alloc flag to true, I am fairly confident that neither Nomad nor the CSI driver will actually try to attach a volume that is in use to another alloc/task.

This might have some other unintended consequences but for now it allows me to deploy updated versions of the job again without having to re-register the volume every time.
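
Concretely, the relevant parts look roughly like this on my side (only a sketch; the volume ID, name, and plugin ID are placeholders for your own setup):

# volume registration: only the capability block changes
id        = "prometheus-disk[0]"
name      = "prometheus-disk-0"
type      = "csi"
plugin_id = "<your-plugin-id>"

capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}

# job file: same volume block as before, just with the new access mode
volume "data" {
  type            = "csi"
  source          = "prometheus-disk"
  attachment_mode = "file-system"
  access_mode     = "multi-node-multi-writer"
  per_alloc       = true
}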

gregory112 commented 2 years ago

@JanMa Thanks, I'll try that. I tried deregistering my volume and re-registering it with access_mode = "multi-node-multi-writer", and set the capability to multi-node-multi-writer in the Nomad job's volume stanza, but the output of nomad volume status still shows single-node-writer. Maybe the ceph-csi driver does not support that mode and falls back to single-node-writer. Thank you for the workaround though; I'll try it with other drivers later.

m1keil commented 2 years ago

Another +1 here. We are having these issues in both dev and prod with AWS EBS.

Nomad v1.1.1 with EBS CSI driver v1.1.0. No special update {} stanza (using default).

As others have described, it looks like Nomad "forgets" that the volume is no longer allocated. In my case, I even had a job that had not been running for weeks, and when I tried to redeploy it, Nomad still complained about CSI volume <name> has exhausted its available writer claims.

Inspecting volume status showed the volume as No allocations placed and checking AWS confirmed the volume isn't mounted to any instance.

Trying to deregister the volume returns Error deregistering volume: Unexpected response code: 500 (rpc error: volume in use: <name>) and requires the -force flag.

crestonbunch commented 2 years ago

I'm seeing a lot of errors in logs of the form "error releasing volume claims [...] Permission denied".

ubuntu@ip-10-0-2-23:~$ journalctl -u nomad | grep "error releasing volume claims" -A 2 | tail
Nov 09 00:28:10 ip-10-0-2-23 nomad[24326]: "
Nov 09 00:36:15 ip-10-0-2-23 nomad[24326]:     2021-11-09T00:36:15.221Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=auth-scylladb[1] error="1 error occurred:
Nov 09 00:36:15 ip-10-0-2-23 nomad[24326]:         * Permission denied
Nov 09 00:36:15 ip-10-0-2-23 nomad[24326]: "
Nov 09 00:48:20 ip-10-0-2-23 nomad[24326]:     2021-11-09T00:48:20.202Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=auth-scylladb[1] error="1 error occurred:
Nov 09 00:48:20 ip-10-0-2-23 nomad[24326]:         * Permission denied
Nov 09 00:48:20 ip-10-0-2-23 nomad[24326]: "
Nov 09 01:08:27 ip-10-0-2-23 nomad[24326]:     2021-11-09T01:08:27.605Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=auth-scylladb[1] error="1 error occurred:
Nov 09 01:08:27 ip-10-0-2-23 nomad[24326]:         * Permission denied
Nov 09 01:08:27 ip-10-0-2-23 nomad[24326]: "
ubuntu@ip-10-0-2-23:~$ nomad --version
Nomad v1.1.4 (acd3d7889328ad1df2895eb714e2cbe3dd9c6d82)

Edit: I fixed this issue by giving way too many permissions to the nodes (mount-volumes did not seem to be enough). Unfortunately I still see the parent issue happening. So this seems unrelated.

rwojsznis commented 2 years ago

Hi all 👋 Was this issue ever fixed? I'm running Nomad 1.1.6 with the Hetzner CSI driver 1.6.0, and when it works it's great, but when it breaks it's pure chaos: a stuck controller allocation (even though the CSI plugin was successfully started by Docker and throws no errors), and Nomad throwing the 'exhausted claims' error once the CSI driver allocation succeeds. Today I had to force deregister and re-register all volumes to get it working again.

Any way I can help debugging this?

samed commented 2 years ago

Another +1 from here. I've tried both EBS and Ceph and hit this issue. Is this related to https://github.com/hashicorp/nomad/issues/10833? @tgross @lgfa29

MagicRB commented 2 years ago

I don't want to be mean towards the Nomad team, but this is a really big issue in my opinion: it basically disqualifies CSI for any serious HA deployments (not that my personal website is HA). I'm purposefully avoiding CSI on a few key services to make sure they don't randomly die. Just my two cents; it sucks that this is an issue... Nomad is otherwise a stellar system.

crestonbunch commented 2 years ago

+1. Nomad is truly an amazing alternative to Kubernetes. But at times I have considered switching just because of this issue. In the meantime I have stopped using CSI altogether.

My use case is simply to mount an AWS EBS volume for some databases. I can do that without CSI by running a script as pre- and post-start lifecycle tasks. If anyone is interested, here are the scripts I use.

With CSI, tasks died every day and failed to restart. With this solution not a single task has failed.

raven-oscar commented 2 years ago

+1. CSI at the moment has to be considered extremely unreliable because of this bug.

schmichael commented 2 years ago

The Nomad team at HashiCorp hears you! Sorry we haven't shipped a fix yet, but it's on our roadmap. In the meantime I've pinned the issue to raise its visibility.

Meanwhile, can anyone confirm how viable the two mentioned workarounds are? I've seen:

  1. Use multi-node-multi-writer access mode - May not work with all drivers or jobs - https://github.com/hashicorp/nomad/issues/10927#issuecomment-935747605
  2. Force detach the volume: https://github.com/hashicorp/nomad/issues/10927#issuecomment-917676568
trilom commented 2 years ago

I will say that, from my experience, trying to use Ceph with Nomad has been kinda bad.

One thing that has caused a lot of pain is following the guidance in this area (https://github.com/hashicorp/nomad/tree/main/demo/csi/ceph-csi-plugin). For example, setting the --instance-id flag per alloc_id causes the controller to create an omap key per alloc, so whenever a controller alloc dies it creates a new omap ID and cannot find the previous volumes.

Couple that with the pain of forcefully deregistering volumes and re-registering them as a workaround: if the controller has changed in the meantime, it no longer knows where the volumes live in Ceph, because alloc XYZ:VOLUME1 is not the same as alloc ABC:VOLUME1.

keslerm commented 2 years ago

@schmichael Force-deregistering the volume and re-creating it is how I've been dealing with it using the Linode CSI plugin.

m1keil commented 2 years ago

@schmichael The multi-node-multi-writer access mode isn't supported by AWS EBS, so it cannot be applied. As for force detach, that's really a workaround for the consequences, not a way to avoid the whole situation in the first place.

Cheers for raising the visibility on this issue.

JohnKiller commented 2 years ago

using the --instance-id flag to the instance per alloc_id will cause the controller to create an omap key per alloc, such that whenever an alloc of controller dies it creates a new omap ID and cannot find previous volumes.

@trilom Are you saying that the documentation is wrong and the flag shouldn't be used, or is this a bug that prevents it from working the way it's intended to? I'm also running Ceph, and information on the internet is very scarce.

gregory112 commented 2 years ago

I will say from my experience dealing with wanting to use ceph with nomad has been kinda bad.

One thing that has caused a lot of pain is following guidance in this area https://github.com/hashicorp/nomad/tree/main/demo/csi/ceph-csi-plugin, for example using the --instance-id flag to the instance per alloc_id will cause the controller to create an omap key per alloc, such that whenever an alloc of controller dies it creates a new omap ID and cannot find previous volumes.

When coupled with the pain of "deregistering forcefully" volumes and re-registering them as a solution and the controller has changed causes it to not know in ceph where volumes are, alloc XYZ:VOLUME1 is not the same as alloc ABC:VOLUME1.

I still use ${node.unique.name}-nodes as the instance-id for the Ceph plugin node and -controller for the controller, and I use job type = "system".

Yet I agree with you: using Ceph together with CSI and Nomad has given me a bad experience. On the bright side, the multi-node-multi-writer workaround currently works for me, but the system still needs manual intervention, because the whole setup does not survive a reboot. My whole stack (Nomad + Consul + Ceph) is on the same host for testing, and when the host is rebooted, Ceph and Nomad come up normally but the plugin does not. The controller and node allocations run, yet the plugin is not registered in Nomad, or if it is registered, it shows no nodes in nomad plugin status even though everything is shown as running in nomad job status. It probably hangs because the plugin starts before Ceph, but even so it stays like that forever, and I have to manually restart the job. I am still investigating this, so I have not opened a new bug yet. I wonder if anyone else sees the same issue.

So yes, I agree with you. It is quite unreliable.

trilom commented 2 years ago

using the --instance-id flag to the instance per alloc_id will cause the controller to create an omap key per alloc, such that whenever an alloc of controller dies it creates a new omap ID and cannot find previous volumes.

@trilom are you saying that the documentation is wrong and that flag shouldn't be used or is this bug that is preventing that from working the way it's intended to? I'm also running ceph and the information on the internet is very scarce.

@JohnKiller Check the pool you are creating volumes against with rados ls -p RBD_POOL | grep csi.volume; this lists the omap objects created with the defaults. You should see one named csi.volumes.default if you don't assign an instance-id; if you do assign one, an omap key csi.volumes.INSTANCE_ID is created instead. If you ask the omap key for its values with rados listomapvals -p RBD_POOL csi.volumes.default, you can see that it maps the volume names (in Nomad) to volume IDs (in Ceph).

rados ls -p h2_nomad  | grep csi.volume
csi.volumes.default
csi.volume.792614e1-4ea2-11ec-816b-0242ac110009
rados listomapvals -p h2_nomad csi.volumes.default
csi.volume.prometheus_dev
value (36 bytes) :
00000000  37 39 32 36 31 34 65 31  2d 34 65 61 32 2d 31 31  |792614e1-4ea2-11|
00000010  65 63 2d 38 31 36 62 2d  30 32 34 32 61 63 31 31  |ec-816b-0242ac11|
00000020  30 30 30 39                                       |0009|
00000024

Consider a scenario where the controller node changes: the omap key is recreated and no longer knows the IDs of the volumes. My best use for this flag is when you have a pool with multiple Nomad clusters claiming volumes: rather than both using csi.volumes.default, where the volume IDs would live in the same omap K/V, you can split them into csi.volumes.CLUSTER1 and csi.volumes.CLUSTER2 by assigning --instance-id=CLUSTER1. This is my experience with the canary tag of ceph-csi.

For reference, here is what I am running for the node and controller. As documented on the main branch linked before, the node doesn't require the secret or a connection to the Ceph monitors. However, running them that way and then attempting to use a volume within a job will cause an error until you give the nodes the secret directory and the monitor configuration.

job "plugin-ceph-csi-node" {
  priority = 94
  datacenters = ["h2"]
  type = "system"
  update {
    max_parallel = 1
  }
  group "cephrbd" {
    network {
      port "prometheus" {}
    }
    restart {
      attempts = 5
      interval = "2m"
      delay    = "15s"
      mode     = "fail"
    }
    service {
      name = "prometheus"
      port = "prometheus"
      tags = ["ceph-csi-node", "cephrbd"]
      check {
        type            = "http"
        path            = "/metrics"
        port            = "prometheus"
        interval        = "30s"
        timeout         = "5s"
        check_restart {
          limit           = 5
          grace           = "60s"
          ignore_warnings = false
        }
      }
    }
    task "plugin" {
      driver = "docker"
      config {
        image = "quay.io/cephcsi/cephcsi:canary"
        privileged = true
        ports      = ["prometheus"]
        args = [
          "--drivername=rbd.csi.ceph.com",
          "--v=5",
          "--type=rbd",
          "--nodeserver=true",
          "--nodeid=${node.unique.name}-${NOMAD_ALLOC_ID}",
          # this will define the omap key to which volumes are bound to
          # it should be unique per cluster, if it changes then volume-id's change
          #"--instanceid=h2",
          "--metricsport=${NOMAD_PORT_prometheus}",
          "--endpoint=unix://csi/csi.sock",
          "--enableprofiling" # this enables prometheus
        ]
        mount {
          type     = "bind"
          source   = "secrets"
          target   = "/tmp/csi/keys"
          readonly = false
        }
        mount {
          type     = "bind"
          source   = "ceph-csi-config/config.json"
          target   = "/etc/ceph-csi-config/config.json"
          readonly = false
        }
      }
      template {
        destination = "ceph-csi-config/config.json"
        data = <<-EOF
          [{
              "clusterID": "01",
              "monitors": [
                  {{range $index, $service := service "mon.ceph-h2z-in|any"}}{{if gt $index 0}}, {{end}}"{{.Address}}"{{end}}
              ]
          },{
              "clusterID": "02",
              "monitors": [
                  {{range $index, $service := service "mon.ceph-369-wtf|any"}}{{if gt $index 0}}, {{end}}"{{.Address}}"{{end}}
              ]
          }]
        EOF
      }
      csi_plugin {
        id        = "cephrbd"
        type      = "node"
        mount_dir = "/csi"
      }
      resources {
        cpu    = 75
        memory = 128
        memory_max = 256
      }
    }
  }
}
job "plugin-ceph-csi-controller" {
  priority = 95
  datacenters = ["h2"]
  group "cephrbd" {
    count = 1
    network {
      port "prometheus" {}
    }
    restart {
      attempts = 3
      interval = "2m"
      delay    = "15s"
      mode     = "fail"
    }
    service {
      name = "prometheus"
      port = "prometheus"
      tags = ["ceph-csi-controller", "cephrbd"]
      check {
        type            = "http"
        path            = "/metrics"
        port            = "prometheus"
        interval        = "30s"
        timeout         = "5s"
        check_restart {
          limit           = 2
          grace           = "60s"
          ignore_warnings = false
        }
      }
    }
    task "plugin" {
      driver = "docker"
      config {
        image = "quay.io/cephcsi/cephcsi:canary"
        ports = ["prometheus"]
        args = [
          "--drivername=rbd.csi.ceph.com",
          "--type=rbd",
          "--v=5", # verbose 5
          "--controllerserver=true",
          "--nodeid=${node.unique.name}",
          "--metricsport=${NOMAD_PORT_prometheus}",
          # this will define the omap key to which volumes are bound to
          # it should be unique per cluster, if it changes then volume-id's change
          #"--instanceid=h2",
          "--endpoint=unix://csi/csi.sock",
          "--enableprofiling" # this enables prometheus
        ]
        mount {
          type     = "bind"
          source   = "secrets"
          target   = "/tmp/csi/keys"
          readonly = false
        }
        mount {
          type     = "bind"
          source   = "ceph-csi-config/config.json"
          target   = "/etc/ceph-csi-config/config.json"
          readonly = false
        }
      }
      template {
        destination = "ceph-csi-config/config.json"
        data = <<-EOF
          [{
              "clusterID": "01",
              "monitors": [
                  {{range $index, $service := service "mon.ceph-h2z-in|any"}}{{if gt $index 0}}, {{end}}"{{.Address}}"{{end}}
              ]
          },{
              "clusterID": "02",
              "monitors": [
                  {{range $index, $service := service "mon.ceph-369-wtf|any"}}{{if gt $index 0}}, {{end}}"{{.Address}}"{{end}}
              ]
          }]
        EOF
      }
      csi_plugin {
        id        = "cephrbd"
        type      = "controller"
        mount_dir = "/csi"
      }
      resources {
        cpu    = 50
        memory = 96
        memory_max = 192
      }
    }
  }
}
trilom commented 2 years ago

I think my main issue now is what happens when a volume is unable to connect: it isn't handled well. Eventually a container lingers that cannot be killed (Docker cannot kill it). When you check the host, there are kernel errors complaining about the connection to the mon and sometimes the osd.

I have not found a way to kill that process without killing the node. Investigating the node, you can see that the RBD mount still exists, so it must not have been unmounted and unmapped successfully. After killing the node, then deregistering and recreating the volume, another node (or the same one after rebooting) can claim the volume.

There is no graceful way to handle this other than working against a known-good pattern of volume claims, introducing a ton of process, or switching to another workload orchestrator to get easy volumes.

tgross commented 2 years ago

Hi folks, can we split off the Ceph demo discussion to a new GitHub issue? I suspect the problem is just that the demo is only a demo, and the job spec assumed a single node (ref):

For demonstration purposes only, you can run Ceph as a single container Nomad job on the Vagrant VM managed by the Vagrantfile at the top-level of this repo.

But let's open a new issue for that so that it gets the attention it deserves!

For folks who are interested in this issue, which is about the unfortunate writer claims bug, as @schmichael said it's on our roadmap and I'm back on the case to fix this and some of the other outstanding issues around CSI to finally push it out of beta. We can't give a timeline but I'll update here and on other open issues regularly as we squash the remaining bugs. Thanks for your patience, all!

Ramblurr commented 2 years ago

Chiming in with an answer to this question:

In the mean time can anyone confirm how viable the two workarounds mentioned are? I've seen:

1. Use multi-node-multi-writer access mode - May not work with all drivers or jobs -
2. Force detach the volume:
  1. This works temporarily but does not survive node reboots
  2. This does not work once a node has been rebooted (or replaced) because the volume isn't attached anywhere and so can't be detached, volume status <vol> shows No allocations placed.

The only way to recover is to deregister the volume manually.

I'm using the Ceph CSI and democratic-csi.

dcarbone commented 2 years ago

@tgross Point two as listed by @Ramblurr is extremely common. As an example, here is the cluster I'm using to test out CSI for stuff in my home:

(animated GIF of the cluster's CSI volumes view, recorded 2021-12-12)

This is after a single cluster-wide reboot (I replaced and upgraded some switching gear). To shut down, I did the following:

  1. Drain all non-system jobs from each client, one at a time
  2. Drain all system jobs from each client, one at a time
  3. Stop the nomad client agent before shutting down the host, one at a time.
  4. Stop the nomad server agent before shutting down the host, one at a time.

Even with this care, two volumes (syncthing-config and syncthing-data) are now useless to me in a literal sense: I cannot deregister them because one thing thinks they're allocated, and I cannot detach them because another thing thinks they aren't.

I believe this problem is caused by a bunch of places only checking the length of the alloc map(s), but not the actual contents of said map(s).

Some examples of this:

  1. https://github.com/hashicorp/nomad/blob/main/nomad/structs/csi.go#L359
  2. https://github.com/hashicorp/nomad/blob/main/nomad/structs/csi.go#L414

This is a problem as there are several places in the codebase where the map is written to with a nil value:

  1. https://github.com/hashicorp/nomad/blob/main/nomad/state/state_store.go#L2358
  2. https://github.com/hashicorp/nomad/blob/main/nomad/structs/csi.go#L574

I haven't had a chance to dig through the logic too much, but it looks like the GC routine is expected to clean these up, and it seems like it's pretty easy for things to end up in a permanently broken state.
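
To make the failure mode concrete, here is a tiny self-contained Go sketch of that pattern. The types are hypothetical stand-ins, not Nomad's actual structs; it only illustrates why a length-only check can keep reporting a claim after the allocation is gone.

package main

import "fmt"

// Alloc stands in for an allocation; a nil map value models a claim
// whose allocation has already been garbage collected.
type Alloc struct{ ID string }

type Volume struct {
    WriteAllocs map[string]*Alloc
}

// inUseByLength mirrors a length-only check: any key, even one whose
// value is nil, keeps the volume looking claimed.
func (v *Volume) inUseByLength() bool { return len(v.WriteAllocs) > 0 }

// inUseByContents counts only entries that still point at a live alloc.
func (v *Volume) inUseByContents() bool {
    for _, a := range v.WriteAllocs {
        if a != nil {
            return true
        }
    }
    return false
}

func main() {
    v := &Volume{WriteAllocs: map[string]*Alloc{}}
    v.WriteAllocs["old-alloc"] = nil // stale claim left behind with a nil value

    fmt.Println(v.inUseByLength())   // true  -> "exhausted its available writer claims"
    fmt.Println(v.inUseByContents()) // false -> the volume is actually free
}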

As you can see in the gif, I went ahead and created another CSI volume to replace the syncthing_config volume (and switched to iSCSI, because why not), and I'm going to have to do the same thing for the data volume to get things back into a functional state. It also doesn't always happen, as evidenced by the unifi-data-iscsi volume being usable. It seems like the GC routine can get "stuck"; perhaps it only checks the Claims map for state and not the Alloc map, assuming that something else keeps them both in sync?

tgross commented 2 years ago

Hi @dcarbone yup, that's more-or-less the area I'm digging into. See also https://github.com/hashicorp/nomad/issues/8734 and https://github.com/hashicorp/nomad/issues/10052

dcarbone commented 2 years ago

Just as a bump: attempting to use Terraform to update my syncthing job in place has resulted in my new syncthing-config-iscsi volume becoming stuck forever :(

tgross commented 2 years ago

Just a heads up for folks eagerly watching this one, I've broken this work down into a handful of fixes to investigate and implement and the first is https://github.com/hashicorp/nomad/pull/11776. I'll continue to update here as I go.

tgross commented 2 years ago

Hi folks, just wanted to give an update here. I've got an environment set up where I can reproduce this bug reasonably reliably using the democratic-csi plugin (thanks to the folks at @democratic-csi for having a nice Nomad setup there) with NFS. I've found two clues as to where the bug may be:

dcarbone commented 2 years ago

@tgross ~perhaps in the short term we could have a way to "force" detach volumes? I'm racking up quite a number of them :(~

FWIW, a very reliable way of triggering this problem is to try to update a job in-place. Update a template stanza, or one of the service.tags values. When I try to do this with Terraform, there is around a 1/3 chance that it'll cause at least one of the volumes to become "stuck".

Edited as @trilom's suggestion works perfectly :)