hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

"csi raft apply failed" error on garbage collection #17025

Open nicolasscott opened 1 year ago

nicolasscott commented 1 year ago

Nomad version

v1.5.3

Operating system and Environment details

Ubuntu 20.04

Issue

When Nomad runs a garbage collection, we see the following error:

Apr 28 22:23:57 ip-10-0-0-99 nomad[1813]: {"@level":"error","@message":"csi raft apply failed","@module":"nomad.csi_plugin","@timestamp":"2023-04-28T22:23:57.876823Z","error":"plugin in use","method":"delete"}

I've tried several versions of the plugin, so the issue appears to be with Nomad itself.

Reproduction steps

Use the AWS EBS CSI plugin with the following job specs:

job "plugin-aws-ebs-controller" {
  datacenters = ["dc1"]
  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.18.0"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

job "plugin-aws-ebs-nodes" {
  datacenters = ["dc1"]

  # you can run node plugins as service jobs as well, but this ensures
  # that all nodes in the DC have a copy.
  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.18.0"

        args = [
          "node",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Register an AWS EBS volume and run a nomad system gc.

Expected Result

No errors

Actual Result

A log entry such as:

Apr 28 22:23:57 ip-10-0-0-99 nomad[1813]: {"@level":"error","@message":"csi raft apply failed","@module":"nomad.csi_plugin","@timestamp":"2023-04-28T22:23:57.876823Z","error":"plugin in use","method":"delete"}


thefallentree commented 1 year ago

We see this issue too; it happens when Nomad fails to GC a volume. Currently we have to manually run nomad volume deregister to get back to a working state.

jrasell commented 1 year ago

Hi @nicolasscott and thanks for the report; I'll get it added to our backlog.

ygersie commented 1 year ago

It looks like we're running into a similar issue in 1.6.1 with the following message on the leader:

    2023-08-21T10:39:03.249Z [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
    2023-08-21T10:39:03.249Z [ERROR] nomad.csi_volume: csi raft apply failed: error="volume max claims reached" method=claim

The claim is never released for allocations that have been garbage collected.

apollo13 commented 11 months ago

I am seeing this on the leader in my cluster every five minutes:

Sep 29 21:40:05 nomad03 nomad[663]:     2023-09-29T21:40:05.115+0200 [ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete
Sep 29 21:45:05 nomad03 nomad[663]:     2023-09-29T21:45:05.115+0200 [ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete
Sep 29 21:50:05 nomad03 nomad[663]:     2023-09-29T21:50:05.117+0200 [ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete
Sep 29 21:55:05 nomad03 nomad[663]:     2023-09-29T21:55:05.114+0200 [ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete
Sep 29 22:00:05 nomad03 nomad[663]:     2023-09-29T22:00:05.114+0200 [ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete

@tgross Since you are the resident CSI expert -- any information I can get you from that node?

Himura2la commented 6 months ago

I'm also experiencing the same issue after we started using Nomad CSI with Ceph:

Feb 19 17:11:17 REDACTED nomad[39920]:     2024-02-19T17:11:17.183+0200 [ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete