hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad / AWS EBS plugin doesn't appropriately report the volume's status #17332

Closed Zygimantass closed 8 months ago

Zygimantass commented 1 year ago

Nomad version

Nomad v1.4.4 (7f29429be12098e0f3a09df959d9272aa0654cba)

Operating system and Environment details

DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"

Runs on AWS and uses EBS volumes.

Issue

Nomad says that a deployment is missing a CSI volume, when in reality the volume is ready and available. This especially manifests after restarting a job or restarting the EBS plugin.

Reproduction steps

  1. Create an AWS EBS volume
  2. Register it with Nomad
  3. Create a job that uses the EBS volume
  4. Restart the AWS EBS plugin
  5. Try to re-deploy the original job

Alternatively, this sometimes also happens when simply restarting a job or draining a node.
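
For reference, step 2 (registering the volume with Nomad) could use a volume specification roughly like the sketch below. The volume ID matches the one named in the placement error, and the capability mirrors the job's volume block. The plugin ID and the EBS volume ID are hypothetical placeholders, not values taken from this report.

```hcl
# volume.hcl: a sketch of a CSI volume specification for `nomad volume register`.
# Values marked "placeholder" are assumptions for illustration only.

id   = "pisco-1-node-volume" # ID the job's volume block resolves to
name = "pisco-1-node-volume"
type = "csi"

plugin_id   = "aws-ebs0"              # placeholder: ID of the running EBS CSI plugin
external_id = "vol-0123456789abcdef0" # placeholder: the AWS EBS volume ID

# Must match the access_mode and attachment_mode requested by the job.
capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}
```

If the registered capability or plugin ID does not match what the job requests, the scheduler can filter out nodes for that volume, so it is worth confirming these fields when reproducing.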

Expected Result

The job will restart successfully as the volume is not busy.

Actual Result

==> 2023-05-26T17:36:32+02:00: Monitoring evaluation "48a168b8"
    2023-05-26T17:36:33+02:00: Evaluation triggered by job "pisco-1-node"
    2023-05-26T17:36:34+02:00: Evaluation within deployment: "9198eafe"
    2023-05-26T17:36:34+02:00: Evaluation status changed: "pending" -> "complete"
==> 2023-05-26T17:36:34+02:00: Evaluation "48a168b8" finished with status "complete" but failed to place all allocations:
    2023-05-26T17:36:34+02:00: Task Group "node" (failed to place 1 allocation):
      * Constraint "missing CSI Volume pisco-1-node-volume": 5 nodes excluded by filter
      * Constraint "${meta.purpose} = node": 5 nodes excluded by filter
    2023-05-26T17:36:34+02:00: Evaluation "ef164672" waiting for additional capacity to place remainder
==> 2023-05-26T17:36:34+02:00: Monitoring deployment "9198eafe"
  ⠴ Deployment "9198eafe" in progress...

    2023-05-26T17:41:46+02:00
    ID          = 9198eafe
    Job ID      = pisco-1-node
    Job Version = 0
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    node        1        0       0        0          N/A

Job file (if appropriate)

Relevant part (how we mount the volume):

    volume "data" {
      type = "csi"
      read_only = false
      source = "${local.chain_id}-node-volume"
      access_mode = "single-node-writer"
      attachment_mode = "file-system"
    }

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

lgfa29 commented 1 year ago

Hi @Zygimantass 👋

Thanks for the report. Would you be able to provide a sample job file for your workload and for your CSI plugin? Also, do you see the volume listed when you run the nomad volume status command?

Thanks!

lgfa29 commented 8 months ago

Hello from the future 👋

I'm cleaning up some stale issues, and we haven't had any updates on this one for a while, so I'm going to close it for now. Let me know if this is still a problem and we can reopen it.

Thank you!