hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

csi: volume `ControllerRequired` is set to `false` when the plugin stops #18235

Open lgfa29 opened 1 year ago

lgfa29 commented 1 year ago

Nomad version

1.6.1

Operating system and Environment details

Ubuntu 22.04

Issue

While investigating https://github.com/hashicorp/terraform-provider-nomad/issues/367 I noticed panics on the CSI delete endpoint. https://github.com/hashicorp/nomad/pull/18234 fixes the nil panic, but it seems like there is a bigger underlying problem where the volume ControllerRequired field is set to false when the plugin job stops.

Reproduction steps

  1. Set up the hostpath CSI demo.
  2. Stop plugin job.
  3. Delete volume.

Expected Result

ControllerRequired remains true.

Actual Result

ControllerRequired becomes false.

tgross commented 1 year ago

A hypothesis for whoever picks this up: when a plugin allocation stops it deregisters itself (or technically, the dynamic plugin registry on the client updates the node's fingerprint and that deregisters the plugin). Once all plugins for a given controller have stopped, Nomad has no way of knowing whether a controller plugin is required for a given volume. When volumes are queried from the state store, we denormalize the plugins (which are on a different memdb table) for that volume, so that might result in the behavior we're seeing here.
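
To make the hypothesis concrete, here is a minimal toy sketch of read-time denormalization across two tables. This is not Nomad's state store code: `Store`, `Plugin`, `Volume`, `VolumeByID`, and `DeregisterPlugin` are hypothetical names, and plain maps stand in for the memdb tables. It only illustrates how a field copied from the plugin at query time could flip to `false` once the plugin row is gone, assuming the lookup falls back to a zero value.

```go
package main

import "fmt"

// Plugin is a hypothetical stand-in for a row in the plugins table.
type Plugin struct {
	ID                 string
	ControllerRequired bool
}

// Volume is a hypothetical stand-in for a row in the volumes table.
type Volume struct {
	ID                 string
	PluginID           string
	ControllerRequired bool
}

// Store models two separate tables keyed by ID (an assumption for
// illustration, not the real schema).
type Store struct {
	plugins map[string]Plugin
	volumes map[string]Volume
}

// VolumeByID returns a copy of the volume with the plugin-derived field
// filled in at read time.
func (s *Store) VolumeByID(id string) (Volume, bool) {
	vol, found := s.volumes[id]
	if !found {
		return Volume{}, false
	}
	// A missing plugin yields the zero value Plugin{}, so the copied
	// ControllerRequired reads as false regardless of what was stored.
	plug := s.plugins[vol.PluginID]
	vol.ControllerRequired = plug.ControllerRequired
	return vol, true
}

// DeregisterPlugin removes the plugin row, as would happen in this model
// once every allocation for the plugin has stopped.
func (s *Store) DeregisterPlugin(id string) {
	delete(s.plugins, id)
}

func main() {
	s := &Store{
		plugins: map[string]Plugin{
			"hostpath": {ID: "hostpath", ControllerRequired: true},
		},
		volumes: map[string]Volume{
			"vol-1": {ID: "vol-1", PluginID: "hostpath", ControllerRequired: true},
		},
	}

	before, _ := s.VolumeByID("vol-1")
	s.DeregisterPlugin("hostpath")
	after, _ := s.VolumeByID("vol-1")

	// Prints: true false
	fmt.Println(before.ControllerRequired, after.ControllerRequired)
}
```

In this model the stored volume row still has `ControllerRequired: true`; only the denormalized copy returned to the caller changes. If the real behavior matches, one possible direction would be to skip overwriting the field when no plugin is found, but that is a guess until the actual code path is confirmed.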