hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.8k stars 1.94k forks source link

Poor handling of "nomad volume status -verbose" for CSI plugins without LIST_VOLUMES capability #15040

Open ejweber opened 1 year ago

ejweber commented 1 year ago

Nomad version

Nomad v1.3.3 (428b2cd8014c48ee9eae23f02712b7219da16d30) Tested with this version but v1.4.1 code has the same issues (and is linked below).

Operating system and Environment details

Red Hat Enterprise Linux release 8.4 (Ootpa)

Issue

The BeeGFS CSI Driver does not support the CSI LIST_VOLUMES capability. When nomad volume status -verbose is run, a raw HTTP error code and the message "unimplemented for this plugin" are shown.

Reproduction steps

Expected Result

Similar output to when nomad volume status is run, perhaps with an additional message indicating specifically that LIST_VOLUMES is not implemented or that listing external volumes is not supported by this plugin.

Actual Result

Similar output to when nomad volume status is run, followed by an raw HTTP error code and the message "unimplemented for this plugin".

Nomad command output

webere@webere-dev:~/beegfs-csi-briver$ nomad volume status
Container Storage Interface
ID                 Name               Plugin ID          Schedulable  Access Mode
beegfs-csi-volume  beegfs-csi-volume  beegfs-csi-plugin  true         multi-node-multi-writer

webere@webere-dev:~/beegfs-csi-briver$ nomad volume status --verbose
Container Storage Interface
ID                 Name               Plugin ID          Schedulable  Access Mode
beegfs-csi-volume  beegfs-csi-volume  beegfs-csi-plugin  true         multi-node-multi-writer
Error querying CSI external volumes for plugin "beegfs-csi-plugin": Unexpected response code: 500 (rpc error: unimplemented for this plugin)

Code deep dive

This only occurs because we execute "nomad volume status" with BOTH "-verbose" AND no specified ID.

No specified ID invokes "list mode":

https://github.com/hashicorp/nomad/blob/v1.4.1/command/volume_status_csi.go#L20-L23

No "-verbose" returns early:

https://github.com/hashicorp/nomad/blob/v1.4.1/command/volume_status_csi.go#L104-L106

Both conditions together ultimately result in the invocation of the ListVolumes RPC (and the documented failure message):

https://github.com/hashicorp/nomad/blob/v1.4.1/command/volume_status_csi.go#L128-L136

A Nomad server returns a 500 error code and the message "unimplemented for this plugin" when it recognizes that a driver doesn't advertise the LIST_VOLUMES capability.

https://github.com/hashicorp/nomad/blob/v1.4.1/nomad/csi_endpoint.go#L1189-L1191

We are supposed to see something like this for drivers that support the LIST_VOLUMES capability. Nomad already quietly moves on without printing the external list portion if it thinks nothing went wrong and doesn't have a list of volumes.

https://github.com/hashicorp/nomad/blob/v1.4.1/command/volume_status_csi.go#L137-L140

It seems like we just need a way for Nomad to understand to move on quietly in this circumstance as well. Maybe the Nomad server should just return an empty list instead of a 500 error when the LIST_VOLUMES capability isn't supported? Maybe the Nomad command should know to look for "unimplemented for this plugin" in the error output and move on quietly (or optionally print something more informative)?

tgross commented 1 year ago

Hi @ejweber! That's definitely not great UX in retrospect. The API error code is an unfortunate byproduct of the way we translate from RPC calls to HTTP API. I'm not sure I'd want to return an empty list here, which would confuse the cases of "this can never work" and "there's no results to be had". But there's no reason the CLI can't present that in a nicer way.

I'll mark this for roadmapping to clean it up on the CLI side of things.