hashicorp / nomad-device-nvidia

Nomad device driver for Nvidia GPU
Mozilla Public License 2.0

add `gpu uuid` to the fingerprinted attributes #11

Closed shumin1027 closed 1 year ago

shumin1027 commented 1 year ago

I want to use the GPU UUID when configuring an affinity or constraint, but there is no such attribute among the fingerprinted attributes. How can this be achieved?

Fingerprinted Attributes

lgfa29 commented 1 year ago

Hi @shumin1027 👋

I believe the UUID is just missing from the documentation as the fingerprinter does report the UUID.

Could you check if the GPU UUID is present when you run `nomad node status -verbose <node ID>`?

shumin1027 commented 1 year ago

@lgfa29 Thank you. It's not just that the documentation is missing it; all fingerprinted attributes are defined here: https://github.com/hashicorp/nomad-device-nvidia/blob/541c8ca6aee84e14c4f0e79c39c1a2c431008dff/fingerprint.go#L13-L24

And here is where the attribute data is populated; it shows that the GPU UUID is indeed not included: https://github.com/hashicorp/nomad-device-nvidia/blob/541c8ca6aee84e14c4f0e79c39c1a2c431008dff/fingerprint.go#L167

https://github.com/hashicorp/nomad-device-nvidia/blob/541c8ca6aee84e14c4f0e79c39c1a2c431008dff/fingerprint.go#L181

shumin1027 commented 1 year ago

@lgfa29 By the way, I tried to solve this by adding the GPU UUID to the fingerprinted attributes here: https://github.com/hashicorp/nomad-device-nvidia/blob/541c8ca6aee84e14c4f0e79c39c1a2c431008dff/fingerprint.go#L181

But I found a new problem: on a dual-GPU device such as the NVIDIA Tesla K80, there is a conflict: https://github.com/hashicorp/nomad-device-nvidia/blob/541c8ca6aee84e14c4f0e79c39c1a2c431008dff/fingerprint.go#L164-L167

Two GPUs with different UUIDs are treated as one device with the same fingerprinted attributes.

I still don't know how to solve this problem elegantly.
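
To illustrate the conflict, here is a minimal sketch, not the plugin's actual code. It assumes devices are grouped by model name and that each group carries a single attribute map filled from the first device seen; the `GPU` and `Group` types, the `groupByName` function, and the `GPU-aaaa`/`GPU-bbbb` IDs are all hypothetical.

```go
package main

import "fmt"

// GPU is a hypothetical, simplified stand-in for the per-device data
// collected during fingerprinting.
type GPU struct {
	UUID string
	Name string
}

// Group is a hypothetical stand-in for a device group: several instances
// share one attribute map.
type Group struct {
	Name        string
	InstanceIDs []string
	Attributes  map[string]string
}

// groupByName groups GPUs by model name. Group-level attributes are taken
// from the first GPU seen for each name (an assumption for illustration).
func groupByName(gpus []GPU) map[string]*Group {
	groups := make(map[string]*Group)
	for _, g := range gpus {
		grp, ok := groups[g.Name]
		if !ok {
			grp = &Group{
				Name:       g.Name,
				Attributes: map[string]string{"uuid": g.UUID},
			}
			groups[g.Name] = grp
		}
		grp.InstanceIDs = append(grp.InstanceIDs, g.UUID)
	}
	return groups
}

func main() {
	// Two GPUs on one dual-GPU board: same model name, different UUIDs.
	gpus := []GPU{
		{UUID: "GPU-aaaa", Name: "Tesla K80"},
		{UUID: "GPU-bbbb", Name: "Tesla K80"},
	}
	for _, grp := range groupByName(gpus) {
		fmt.Printf("%s instances=%v uuid attribute=%s\n",
			grp.Name, grp.InstanceIDs, grp.Attributes["uuid"])
	}
}
```

In this model both GPUs land in a single group, so a group-level `uuid` attribute can only describe one of them. A constraint on that attribute could never select the second GPU.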

lgfa29 commented 1 year ago

Oh, you're right @shumin1027, I misunderstood the code. I don't think it's possible to fix this at the plugin level; it seems to be a limitation within Nomad.

I opened https://github.com/hashicorp/nomad/pull/15455 to try and fix this. I'm building custom binaries for you to test if you have the chance and I will post them here once they're ready.

lgfa29 commented 1 year ago

Here are the custom binaries: https://github.com/hashicorp/nomad/actions/runs/3604896100#artifacts

shumin1027 commented 1 year ago

@lgfa29 That's great, I will continue testing.

lgfa29 commented 1 year ago

Nice! If the fix works for you, feel free to close this issue. I expect the PR to be merged soon and released in the next version of Nomad.

ruspaul013 commented 1 year ago

Hello @shumin1027, @lgfa29! I tried using `device.ids` as a constraint in a job file, but every time I get random GPUs instead of the ones whose UUIDs I set. Here is the link to the Nomad forum post I made, which has more details about the issue.

Thank you!

lgfa29 commented 1 year ago

Hi @ruspaul013 👋

I answered in your post, but TL;DR: yes, you need to upgrade your Nomad clients to a version that includes this change.

ruspaul013 commented 1 year ago

Hello @lgfa29 👋 Thank you so much for the answer. I posted an update on the forum.