Closed shumin1027 closed 1 year ago
Hi @shumin1027 👋
I believe the UUID is just missing from the documentation as the fingerprinter does report the UUID.
Could you check if the GPU UUID is present when you run nomad node status -verbose <node ID>?
@lgfa29 Thank you. It's not that the documentation is missing something; all fingerprinted attributes are defined here: https://github.com/hashicorp/nomad-device-nvidia/blob/541c8ca6aee84e14c4f0e79c39c1a2c431008dff/fingerprint.go#L13-L24
And here is where the attribute data is populated. As you can see, there is indeed no GPU UUID:
https://github.com/hashicorp/nomad-device-nvidia/blob/541c8ca6aee84e14c4f0e79c39c1a2c431008dff/fingerprint.go#L167
@lgfa29
By the way, I tried to solve this problem by adding the GPU UUID to the fingerprinted attributes here:
https://github.com/hashicorp/nomad-device-nvidia/blob/541c8ca6aee84e14c4f0e79c39c1a2c431008dff/fingerprint.go#L181
But that exposed a new problem. With a dual-GPU device like the NVIDIA Tesla K80, there is a conflict:
https://github.com/hashicorp/nomad-device-nvidia/blob/541c8ca6aee84e14c4f0e79c39c1a2c431008dff/fingerprint.go#L164-L167
Two GPUs with different UUIDs are treated as one device with the same fingerprinted attributes.
I still don't know how to solve this problem elegantly.
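To make the conflict concrete, here is a minimal Go sketch of the situation as I understand it. It is not the plugin's actual code: the gpu and deviceGroup types, the groupByName function, the "uuid" attribute key, and the UUID values are all hypothetical, and it assumes GPUs are grouped by device name with one attribute map shared per group.

```go
package main

import (
	"fmt"
	"sort"
)

// gpu is a hypothetical, simplified stand-in for the per-GPU data read by the plugin.
type gpu struct {
	UUID string
	Name string
}

// deviceGroup is a hypothetical stand-in for a fingerprinted device group:
// many device IDs, but a single attribute map shared by all of them.
type deviceGroup struct {
	Name       string
	IDs        []string
	Attributes map[string]string
}

// groupByName groups GPUs by device name (an assumption about how grouping works)
// and naively stores the UUID as a group-level attribute.
func groupByName(gpus []gpu) []deviceGroup {
	byName := map[string]*deviceGroup{}
	for _, g := range gpus {
		grp, ok := byName[g.Name]
		if !ok {
			grp = &deviceGroup{Name: g.Name, Attributes: map[string]string{}}
			byName[g.Name] = grp
		}
		grp.IDs = append(grp.IDs, g.UUID)
		// A group-level attribute holds one value, so each GPU in the
		// group overwrites the UUID written by the previous one.
		grp.Attributes["uuid"] = g.UUID
	}

	// Sort group names so the result order is deterministic.
	names := make([]string, 0, len(byName))
	for n := range byName {
		names = append(names, n)
	}
	sort.Strings(names)

	out := make([]deviceGroup, 0, len(names))
	for _, n := range names {
		out = append(out, *byName[n])
	}
	return out
}

func main() {
	// Two GPUs on one dual-GPU card: same name, different (made-up) UUIDs.
	groups := groupByName([]gpu{
		{UUID: "GPU-aaaa", Name: "Tesla K80"},
		{UUID: "GPU-bbbb", Name: "Tesla K80"},
	})
	for _, g := range groups {
		fmt.Printf("%s ids=%v uuid attribute=%q\n", g.Name, g.IDs, g.Attributes["uuid"])
	}
}
```

Both UUIDs survive in the group's ID list, but the single "uuid" attribute can only describe one of the two GPUs, so an affinity or constraint on that attribute cannot tell them apart.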
Oh, you're right @shumin1027, I misunderstood the code. I don't think it's possible to fix this at the plugin level, it seems like a limitation within Nomad.
I opened https://github.com/hashicorp/nomad/pull/15455 to try and fix this. I'm building custom binaries for you to test if you have the chance and I will post them here once they're ready.
Here are the custom binaries: https://github.com/hashicorp/nomad/actions/runs/3604896100#artifacts
@lgfa29 That's great, I will continue testing.
Nice! If the fix works for you, feel free to close this issue. I expect the PR to be merged soon and to be released in the next version of Nomad.
Hello @shumin1027, @lgfa29! I tried using device.ids as a constraint in a job file, and every time I get random GPUs instead of the ones whose UUIDs I set. Here is the link to the Nomad forum post I made, which has more details about the issue.
Thank you!
Hi @ruspaul013 👋
I answered in your post, but TL;DR: yes, you need to upgrade your Nomad clients to a version that includes this change.
Hello @lgfa29 👋 Thank you so much for the answer. I posted an update on the forum.
I want to use the GPU UUID when configuring affinity or constraint, but there is no such attribute in the Fingerprinted Attributes. How can this be achieved?