Open stas00 opened 3 years ago
Thanks for raising an issue @stas00 !
I understand the confusion/frustration regarding CUDA_VISIBLE_DEVICES here. That environment variable specifies the devices that a CUDA application can use at run time, but NVML/PyNVML is an API for monitoring the system-level state of GPUs (and doesn't have anything to do with CUDA). For this reason, it doesn't really make sense for an API like nvmlDeviceGetHandleByIndex to interpret its index argument as one that respects CUDA_VISIBLE_DEVICES. With that said, there is no reason we can't provide a user-friendly mechanism to translate a CUDA-visible device index into a system-level device index.
From a user perspective, I like your suggestion to add an optional argument to something like nvmlInit. However, I'd be hesitant to make any change that results in a PyNVML function behaving differently than the NVML function of the same name. What if we add a separate API (something like cuda_id_to_index) to do the same kind of mapping that you currently need to do yourself? The user would still need to perform this CUDA-to-system index translation in their code, but it could be as simple as an extra API call, e.g. pynvml.nvmlDeviceGetHandleByIndex(pynvml.cuda_id_to_index(1))
If the new API seems too messy, perhaps it would be better to add the optional kwarg to just the nvmlDeviceGetHandleByIndex function, so that something like this would work: pynvml.nvmlDeviceGetHandleByIndex(1, cuda_device=True)
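For illustration, a minimal sketch of what such a cuda_id_to_index helper might do (the name and behavior are only the proposal above, not an existing pynvml API, and this ignores the CUDA_DEVICE_ORDER caveat raised later in the thread):

import os

def cuda_id_to_index(cuda_id):
    """Translate a CUDA-visible device id into a system-level NVML index."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # without CUDA_VISIBLE_DEVICES the two numbering schemes coincide
        return cuda_id
    ids = [int(i) for i in visible.split(",") if i.strip()]
    return ids[cuda_id]

# usage, as proposed above:
# handle = pynvml.nvmlDeviceGetHandleByIndex(cuda_id_to_index(1))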
Both of your suggestions sound good to me, @rjzamora. I'm having a hard time deciding which one I prefer. I think the latter, since it doesn't require an intermediary variable, which may then lead to confusion in the code: it becomes easy to mix up which of the two ids to use for other non-pynvml functionality (if one doesn't stack calls). So my preference is pynvml.nvmlDeviceGetHandleByIndex(1, cuda_device=True). But either of them works.
If it's helpful please feel free to re-use the little remapper I wrote :)
And thank you so much for making pynvml much more than just a set of bindings!
I might suggest changing cuda_device=True to use_cuda_visible_device=True to be explicit.
Thoughts?
Be careful with this automagic remapping - it only works if the following is set:
import os
# https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
OR:
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export CUDA_VISIBLE_DEVICES="0,1"
If this environment variable is not set, CUDA defaults to:
CUDA_DEVICE_ORDER="FASTEST_FIRST"
which means that your fastest GPU gets index zero. So if your fastest GPU happens to be in your device 1 (2nd device) slot, it will actually be at index zero and the code above would fail. You probably need to check both:
if "CUDA_VISIBLE_DEVICES" in os.environ:
if "CUDA_DEVICE_ORDER" not in os.environ or os.environ["CUDA_DEVICE_ORDER"] == "FASTEST_FIRST":
# do something to tell the user this won't work
raise ValueError('''We can't remap if you are using os.environ["CUDA_DEVICE_ORDER"] == "FASTEST_FIRST"''')
See this article for more details: https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
cuda_device=True to use_cuda_visible_device=True
Was either of these implemented?
Alternatively, is there a way to get the NVML-compatible index given a torch.Device?
I understand that this is a python binding to nvml, which ignores CUDA_VISIBLE_DEVICES, but perhaps this feature could be respected in pynvml? Otherwise we end up with inconsistent behavior between pytorch (tf?) and pynvml.
For example, on this setup I have card 0 (24GB) and card 1 (8GB).
If I run:
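(The original snippet isn't reproduced here; the call was along these lines - an illustrative sketch, not the exact code:)

import os
import pynvml

# make only the second card (system id 1, the 8GB one) visible to CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # 0 = the only CUDA-visible device
print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)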
I get the output for card 0, even though I was expecting output for card 1.
The expected output is what I get if I explicitly pass the system ID to nvml:
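(again an illustrative sketch rather than the original snippet:)

handle = pynvml.nvmlDeviceGetHandleByIndex(1)  # 1 = system-level id of the 8GB card
print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)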
So I get the wrong card in the first snippet - I get card 0, rather than card 1, indexed as 0th.
The conflict with pytorch happens when I call id = torch.cuda.current_device() - which returns 0 with CUDA_VISIBLE_DEVICES="1". I hope my explanation makes it clear where I have a problem.
pynvml could respect CUDA_VISIBLE_DEVICES if the latter is set. Of course, if this is attempted, then we can't just change the normal behavior as it'd break people's code. Perhaps, if pynvml.nvmlInit(respect_cuda_visible_devices=True) is passed, then it could magically remap the id arg to nvmlDeviceGetHandleByIndex to the corresponding id in CUDA_VISIBLE_DEVICES. So in the very first snippet above, nvmlDeviceGetHandleByIndex(0) would actually be called for id=1, as it's the 0th device relative to CUDA_VISIBLE_DEVICES="1".
So the nvmlDeviceGetHandleByIndex() arg would become an index with respect to CUDA_VISIBLE_DEVICES, e.g. CUDA_VISIBLE_DEVICES="1,0" would reverse the ids.
Thank you!
Meanwhile I added the following workaround to my software:
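The exact code is in the link below, but in spirit it remaps the CUDA-visible id to the system-level id before asking NVML for a handle, roughly like this sketch (the function name here is illustrative, not the actual helper):

import os
import pynvml

def nvml_handle_for_cuda_id(cuda_id):
    """Return an NVML handle for device `cuda_id` as CUDA numbers it."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "").strip()
    if visible:
        # assumes CUDA_DEVICE_ORDER=PCI_BUS_ID - see the caveat earlier in the thread
        cuda_id = [int(x) for x in visible.split(",")][cuda_id]
    pynvml.nvmlInit()
    return pynvml.nvmlDeviceGetHandleByIndex(cuda_id)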
If someone needs this as a helper wrapper, you can find it here: https://github.com/stas00/ipyexperiments/blob/3db0bbac2e2e6f1873b105953d9a7b3b7ca491b1/ipyexperiments/utils/mem.py#L33