gpuopenanalytics / pynvml

Provide Python access to the NVML library for GPU diagnostics
BSD 3-Clause "New" or "Revised" License

adding support for CUDA_VISIBLE_DEVICES which is currently ignored #28

Open stas00 opened 3 years ago

stas00 commented 3 years ago

I understand that this is a Python binding to NVML, which ignores CUDA_VISIBLE_DEVICES, but perhaps this environment variable could be respected in pynvml? Otherwise we end up with inconsistent behavior between PyTorch (TF?) and pynvml.

For example, on this setup I have card 0 (24GB) and card 1 (8GB).

If I run:

    CUDA_VISIBLE_DEVICES=1 python -c "import pynvml; pynvml.nvmlInit(); handle = pynvml.nvmlDeviceGetHandleByIndex(0); print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)"
    25447170048

which is the output for card 0, even though I was expecting output for card 1.

The expected output is the one I get if I explicitly pass the system ID to NVML:

    python -c "import pynvml; pynvml.nvmlInit(); handle = pynvml.nvmlDeviceGetHandleByIndex(1); print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)"
    8513978368

So in the first snippet I get the wrong card: card 0 instead of card 1, which should be the 0th (and only) visible device.

The conflict with PyTorch happens when I call id = torch.cuda.current_device(), which returns 0 with CUDA_VISIBLE_DEVICES="1". I hope it is clear where the problem lies.

pynvml could respect CUDA_VISIBLE_DEVICES if the latter is set.

Of course, if this is attempted, we can't just change the default behavior, as that would break people's code. Perhaps, if pynvml.nvmlInit(respect_cuda_visible_devices=True) were passed, it could transparently remap the id argument of nvmlDeviceGetHandleByIndex to the corresponding id in CUDA_VISIBLE_DEVICES. So in the very first snippet above, nvmlDeviceGetHandleByIndex(0) would actually be called for id=1, since that is the 0th device relative to `CUDA_VISIBLE_DEVICES="1"`.

So the nvmlDeviceGetHandleByIndex() argument would become an index with respect to CUDA_VISIBLE_DEVICES, e.g. `CUDA_VISIBLE_DEVICES="1,0"` would reverse the ids.

Thank you!

Meanwhile I added the following workaround to my software:

    import os
    import pynvml
    import torch
    [...]
    if id is None:
        id = torch.cuda.current_device()
    # if CUDA_VISIBLE_DEVICES is used automagically remap the id since pynvml ignores this env var
    if "CUDA_VISIBLE_DEVICES" in os.environ:
        ids = list(map(int, os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))
        id = ids[id] # remap
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(id)
        [...]

If someone needs this as a helper wrapper, you can find it here: https://github.com/stas00/ipyexperiments/blob/3db0bbac2e2e6f1873b105953d9a7b3b7ca491b1/ipyexperiments/utils/mem.py#L33

rjzamora commented 3 years ago

Thanks for raising an issue @stas00 !

I understand the confusion/frustration regarding CUDA_VISIBLE_DEVICES here. That environment variable specifies the devices that a CUDA application can use at run time, but NVML/PyNVML is an API for monitoring the system-level state of GPUs (and doesn't have anything to do with CUDA). For this reason, it doesn't really make sense for an API like nvmlDeviceGetHandleByIndex to return an index that respects CUDA_VISIBLE_DEVICES. With that said, there is no reason we can't provide a user-friendly mechanism to translate a CUDA-visible device index into a system-level device index.

From a user perspective, I like your suggestion to add an optional argument to something like nvmlInit. However, I'd be hesitant to make any change that results in a PyNVML function behaving differently than the NVML function of the same name. What if we add a separate API (something like cuda_id_to_index) to do the same kind of mapping that you currently need to do yourself? The user would still need to perform this CUDA-to-system index translation in their code, but it could be as simple as an extra API call, e.g. pynvml.nvmlDeviceGetHandleByIndex(pynvml.cuda_id_to_index(1)).

If the new API seems too messy, perhaps it would be better to add the optional kwarg to just the nvmlDeviceGetHandleByIndex function, so that something like this would work: pynvml.nvmlDeviceGetHandleByIndex(1, cuda_device=True)
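
For illustration, here is a rough sketch of what a cuda_id_to_index style helper could look like (hypothetical: neither cuda_id_to_index nor the cuda_device kwarg exists in pynvml today; this simply mirrors the CUDA_VISIBLE_DEVICES remapping discussed above and assumes the variable contains integer ids):

    import os
    import pynvml

    def cuda_id_to_index(cuda_id):
        """Translate a CUDA-visible device id into a system-level NVML index."""
        visible = os.environ.get("CUDA_VISIBLE_DEVICES")
        if visible is None:
            # no masking in effect: CUDA ids and NVML indices coincide
            return cuda_id
        ids = [int(i) for i in visible.split(",") if i.strip()]
        return ids[cuda_id]

    # either proposed spelling would then boil down to something like:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(cuda_id_to_index(1))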

stas00 commented 3 years ago

Both of your suggestions sound good to me, @rjzamora. I'm having a hard time deciding which one I prefer. I lean toward the latter, since it doesn't require an intermediary variable, which could lead to confusion in the code: it becomes easy to mix up which of the two ids to use for other non-pynvml functionality (if one doesn't stack the calls). So my preference is pynvml.nvmlDeviceGetHandleByIndex(1, cuda_device=True), but either of them works.

If it's helpful please feel free to re-use the little remapper I wrote :)

And thank you so much for making pynvml much more than just a set of bindings!

kenhester commented 3 years ago

I might suggest changing cuda_device=True to use_cuda_visible_device=True, to be explicit.

Thoughts?

Alex-ley commented 2 years ago

Be careful with this automagic remapping: it only works if the following is set:

    import os
    # https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

OR:

    export CUDA_DEVICE_ORDER="PCI_BUS_ID"
    export CUDA_VISIBLE_DEVICES="0,1"

If this environment variable is not set, CUDA defaults to:

    CUDA_DEVICE_ORDER="FASTEST_FIRST"

which means that your fastest GPU is assigned index zero. So if your fastest GPU sits in the second slot (device 1) of your machine, it will actually be at index zero, and the code above would fail. You probably need to check both:

if "CUDA_VISIBLE_DEVICES" in os.environ:
    if "CUDA_DEVICE_ORDER" not in os.environ or os.environ["CUDA_DEVICE_ORDER"] == "FASTEST_FIRST":
        # do something to tell the user this won't work
        raise ValueError('''We can't remap if you are using os.environ["CUDA_DEVICE_ORDER"] == "FASTEST_FIRST"''')

See this article for more details: https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
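
Putting the two checks together, a defensive version of the remap might look something like this (just a sketch, assuming integer ids in CUDA_VISIBLE_DEVICES):

    import os

    def remap_to_nvml_index(cuda_id):
        """Remap a CUDA-visible id to a system-level NVML index, refusing when
        the device ordering makes the remap unsafe."""
        if "CUDA_VISIBLE_DEVICES" not in os.environ:
            return cuda_id
        if os.environ.get("CUDA_DEVICE_ORDER") != "PCI_BUS_ID":
            raise ValueError("Set CUDA_DEVICE_ORDER=PCI_BUS_ID before remapping via CUDA_VISIBLE_DEVICES")
        ids = [int(i) for i in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
        return ids[cuda_id]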

tmm1 commented 11 months ago

> cuda_device=True to use_cuda_visible_device=True

Were either of these implemented?

Alternatively, is there a way to get the NVML-compatible index given a torch.device?
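
For what it's worth, one possible sketch of that mapping, building on the same CUDA_VISIBLE_DEVICES remapping used earlier in this thread (hypothetical helper name; assumes CUDA_DEVICE_ORDER=PCI_BUS_ID and integer ids in CUDA_VISIBLE_DEVICES):

    import os
    import pynvml
    import torch

    def nvml_handle_for_torch_device(device):
        # resolve the CUDA index of the torch device ("cuda", "cuda:1", or a torch.device)
        idx = torch.device(device).index
        if idx is None:
            idx = torch.cuda.current_device()
        # remap through CUDA_VISIBLE_DEVICES, since pynvml ignores it
        visible = os.environ.get("CUDA_VISIBLE_DEVICES")
        if visible:
            idx = [int(i) for i in visible.split(",")][idx]
        return pynvml.nvmlDeviceGetHandleByIndex(idx)

    pynvml.nvmlInit()
    handle = nvml_handle_for_torch_device("cuda:0")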