Closed hvasbath closed 4 months ago
Ah yeah. I haven't ever used cutde in a multi-GPU setting, but I would recommend using `multiprocessing` and setting the `CUDA_DEVICE` environment variable before using cutde. Ideally, before any imports:
```python
import os
os.environ['CUDA_DEVICE'] = ...
import ...
```
or
```
CUDA_DEVICE=5 python script.py
```
### Some explanation:
- `cutde` runs this `ensure_initialized` function: https://github.com/tbenthompson/cutde/blob/3ac6a3e19f70ba50fa87a16d81bec4d96760e273/cutde/cuda.py#L14-L23
- that function gets run on the first call to a cutde function. For example, `compile`: https://github.com/tbenthompson/cutde/blob/3ac6a3e19f70ba50fa87a16d81bec4d96760e273/cutde/cuda.py#L82-L85
- internally, it uses `pycuda.autoinit`. https://documen.tician.de/pycuda/util.html#module-pycuda.autoinit
- which calls `pycuda.tools.make_default_context` https://documen.tician.de/pycuda/util.html#pycuda.tools.make_default_context
- which accepts the CUDA_DEVICE environment variable for setting the device number.
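The lazy-initialization chain described above can be sketched roughly like this. This is a simplified stand-in, not cutde's actual code (that is linked above); `fake_backend_init` is a hypothetical placeholder for the `import pycuda.autoinit` step:

```python
# Sketch of lazy one-time initialization, similar in spirit to cutde's
# ensure_initialized. The real version imports pycuda.autoinit, which
# reads the CUDA_DEVICE environment variable when creating the context.
init_calls = []
_initialized = False

def fake_backend_init():
    # Hypothetical stand-in for `import pycuda.autoinit`.
    init_calls.append("init")

def ensure_initialized():
    global _initialized
    if not _initialized:
        fake_backend_init()
        _initialized = True

def compile_module(source):
    # Public entry points call ensure_initialized first, so the backend
    # is set up on first use rather than at import time.
    ensure_initialized()
    return "compiled:" + source
```

Because initialization is deferred to the first call, anything that changes the device selection (like `CUDA_DEVICE`) has to happen before that first call.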
Hope that's useful! I'm excited that you've got so many questions. It means you're really battle-testing cutde on different tasks and hardware settings!
Oh damn. I just realized that the `--use_fast_math` compiler argument is probably the reason for those NaN bugs you fixed a few weeks back... Without that flag, I bet the floating point behavior would've been identical to Matlab's. Well, now that the bugs are fixed, it's probably fine to keep the flag, but I should've thought to check this back then.
Awesome! The environment variable already helps a ton as a workaround. The multiprocessing approach is indeed what I ultimately want to use. I am running several MCMC chains in parallel from a parent process that creates a multiprocessing Pool, which then runs several forks of the sampling process. So I would need a way to tell pycuda, after forking, which worker in the pool is supposed to use which GPU. For that, the pre-import or environment variable would not work; I would likely need to modify your ensure_initialized function and assign different CUDA contexts to the processes. Right now it always tries to create one context on the same GPU during forking.
If the functions are indeed running faster now because of that flag, I am all happy ;) . I learned a lot doing the fixes, don't worry ;) . It's all fun!
So you should be able to use the environment variable combined with multiprocessing. You'll just need to wait to import cutde until after setting the variable. Like:
```python
import multiprocessing as mp

def process(device_idx):
    import os
    os.environ['CUDA_DEVICE'] = str(device_idx)
    import cutde
    # your code here.

def main():
    pool = mp.Pool()
    pool.map(process, list(range(4)))

if __name__ == "__main__":
    main()
```
> If the functions are indeed running faster now because of that flag, I am all happy ;) . I learned a lot doing the fixes, don't worry ;) . It's all fun!
=) glad!!
Hm, the waiting is exactly the issue. I am compiling a computational graph in the parent process, where cutde has to be imported once before forking the child processes. It is a little confusing that autoinit does not work, since they state that they try to use the available GPU devices in a "round robin" fashion.
Following this https://www.mail-archive.com/pycuda@tiker.net/msg03681.html I also tried setting the GPUs to exclusive mode through nvidia-smi; however, it still didn't work.
What do you mean by "compiling a computational graph"?
I would guess that CUDA will fail completely if you initialize CUDA in the parent process and then try to use it from child processes because the CUDA context will not behave properly when forked like that.
You might be able to switch from `fork` to `spawn` for multiprocessing process launching: https://stackoverflow.com/a/55812288
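A minimal sketch of the `spawn` approach, assuming 2 worker processes; each spawned worker is a fresh interpreter, so it does not inherit a CUDA context from the parent, and the cutde import is only indicated in a comment:

```python
import os
import multiprocessing as mp

def worker(device_idx):
    # A spawned worker starts with a clean interpreter, so setting
    # CUDA_DEVICE here takes effect before any CUDA initialization.
    os.environ['CUDA_DEVICE'] = str(device_idx)
    # import cutde  # safe here: after the variable is set
    return device_idx

def main():
    # 'spawn' avoids forking a process that already holds a CUDA context.
    ctx = mp.get_context('spawn')
    with ctx.Pool(2) as pool:
        return pool.map(worker, range(2))

if __name__ == "__main__":
    print(main())
```

Note that with `spawn`, the parent can still import cutde for its own work, since the children do not inherit its state.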
I have access to a multi-GPU cloud instance at work, so I might be able to help debug if you could put together a minimal reproducing example.
I'm also confused by autoinit failing but I don't fully understand your setup right now.
Awesome! Thanks for the hint! Together with the hint from here on using a multiprocessing context: https://github.com/pytorch/pytorch/issues/3492 I am a step closer to what I need! Now I can run several subprocesses without crashing as before; however, they still use the same GPU ...
It looks like this got resolved? Feel free to re-open if needed!
I just had no time to investigate further, and probably never will.
No worries! I've dealt with a bunch of CUDA stuff in other settings, and this is a consistent issue: `fork`-ing a process doesn't work after CUDA has already been initialized. So there's no way to solve this issue within cutde; it's just a basic limitation of CUDA.
This is a question rather than an issue.
I have multiple GPUs at hand and I would want to run multiple processes each on a different GPU.
For now, all the processes automatically try to run on the first GPU only, which then raises `pycuda._driver.LogicError: cuMemAlloc failed: initialization error`.
Do you know how one would need to do this? I found your `CUDAContextWrapper` class, but I did not find where it is being used. I guess one could pass in different contexts from the different GPUs there...?
I found a post on how to build a GPUThread here: https://shephexd.github.io/development/2017/02/19/pycuda.html which hints at that approach. However, I do not fully grasp how I would integrate it into cutde ...
Thanks in advance for any suggestion!