Closed hvasbath closed 4 months ago
Ah yeah. I haven't ever used cutde in a multi-GPU setting, but I would recommend using `multiprocessing` and setting the `CUDA_DEVICE` environment variable before using cutde. Ideally, before any imports:
```python
import os
os.environ['CUDA_DEVICE'] = ...
import ...
```
or
```
CUDA_DEVICE=5 python script.py
```
### Some explanation:
- `cutde` runs this `ensure_initialized` function: https://github.com/tbenthompson/cutde/blob/3ac6a3e19f70ba50fa87a16d81bec4d96760e273/cutde/cuda.py#L14-L23
- that function gets run on the first call to a cutde function. For example, `compile`: https://github.com/tbenthompson/cutde/blob/3ac6a3e19f70ba50fa87a16d81bec4d96760e273/cutde/cuda.py#L82-L85
- internally, it uses `pycuda.autoinit`. https://documen.tician.de/pycuda/util.html#module-pycuda.autoinit
- which calls `pycuda.tools.make_default_context` https://documen.tician.de/pycuda/util.html#pycuda.tools.make_default_context
- which accepts the CUDA_DEVICE environment variable for setting the device number.
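The lazy-initialization chain described above can be sketched roughly like this. This is a simplified stand-in, not cutde's actual code (that is linked above); `fake_backend_init` is a hypothetical placeholder for the `import pycuda.autoinit` step:

```python
# Sketch of lazy one-time initialization, similar in spirit to cutde's
# ensure_initialized. The real version imports pycuda.autoinit, which
# reads the CUDA_DEVICE environment variable when creating the context.
init_calls = []
_initialized = False

def fake_backend_init():
    # Hypothetical stand-in for `import pycuda.autoinit`.
    init_calls.append("init")

def ensure_initialized():
    global _initialized
    if not _initialized:
        fake_backend_init()
        _initialized = True

def compile_module(source):
    # Public entry points call ensure_initialized first, so the backend
    # is set up on first use rather than at import time.
    ensure_initialized()
    return "compiled:" + source
```

Because initialization is deferred to the first call, anything that changes the device selection (like `CUDA_DEVICE`) has to happen before that first call.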
Hope that's useful! I'm excited that you've got so many questions. It means you're really battle-testing cutde on different tasks and hardware settings!
Oh damn. I just realized that the `--use_fast_math` compiler argument is probably the reason for those NaN bugs you fixed a few weeks back... Without that flag, I bet the floating point behavior would've been identical to Matlab's. Well, now that the bugs are fixed, it's probably fine to keep the flag, but I should've thought to check this back then.
Awesome! The environment variable already helps a ton as a workaround. The multiprocessing approach is indeed what I ultimately want to use. I am running several MCMC chains in parallel from a parent process that creates a multiprocessing Pool, which then runs several forks of the sampling process. So I would need a way to tell pycuda, after forking, which worker in the pool is supposed to use which GPU. For that, the pre-import or environment variable would not work; I would likely need to modify your ensure_initialized function and assign different CUDA contexts to the processes. Right now it always tries to create one context on the same GPU during forking.
If the functions are indeed running faster now because of that flag, I am all happy ;) . I learned a lot doing the fixes, don't worry ;) . It's all fun!
So you should be able to use the environment variable combined with multiprocessing. You'll just need to wait to import cutde until after setting the variable. Like:
```python
import multiprocessing as mp

def process(device_idx):
    import os
    os.environ['CUDA_DEVICE'] = str(device_idx)
    import cutde
    # your code here.

def main():
    pool = mp.Pool()
    pool.map(process, list(range(4)))

if __name__ == "__main__":
    main()
```
> If the functions are indeed running faster now because of that flag, I am all happy ;) . I learned a lot doing the fixes, don't worry ;) . It's all fun!
=) glad!!
Hm, the waiting is exactly the issue. I am compiling a computational graph in the parent process, where cutde has to be imported once before forking the child processes. It is a little confusing that autoinit does not work, since they state that they try to use the available GPU devices in a "round robin" fashion.
Following this https://www.mail-archive.com/pycuda@tiker.net/msg03681.html I also tried setting the GPUs to exclusive mode through nvidia-smi; however, it still didn't work.
What do you mean by "compiling a computational graph"?
I would guess that CUDA will fail completely if you initialize CUDA in the parent process and then try to use it from child processes because the CUDA context will not behave properly when forked like that.
You might be able to switch from `fork` to `spawn` for multiprocessing process launching: https://stackoverflow.com/a/55812288
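A minimal sketch of the `spawn` approach, assuming 2 worker processes; each spawned worker is a fresh interpreter, so it does not inherit a CUDA context from the parent, and the cutde import is only indicated in a comment:

```python
import os
import multiprocessing as mp

def worker(device_idx):
    # A spawned worker starts with a clean interpreter, so setting
    # CUDA_DEVICE here takes effect before any CUDA initialization.
    os.environ['CUDA_DEVICE'] = str(device_idx)
    # import cutde  # safe here: after the variable is set
    return device_idx

def main():
    # 'spawn' avoids forking a process that already holds a CUDA context.
    ctx = mp.get_context('spawn')
    with ctx.Pool(2) as pool:
        return pool.map(worker, range(2))

if __name__ == "__main__":
    print(main())
```

Note that with `spawn`, the parent can still import cutde for its own work, since the children do not inherit its state.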
I have access to a multi-GPU cloud instance at work, so I might be able to help debug if you could put together a minimal reproducing example.
I'm also confused by autoinit failing but I don't fully understand your setup right now.
Awesome! Thanks for the hint! Together with the hint from here on using a multiprocessing context: https://github.com/pytorch/pytorch/issues/3492 I am a step closer to what I need! Now I can run several subprocesses without crashing as before; however, they still use the same GPU ...
It looks like this got resolved? Feel free to re-open if needed!
I just had no time to investigate further, and probably never will.
No worries! I've dealt with a bunch of CUDA stuff in other settings, and this is a consistent issue: `fork`-ing a process doesn't work after CUDA has already been initialized. So there's no way to solve this issue within cutde; it's just a basic limitation of CUDA.
This is a question rather than an issue.
I have multiple GPUs at hand and I would want to run multiple processes each on a different GPU.
For now, all the processes automatically try to run on the first GPU only, which then raises `pycuda._driver.LogicError: cuMemAlloc failed: initialization error`.
Do you know how one would need to do this? I found your `CUDAContextWrapper` class, but I did not find where it is being used. I guess one could pass in different contexts from the different GPUs there...?
I found a post on how to build a GPUThread here: https://shephexd.github.io/development/2017/02/19/pycuda.html which hints at that approach. However, I do not fully grasp how I would integrate it into cutde ...
Thanks in advance for any suggestion!