Open lidc54 opened 1 year ago
Taichi can't use multiple GPUs at the moment. To use multiple GPUs you need to run Taichi in separate processes, one per device, so it doesn't play well with torch's multi-GPU solutions.
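To illustrate what "one Taichi process per GPU" could look like, here is a minimal sketch. It is my own assumption about a workable pattern, not official multi-GPU support: each spawned process hides every GPU except its own via CUDA_VISIBLE_DEVICES before calling ti.init, and the worker/fill names are made up for illustration.

```python
import os
import multiprocessing as mp
import numpy as np

def worker(gpu_id: int) -> None:
    # Hide every GPU except this one *before* Taichi creates its CUDA
    # context, so "device 0" inside this process is physical GPU gpu_id.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    import taichi as ti  # imported after the env var is set, to be safe
    ti.init(arch=ti.cuda)

    @ti.kernel
    def fill(x: ti.types.ndarray()):
        for i in range(x.shape[0]):
            x[i] = i

    buf = np.zeros(1024, dtype=np.float32)
    fill(buf)
    print(f"process for GPU {gpu_id} finished, buf[-1] = {buf[-1]}")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # fresh interpreter per GPU
    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```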
@turbo0628
It would be really appreciated if Taichi supported multiple GPUs.
I sometimes write custom kernels by hand, complete with Makefiles, CMake, and/or PyTorch extensions. Taichi could do away with the cumbersome binding code, which would let us focus on the algorithm and keep debugging time to a minimum.
Recently, I made my first Taichi kernel that is called from multiple processes spawned by PyTorch Lightning. However, I could not fix the illegal memory access error, which seems to be caused by ti.init in a multi-process environment, so I moved back to the classical approach, which takes 2-3x more time than using Taichi.
I have been running into the same issue. Is there any way around this? In theory it seems like Taichi should be able to bind to the correct GPU and run on it, but there seems to be some hardcoded logic making it bind to the first GPU, which causes an illegal memory access error. With torch's DistributedDataParallel it would be fine as long as Taichi's context could be bound to the correct GPU given by the local rank. This not working currently precludes the use of Taichi in the implementation of any large model that requires multi-GPU training.
@turbo0628
Are there any workarounds that would allow us to force Taichi onto a specific GPU index for each process? For example, something that would let us set the GPU index in ti.init(ti.cuda)?
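One workaround that may help (an assumption on my part, not an official ti.init option): since Taichi appears to grab the first visible GPU, hide every GPU except the one for your rank before calling ti.init in each DDP worker. A minimal sketch, assuming the launcher exports LOCAL_RANK (recent torch.distributed.launch/torchrun do; older versions pass a --local_rank argument instead):

```python
import os

# Must run before Taichi (or torch) creates any CUDA context.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

import taichi as ti
import torch
import torch.distributed as dist

ti.init(arch=ti.cuda)             # now sees exactly one GPU: this rank's
dist.init_process_group("nccl")   # NCCL still works; each rank owns its own cuda:0
device = torch.device("cuda", 0)  # physical GPU == local_rank after the remapping
```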
I used the code from blog_code as a layer in PyTorch and then ran it on multiple GPUs to see the result, but an error occurs (a minimal stand-in sketch of such a layer is included at the end of this comment). The launch command was:
python -m torch.distributed.launch --nproc_per_node=2 main.py
We can see that the GPU memory used on gpu-6 is twice that on gpu-7. When the images get bigger or the batch size increases, this will become a big problem. I suspect ti.init may be the source of the problem, but I cannot find a solution in the related issues. Does anyone know about this?
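For anyone who wants to poke at this without the original blog_code, here is a minimal hypothetical stand-in for a Taichi kernel wrapped as a PyTorch layer; the TaichiScale name and the scale kernel are made up for illustration and this is not the code from the blog.

```python
import taichi as ti
import torch

# In a multi-GPU run, per-rank device pinning (e.g. CUDA_VISIBLE_DEVICES)
# would have to happen before this line.
ti.init(arch=ti.cuda)

@ti.kernel
def scale(src: ti.types.ndarray(), dst: ti.types.ndarray(), alpha: ti.f32):
    # Element-wise multiply; Taichi parallelizes the outermost loop on the GPU.
    for i in range(src.shape[0]):
        dst[i] = alpha * src[i]

class TaichiScale(torch.nn.Module):
    """Toy layer that calls a Taichi kernel on a flattened CUDA tensor."""
    def __init__(self, alpha: float = 2.0):
        super().__init__()
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = x.contiguous().view(-1)
        out = torch.empty_like(flat)
        scale(flat, out, self.alpha)  # Taichi reads/writes the torch tensors directly
        return out.view_as(x)

if __name__ == "__main__":
    layer = TaichiScale(3.0)
    x = torch.ones(8, device="cuda")
    print(layer(x))  # tensor of 3.0s
```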