1999kevin opened this issue 1 year ago
@1999kevin Can you tell me how to use a GPU to generate images with a pretrained model, without the NCCL communication protocol? Thank you.
Just delete the mpiexec part from the sampling command.
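For example (a sketch: the mpiexec form mirrors the command quoted later in this thread, while --model_path and the checkpoint path are placeholders I'm assuming from similar diffusion codebases; substitute your own sampling flags for the ...):

    # Multi-GPU sampling via MPI, as in the repo's scripts:
    mpiexec -n 2 python ./scripts/image_sample.py --model_path checkpoints/model.pt ...
    # Single-GPU sampling without NCCL: drop the mpiexec prefix.
    python ./scripts/image_sample.py --model_path checkpoints/model.pt ...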
@1999kevin But I don't find mpiexec in image_sample.py. Thanks.
Can I have a look at the code after your changes? Thanks, I would appreciate it if you could send it over.
I'm still working on the training phase and am not so sure about the inference phase yet. I guess you can follow Line 48 and Line 51 in scripts/launch.sh to sample the images. If you want to use a single process, just run the command directly: python image_sample.py ...
I add CUDA_VISIBLE_DEVICES=6,7 in front of the inference command to form CUDA_VISIBLE_DEVICES=6,7 mpiexec -n 2 python ./scripts/image_sample.py ..., and change the code at ./cm/dist_util.py#L27 into:
    if 'CUDA_VISIBLE_DEVICES' not in os.environ:
        os.environ["CUDA_VISIBLE_DEVICES"] = f"{MPI.COMM_WORLD.Get_rank() % GPUS_PER_NODE}"
    else:
        # Pin each MPI rank to one entry of the user-provided GPU list.
        gpu_inds_list = os.environ["CUDA_VISIBLE_DEVICES"].split(',')
        # Wrap by the list length (not GPUS_PER_NODE) so a rank never indexes past the list.
        idx = MPI.COMM_WORLD.Get_rank() % len(gpu_inds_list)
        os.environ["CUDA_VISIBLE_DEVICES"] = gpu_inds_list[idx]
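To see what the mapping does, here is a minimal standalone sketch (the MPI rank is passed in by hand so it runs without mpi4py; pin_gpu is an illustrative name, not a function from the repo):

    import os

    def pin_gpu(rank: int) -> str:
        # Same mapping as above: pick the rank-th entry of CUDA_VISIBLE_DEVICES.
        gpu_inds_list = os.environ["CUDA_VISIBLE_DEVICES"].split(',')
        os.environ["CUDA_VISIBLE_DEVICES"] = gpu_inds_list[rank % len(gpu_inds_list)]
        return os.environ["CUDA_VISIBLE_DEVICES"]

    os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"
    print(pin_gpu(0))  # rank 0 -> "6"
    os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"  # reset, since pin_gpu overwrites it
    print(pin_gpu(1))  # rank 1 -> "7"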
Does it work?
I will test it once I finish the current training run.
Btw, I found that training with a batch size of only 4 and image size 64 costs about 18 GB of memory per GPU. Is there something wrong with that?
I also encountered similar problems in my test. I trained the model with batch size 2 and image size 256, which costs me 35 GB of memory.
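If you want to pin down where the memory goes, one quick check (a minimal sketch, assuming the training loop runs on PyTorch, which this repo uses) is to print the allocator's peak statistics after a training step:

    import torch

    # Run one forward/backward step of the training loop first, then:
    peak = torch.cuda.max_memory_allocated(device=0)
    print(f"peak allocated: {peak / 1024**3:.1f} GiB")
    # Note: nvidia-smi additionally counts the CUDA context and the allocator's
    # cached-but-free blocks, so it usually reports a larger number than this.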
Will the pretrained model also use such a large amount of GPU memory?
I haven't tested that case yet.
> I add CUDA_VISIBLE_DEVICES=6,7 in front of the inference command to form CUDA_VISIBLE_DEVICES=6,7 mpiexec -n 2 python ./scripts/image_sample.py ..., and change the code at ./cm/dist_util.py#L27
This change can definitely enable multi-GPU training. However, it may cause the error 'Expected q.stride(-1) == 1 to be true, but got false', as in issue #3. Changing the flash attention back to the default attention resolves the error.
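For context, that error comes from flash-attention kernels requiring the last dimension of q/k/v to be contiguous in memory (stride 1). A small illustration of the failure mode, unrelated to this repo's own code:

    import torch

    q = torch.randn(2, 8, 64, 40)        # (batch, heads, seq, head_dim)
    q_t = q.transpose(-1, -2)            # a view: last dim is no longer contiguous
    print(q_t.stride(-1))                # 40, not 1 -- this is what the check rejects
    print(q_t.contiguous().stride(-1))   # 1 -- .contiguous() is the usual workaround

Switching back to the default attention, as suggested above, avoids the strided path entirely.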
Nice job! I wonder how I can run the code on a single Linux server with multiple GPUs. I can run the code on the server with one GPU by not using mpiexec, but what if I want to use multiple GPUs, as with nn.DataParallel?
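For what it's worth, the repo's scripts are built around MPI with one process per GPU (mpiexec -n <num_gpus>, as above), so that is the intended multi-GPU route. If you specifically want single-process data parallelism, a generic PyTorch sketch (not this repo's own mechanism; the Linear module is a stand-in for the real model) would look like:

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 512)               # stand-in for the actual model
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)        # replicate per GPU, split the batch dim
    model = model.cuda()
    out = model(torch.randn(8, 512).cuda())   # batch of 8 scattered across GPUs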