tianrun-chen / SAM-Adapter-PyTorch

Adapting Meta AI's Segment Anything to Downstream Tasks with Adapters and Prompts
MIT License

How to run on multiple machines? #42

Open AnnemSony opened 1 year ago

tianrun-chen commented 1 year ago

Do you mean multiple GPUs?

AnnemSony commented 1 year ago

I have GPUs spread across multiple machines (a multi-node cluster). How can I run the training command?
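For reference, a typical multi-node launch with `torchrun` (the successor to `torch.distributed.launch`) looks roughly like the sketch below. This is a generic PyTorch pattern, not something specific to this repo: the node count, process count, IP address, and port are placeholders, and it assumes `train.py` initializes its process group from the environment variables the launcher sets (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, etc.).

```shell
# On node 0 (assume its IP is 192.168.1.10; adjust counts to your cluster)
torchrun --nnodes 2 --nproc_per_node 4 --node_rank 0 \
    --master_addr 192.168.1.10 --master_port 29500 \
    train.py --config configs/demo.yaml

# On node 1: identical command, only --node_rank changes
torchrun --nnodes 2 --nproc_per_node 4 --node_rank 1 \
    --master_addr 192.168.1.10 --master_port 29500 \
    train.py --config configs/demo.yaml
```

Every node must run its own copy of the command, and all nodes must be able to reach the master address and port.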

chusheng0505 commented 1 year ago

Hi, I have 4 GPUs and am trying to fine-tune the SAM-Adapter model. I used the command provided in the repo:

`CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch train.py --nnodes 1 --nproc_per_node 4 --config configs/demo.yaml`

Training ran successfully, but I found that only one GPU was used. How can I solve this? (I have checked the torch documentation but have no idea how to debug it.) @tianrun-chen

Bill-Ren commented 1 year ago

I also encountered this problem: only one card was used during distributed training. I also could not find the two parameters `--nnodes 1 --nproc_per_node 4` among the arguments parsed by train.py. Why?

Bill-Ren commented 1 year ago

> Hi, I have 4 GPUs and am trying to fine-tune the SAM-Adapter model. I used the command provided in the repo:
>
> `CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch train.py --nnodes 1 --nproc_per_node 4 --config configs/demo.yaml`
>
> Training ran successfully, but I found that only one GPU was used. How can I solve this? (I have checked the torch documentation but have no idea how to debug it.) @tianrun-chen

I found a solution to the problem. The launcher flags have to come before the script name, so the command should be:

`CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 train.py --config configs/demo.yaml --tag exp1`

You can check the documentation of `torch.distributed.launch` for details.
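The key detail here is how `torch.distributed.launch` splits the command line: everything before the script name is consumed by the launcher, and everything after is forwarded to the script. A side-by-side sketch (same flags as above; `--tag exp1` is just an example script argument):

```shell
# WRONG: --nnodes/--nproc_per_node come after train.py, so they are passed
# to train.py itself (which does not parse them) and the launcher falls
# back to its defaults, starting only a single process on one GPU.
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    train.py --nnodes 1 --nproc_per_node 4 --config configs/demo.yaml

# RIGHT: launcher flags before the script name, script flags after it.
# The launcher now spawns 4 processes, one per visible GPU.
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nnodes 1 --nproc_per_node 4 \
    train.py --config configs/demo.yaml --tag exp1
```

This is why the "wrong" command still trains without error: it simply runs as an ordinary single-process job on the first visible GPU.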