Closed: unrue closed this issue 10 months ago.
I see that you have passed a --nnodes=2 argument. I am afraid that multi-node training is not supported at the moment. The code supports single-node multi-GPU training.
Can you please start with a single-node, 2-GPU run? That is what I have tested so far. I am trying to scale the code further, but it will take some time.
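For example, a single-node run on 2 GPUs is usually launched with something along these lines (train.py is a placeholder for the repo's actual entry script, not a name taken from this thread):

```
torchrun --nnodes=1 --nproc_per_node=2 train.py
```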
OK, understood. The code is not multi-node, but it is multi-GPU. Why don't I see dist.all_reduce in the training loop? Is it somewhere else? How are the gradients synchronized across GPUs?
You are right, there is a mistake. I used SyncBatchNorm but forgot the all_reduce. I will push the corrected and updated code as soon as possible.
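For anyone following along, a minimal sketch of what manual gradient synchronization with dist.all_reduce typically looks like is below. The function name average_gradients and its placement are illustrative, not code from this repository; alternatively, wrapping the model in torch.nn.parallel.DistributedDataParallel makes an explicit all_reduce unnecessary, since DDP averages gradients during backward().

```python
import torch.distributed as dist

def average_gradients(model):
    """Average gradients across all ranks; call after loss.backward(), before optimizer.step()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this parameter's gradient over every process, then divide by the
            # number of processes so each rank ends up with the mean gradient.
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size

# Typical use inside the training loop:
#   loss.backward()
#   average_gradients(model)
#   optimizer.step()
```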
Hi,
I'm using this tool on an HPC machine with 4 GPUs per node. This is the launch command for 2 nodes with 4 GPUs each:
The code seems to be stuck and does nothing. Am I doing something wrong?
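For reference, a 2-node, 4-GPU-per-node job of the kind described above is typically launched by running something like the sketch below on each node; the hostname, port, and script name are placeholders for illustration, not the actual command used here (and, per the discussion above, multi-node runs are not supported by this code yet):

```
# Run on every node, adjusting --node_rank per node (0 on the master node, 1 on the other).
torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
    --master_addr=node01 --master_port=29500 train.py
```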