Open 810250082 opened 4 years ago
I have two machines with one GPU each. I tried distributed training as you described, but A hangs the whole time and B hangs after downloading the MNIST dataset. Here are the logs:
A $ python -m torch.distributed.launch --nproc_per_node=1 --nnode=2 --node_rank 0 --master_addr=192.168.2.138 --master_port=29500 main.py
B $ python -m torch.distributed.launch --nproc_per_node=1 --nnode=2 --node_rank 1 --master_addr=192.168.2.138 --master_port=29500 main.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to dataset/MNIST/raw/train-images-idx3-ubyte.gz
100.1%Extracting dataset/MNIST/raw/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to dataset/MNIST/raw/train-labels-idx1-ubyte.gz
113.5%Extracting dataset/MNIST/raw/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to dataset/MNIST/raw/t10k-images-idx3-ubyte.gz
100.4%Extracting dataset/MNIST/raw/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz
180.4%Extracting dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz
Processing...
Done!
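A hang like this on both nodes can happen when node B cannot open a TCP connection to the master address and port, for example because of a firewall. That is only one possible cause and is not confirmed by the logs here. Below is a minimal sketch of a reachability check using only the Python standard library. The helper name `can_reach` is hypothetical and not part of PyTorch, and the check is only meaningful while the launcher on A is already running, since that is when something should be listening on the port.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it is not a PyTorch API.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Run on B while A is waiting, `can_reach("192.168.2.138", 29500)` uses the address and port from the commands above. If it returns `False`, the problem is likely in the network setup rather than in the training script.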
Sorry for the late reply. Is this still a problem?
@lesliejackson Hi! Could you add code for loading the model and optimizer from an interrupted training run, for distributed training? Thanks.