lesliejackson / PyTorch-Distributed-Training

Example of PyTorch DistributedDataParallel

can not continue to run #1

Open 810250082 opened 4 years ago

810250082 commented 4 years ago

I have two machines with one GPU each. I tried distributed training as you described, but machine A hangs from the start, and machine B hangs after downloading the MNIST dataset. Here are the logs:

A $ python -m torch.distributed.launch --nproc_per_node=1 --nnode=2 --node_rank 0 --master_addr=192.168.2.138 --master_port=29500 main.py

B $ python -m torch.distributed.launch --nproc_per_node=1 --nnode=2 --node_rank 1 --master_addr=192.168.2.138 --master_port=29500 main.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to dataset/MNIST/raw/train-images-idx3-ubyte.gz
100.1%Extracting dataset/MNIST/raw/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to dataset/MNIST/raw/train-labels-idx1-ubyte.gz
113.5%Extracting dataset/MNIST/raw/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to dataset/MNIST/raw/t10k-images-idx3-ubyte.gz
100.4%Extracting dataset/MNIST/raw/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz
180.4%Extracting dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz
Processing...
Done!

lesliejackson commented 4 years ago

Sorry for the late reply. Is this still a problem?

longweiwei commented 3 years ago

@lesliejackson
Hi! Could you add code for loading the model and optimizer state, so that an interrupted distributed training run can be resumed? Thanks.
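For anyone landing here with the same question, below is a minimal sketch of checkpointing for resumable training. It is not code from this repository: the helper names (`unwrap`, `save_checkpoint`, `load_checkpoint`), the checkpoint keys, and the `rank` argument are all hypothetical choices made for illustration. It assumes the model may be wrapped in `DistributedDataParallel` (which exposes the original network as `.module`) and that only rank 0 should write the file.

```python
import os

import torch
import torch.nn as nn


def unwrap(model):
    # DistributedDataParallel keeps the original network in `.module`;
    # a plain nn.Module has no such attribute and is returned unchanged.
    return model.module if hasattr(model, "module") else model


def save_checkpoint(path, model, optimizer, epoch, rank=0):
    # Assumption: every rank holds identical weights, so only rank 0 writes.
    if rank != 0:
        return
    state = {
        "model": unwrap(model).state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
    }
    # Write to a temporary file first, then rename, so an interruption
    # during saving does not leave a truncated checkpoint at `path`.
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)
    os.replace(tmp_path, path)


def load_checkpoint(path, model, optimizer, map_location="cpu"):
    # Returns the epoch to resume from (0 when no checkpoint exists).
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location=map_location)
    unwrap(model).load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```

In a training script, one possible usage is to call `load_checkpoint` on every rank before the epoch loop (with `map_location` set to that process's own device, for example `f"cuda:{local_rank}"`), loop from the returned epoch, and call `save_checkpoint(..., rank=torch.distributed.get_rank())` at the end of each epoch. If other ranks read the file right after it is written, a `torch.distributed.barrier()` after saving is probably needed as well.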