tmbdev-archive / pytorch-imagenet-wds

Does main-wds.py support "--multiprocessing-distributed"? #1

Open shangw-nvidia opened 3 years ago

shangw-nvidia commented 3 years ago

Hi,

I'm not asking main-wds.py to support every flag in the original pytorch/examples ImageNet script, but I want to double-check that my understanding is correct (i.e., this is a question rather than a feature request):

It seems to me that the current version of main-wds.py only supports single-GPU training, because wds.worker_urls slices the list of urls only by the DataLoader worker id. Thus, if we turn on --multiprocessing-distributed, each process distributes the same list of urls to its own DataLoader workers in the same way, so every process ends up going over the full sequence of samples and the data is duplicated across processes. To support --multiprocessing-distributed, worker_urls would need to slice the urls not only by the DataLoader worker id but also by the process's rank, as in the sketch below.
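For concreteness, here is a minimal sketch of the double slicing I have in mind (the shard_by_rank_and_worker helper is my own illustration, not an existing WebDataset function):

```python
import torch.distributed as dist
from torch.utils.data import get_worker_info

def shard_by_rank_and_worker(urls):
    # Hypothetical helper: keep only the shard urls this (rank, worker) pair should read.
    # First slice across distributed processes (one strided slice per rank) ...
    if dist.is_available() and dist.is_initialized():
        urls = urls[dist.get_rank()::dist.get_world_size()]
    # ... then slice the remainder across DataLoader workers within this process.
    info = get_worker_info()
    if info is not None:
        urls = urls[info.id::info.num_workers]
    return urls
```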

Thanks!

ajtao commented 3 years ago

Along these lines, I tried running python main-wds.py --dist-url 'tcp://127.0.0.1:9999' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --batch-size 512 on an 8-GPU node and got:

torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
   File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
     fn(i, *args)
   File "/home/dcg-adlr-atao-data.cosmos277/data/pytorch-imagenet-wds/main-wds.py", line 219, in main_worker
     sampler.set_epoch(epoch)
AttributeError: '_InfiniteConstantSampler' object has no attribute 'set_epoch'

It would certainly help to have a known good example of DDP + webdataset.
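For what it's worth, a likely band-aid (my guess, assuming the failure comes from the DataLoader wrapping an IterableDataset, which gets a plain _InfiniteConstantSampler instead of a DistributedSampler) is to guard the set_epoch call; the train_loader name below is assumed:

```python
# Only call set_epoch when the loader's sampler actually provides it;
# with an IterableDataset there is no DistributedSampler to reseed.
sampler = getattr(train_loader, "sampler", None)
if hasattr(sampler, "set_epoch"):
    sampler.set_epoch(epoch)
```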

tmbdev commented 3 years ago

The way sharding is handled has changed in recent versions of WebDataset, and that probably broke --multiprocessing-distributed; I'll have a look.
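Roughly, recent versions split shards with explicit splitter functions; a sketch, assuming a release that provides wds.split_by_node (the shard pattern below is a placeholder):

```python
import webdataset as wds

# The nodesplitter spreads shards across DDP ranks; splitting across DataLoader
# workers within a rank is handled by the default worker splitter.
dataset = (
    wds.WebDataset("imagenet-train-{0000..1281}.tar", nodesplitter=wds.split_by_node)
    .shuffle(1000)
    .decode("pil")
    .to_tuple("jpg;png", "cls")
)
```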

Note that this repository is just a minimal port of the existing ImageNet example; for large-scale distributed training, you probably want a different training loop than the one in the PyTorch ImageNet example.

I'll add working examples to the WebDataset distribution.