shangw-nvidia opened 3 years ago
Along these lines, I have tried to run

```
python main-wds.py --dist-url 'tcp://127.0.0.1:9999' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --batch-size 512
```

on an 8-GPU node and got:
```
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/dcg-adlr-atao-data.cosmos277/data/pytorch-imagenet-wds/main-wds.py", line 219, in main_worker
    sampler.set_epoch(epoch)
AttributeError: '_InfiniteConstantSampler' object has no attribute 'set_epoch'
```
It would certainly help to have a known good example of DDP + webdataset.
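Until such an example exists, a minimal workaround sketch for the traceback above: the `AttributeError` arises because `set_epoch` exists on `DistributedSampler` but not on the `_InfiniteConstantSampler` that `DataLoader` attaches to an `IterableDataset`. The helper name below is hypothetical, not part of `main-wds.py`; it simply guards the call.

```python
def maybe_set_epoch(loader, epoch):
    """Call sampler.set_epoch(epoch) only when the sampler supports it.

    DistributedSampler implements set_epoch to reshuffle each epoch;
    the _InfiniteConstantSampler used for iterable-style datasets
    (such as WebDataset pipelines) does not, hence the AttributeError.
    """
    sampler = getattr(loader, "sampler", None)
    if hasattr(sampler, "set_epoch"):
        sampler.set_epoch(epoch)
```

With this guard, the same training loop runs unchanged whether the loader wraps a map-style dataset with a `DistributedSampler` or a WebDataset pipeline.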
The way sharding is handled has changed in recent versions of WebDataset, and that probably broke --multiprocessing-distributed; I'll have a look.
Note that this repository is just a minimal port of the existing Imagenet example; for large distributed training, you probably want a different training loop from the PyTorch Imagenet example.
I'll add working examples to the WebDataset distribution.
Hi,

Not that I'm asking `main-wds.py` to support every flag in the original pytorch/examples Imagenet script, but I want to double-check that my understanding is correct (i.e., this is a question rather than a feature request). It seems to me that the current version of `worker_urls` only supports single-GPU training, because `wds.worker_urls` slices the list of URLs only by the `DataLoader` worker id. Thus, if we turn on `--multiprocessing-distributed`, each process will distribute the list of URLs to its own `DataLoader` workers in the same way, resulting in each process going over the sequence of samples `num_workers` times. In order to support `--multiprocessing-distributed`, `worker_urls` would need to slice the URLs not only by the `DataLoader` worker id, but also by the process's rank. Thanks!