pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Classification references do not work without distributed setup #6529

Open pmeier opened 2 years ago

pmeier commented 2 years ago

If you don't set the respective env vars

https://github.com/pytorch/vision/blob/d5bd8b728f14c33b339fc45c90ca39be339bce3f/references/classification/utils.py#L255-L258

training will not be distributed and in turn the backend will not be initialized. However, during evaluation we check

https://github.com/pytorch/vision/blob/d5bd8b728f14c33b339fc45c90ca39be339bce3f/references/classification/train.py#L88

unguarded, which then fails with

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
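One way to guard such a check is to ask whether a process group exists before querying the rank. The following is only a minimal sketch, not the fix adopted in the repository: the helper name `get_rank_or_zero` is hypothetical, and treating a non-distributed run as rank 0 is an assumption. It relies on `torch.distributed.is_available()`, `torch.distributed.is_initialized()` and `torch.distributed.get_rank()`.

```python
import torch.distributed as dist


def get_rank_or_zero() -> int:
    """Return this process's rank, or 0 when no process group exists.

    Hypothetical helper for illustration; not part of the reference scripts.
    """
    # is_available() is False for PyTorch builds without distributed support;
    # is_initialized() is False until init_process_group() has been called.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    # Assumption: a non-distributed run behaves like a single rank-0 process.
    return 0
```

With a helper like this, a comparison such as `get_rank_or_zero() == 0` would not raise when the env vars are unset and the backend was never initialized.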

cc @datumbox

pmeier commented 2 years ago

Same for segmentation:

https://github.com/pytorch/vision/blob/cac4e228c9ca9e7564cb34406e7ebccfdd736976/references/segmentation/train.py#L84

YosuaMichael commented 2 years ago

I think this case is implicitly guarded by the condition at https://github.com/pytorch/vision/blob/d5bd8b728f14c33b339fc45c90ca39be339bce3f/references/classification/train.py#L87

since len(data_loader.dataset) != num_processed_samples shouldn't be true in a non-distributed setting.

Do you get the error during non-distributed training @pmeier ?
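The implicit guard described in this comment comes from Python's short-circuiting `and`: the rank lookup only runs if the length comparison is true. Here is a small sketch of that behaviour, assuming the two checks are chained with `and`. `get_rank` is a hypothetical stand-in that always raises, mimicking an uninitialized process group, and `should_warn` is likewise a made-up name.

```python
def get_rank() -> int:
    # Hypothetical stand-in for torch.distributed.get_rank() in a run
    # where no process group was initialized.
    raise RuntimeError("Default process group has not been initialized")


def should_warn(dataset_len: int, num_processed_samples: int) -> bool:
    # `and` short-circuits: get_rank() is only evaluated when the counts differ.
    return dataset_len != num_processed_samples and get_rank() == 0


# Counts match: the rank lookup is skipped and no error is raised.
print(should_warn(100, 100))

# Counts differ: the rank lookup runs and raises.
try:
    should_warn(100, 96)
except RuntimeError as err:
    print(f"RuntimeError: {err}")
```

So the guard holds only while the number of processed samples equals the dataset length. If the two ever differ in a non-distributed run, the unguarded rank check is reached and raises.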