What happened + What you expected to happen
Copied from https://discuss.ray.io/t/get-distributed-process-group-timeout-when-using-torch-trainer-fullsynciterdatapipe/8075, as I'm not sure whether this is the better place to submit bug reports.

This line in pytorch/data's prefetch.py (at commit 4ea88d1fb4d279def9213a23b054b4e7d46d5b3d) times out when using the TorchTrainer. This means the training script never runs, since it gets stuck initializing the dataloader.
This happens when using a data loader built from torchdata datapipes, with the fullsync datapipe at the end. This is a significant problem because it prevents multi-epoch training when using torchdata, PyTorch's new data-loading mechanism (fullsync is necessary for multi-epoch training when each rank's datapipe can have a different length).
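For concreteness, the pipeline shape in question looks roughly like the following. This is an illustrative sketch, not my actual pipeline; the dataset and intermediate steps are placeholders:

```python
from torchdata.datapipes.iter import IterableWrapper

# Illustrative datapipe chain (placeholder data): shard the stream across
# ranks, then attach fullsync at the end so all ranks stop iterating together
# even when their shards end up with different lengths.
pipe = (
    IterableWrapper(range(1000))
    .sharding_filter()  # split elements across distributed ranks
    .fullsync()         # FullSyncIterDataPipe from torchdata's prefetch.py
)
```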
Versions / Dependencies

OS:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
Deps:
Docker image: rayproject/ray:6f5f1e-py38-cu116
Installed with: --extra-index-url https://download.pytorch.org/whl/nightly/cu116
Reproduction script
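The original script didn't carry over from the forum post. The sketch below is a minimal stand-in showing the failing setup, assuming a two-worker GPU TorchTrainer and a placeholder datapipe in place of the real dataset; none of the specific values are from the original report:

```python
# Minimal sketch (not the original script): a TorchTrainer whose training
# loop builds a datapipe ending in fullsync. The dataset, worker count, and
# epoch count are placeholders.
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from torchdata.datapipes.iter import IterableWrapper


def train_loop_per_worker():
    pipe = (
        IterableWrapper(range(1000))
        .sharding_filter()
        .fullsync()  # hangs here: the process-group setup in prefetch.py times out
    )
    for _epoch in range(2):
        for _batch in pipe:
            pass  # real training step elided


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()
```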
Issue Severity
High: It blocks me from completing my task.