tcp connection refused when multinode training

rxqy commented 1 year ago

🐛 Describe the bug

Hi, we are training on data in WebDataset format with torchdata. Everything works fine on a single-node machine. When we move to a multi-node cluster, we get the following error. I manually cleaned up some of the errors from the other ranks, so please tell me if I removed anything important.

for i, data_batch in enumerate(self.data_loader):
  File "/root/miniconda3/lib/python3.8/site-packages/torchdata/dataloader2/dataloader2.py", line 209, in __iter__
    self.datapipe = self.reading_service.initialize(self.datapipe)
  File "/root/miniconda3/lib/python3.8/site-packages/torchdata/dataloader2/reading_service.py", line 487, in initialize
        self.datapipe = self.reading_service.initialize(self.datapipe)
  File "/root/miniconda3/lib/python3.8/site-packages/torchdata/dataloader2/reading_service.py", line 446, in initialize
    self._pg = dist.new_group(backend="gloo", timeout=timedelta(seconds=self._timeout))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3505, in new_group
        self.datapipe = self.reading_service.initialize(self.datapipe)
  File "/root/miniconda3/lib/python3.8/site-packages/torchdata/dataloader2/reading_service.py", line 446, in initialize
    self._pg = dist.new_group(backend="gloo", timeout=timedelta(seconds=self._timeout))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3505, in new_group
    pg = _new_process_group_helper(
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [/opt/conda/conda-bld/pytorch_1678402379298/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:8545: Connection refused
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
    RuntimeError    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout): backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[/opt/conda/conda-bld/pytorch_1678402379298/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:40808: Connection refused

RuntimeErrorRuntimeError: : [/opt/conda/conda-bld/pytorch_1678402379298/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:22354: Connection refused[/opt/conda/conda-bld/pytorch_1678402379298/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:39197: Connection refused

    pg = _new_process_group_helper(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 994, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [/opt/conda/conda-bld/pytorch_1678402379298/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:1373: Connection refused
        pg = _new_process_group_helper
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 994, in _new_process_group_helper
        backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [/opt/conda/conda-bld/pytorch_1678402379298/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:47407: Connection refusedRuntimeError
: [/opt/conda/conda-bld/pytorch_1678402379298/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:3895: Connection refused

My datapipes:

import glob

import torchdata.datapipes as dp  # assumed alias, matching the dp.iter.* calls below
from torchdata.dataloader2 import DataLoader2
from torchdata.dataloader2.reading_service import DistributedReadingService, MultiProcessingReadingService, SequentialReadingService

# rootdir, length, buffer_size, postprocess_func, batch_size and num_workers are defined elsewhere
files = sorted(glob.glob("{}/*.tar".format(rootdir)))
datapipe = dp.iter.FileLister(files)
datapipe = datapipe.shuffle().sharding_filter()
datapipe = dp.iter.FileOpener(datapipe, mode="rb")
datapipe = datapipe.load_from_tar(length=length).webdataset()
datapipe = datapipe.shuffle(buffer_size=buffer_size)

datapipe = datapipe.map(postprocess_func)
datapipe = datapipe.batch(batch_size=batch_size).collate()

# Distributed sharding across ranks first, then multiprocessing workers within each rank
mp_rs = MultiProcessingReadingService(num_workers=num_workers)
dist_rs = DistributedReadingService()
rs = SequentialReadingService(dist_rs, mp_rs)

data_loader = DataLoader2(datapipe, reading_service=rs)

I tried:
- torchdata on a single node (1x8 A100): works
- the old DataLoader without datapipes on a 2x8 A100 cluster: works
- torchdata on a 2x8 A100 cluster: TCP error

Versions

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==2.0.0
[pip3] torchdata==0.6.0
[pip3] torchmetrics==0.7.3
[pip3] torchscale==0.2.0
[pip3] torchvision==0.15.0
[pip3] triton==2.0.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.7.0 hd8887f6_10 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libblas 3.9.0 12_linux64_mkl conda-forge
[conda] liblapack 3.9.0 12_linux64_mkl conda-forge
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.23.5 py38h14f4228_0
[conda] numpy-base 1.23.5 py38h31eccc5_0
[conda] pytorch 2.0.0 py3.8_cuda11.7_cudnn8.5.0_0 pytorch
[conda] pytorch-cuda 11.7 h778d358_3 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchdata 0.6.0 py38 pytorch
[conda] torchscale 0.2.0 pypi_0 pypi
[conda] torchtriton 2.0.0 py38 pytorch

ejguan commented 1 year ago

There is a high chance that your nodes are resolving to localhost for the connection: the errors show Gloo trying to connect to 127.0.1.1, which is a loopback address. You can set GLOO_SOCKET_IFNAME to make multi-node training work, for example GLOO_SOCKET_IFNAME=eth0.
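One way to check this on each node is to look at what the machine's own hostname resolves to. The sketch below is not part of torchdata or PyTorch: `hostname_resolves_to_loopback` is a hypothetical helper name, and it assumes that Gloo, when no interface is specified, chooses its address by resolving the hostname. On many Debian/Ubuntu images, /etc/hosts maps the hostname to 127.0.1.1, which would match the address in the traceback.

```python
import ipaddress
import socket


def hostname_resolves_to_loopback(hostname=None):
    """Return (address, is_loopback) for hostname (default: this machine's hostname)."""
    hostname = hostname or socket.gethostname()
    # gethostbyname returns an IPv4 address string unchanged, otherwise resolves the name
    address = socket.gethostbyname(hostname)
    return address, ipaddress.ip_address(address).is_loopback
```

Calling `hostname_resolves_to_loopback()` with no argument on every node and getting a loopback address back would be consistent with the "Connection refused" errors, since other nodes cannot reach an address in 127.0.0.0/8.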

rxqy commented 1 year ago

Many thanks! Adding the line export GLOO_SOCKET_IFNAME=eth0 to my launch script fixed my issue.
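If editing the launch script is not convenient, the same variable can probably be set from Python instead, as long as that happens before the Gloo process group is created (in the traceback above, that is the `dist.new_group(backend="gloo", ...)` call inside the reading service). This is a minimal sketch under that assumption; `set_gloo_interface` is a hypothetical helper, and "eth0" is only the example value from this thread.

```python
import os


def set_gloo_interface(ifname, overwrite=False):
    """Set GLOO_SOCKET_IFNAME for this process and return the value in effect.

    An existing value (for example one exported by the launch script) is kept
    unless overwrite=True.
    """
    if overwrite or "GLOO_SOCKET_IFNAME" not in os.environ:
        os.environ["GLOO_SOCKET_IFNAME"] = ifname
    return os.environ["GLOO_SOCKET_IFNAME"]
```

It would be called once at the top of the training script on every rank, for example `set_gloo_interface("eth0")`, before the DataLoader2 is iterated.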

Adenialzz commented 1 year ago

Hi, how can I determine what value my GLOO_SOCKET_IFNAME environment variable should be set to?
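A possible way to find candidates is to list the network interfaces on each node together with their IPv4 addresses, and pick the one whose address the other nodes can reach (running `ip addr` in a shell shows the same information). The sketch below is Linux-only and uses only the standard library; `ipv4_of` and `candidate_interfaces` are hypothetical helper names, not PyTorch APIs, and the right interface still depends on how your cluster network is set up.

```python
import fcntl
import ipaddress
import socket
import struct

SIOCGIFADDR = 0x8915  # Linux ioctl request: get the IPv4 address of an interface


def ipv4_of(ifname):
    """Return the IPv4 address assigned to ifname, or None if there is none."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            request = struct.pack("256s", ifname.encode()[:15])
            packed = fcntl.ioctl(s.fileno(), SIOCGIFADDR, request)
        except OSError:
            # Unknown interface, or interface without an IPv4 address
            return None
    # Bytes 20..23 of the returned structure hold the address
    return socket.inet_ntoa(packed[20:24])


def candidate_interfaces():
    """List (name, ipv4) pairs for interfaces with a non-loopback IPv4 address."""
    result = []
    for _, name in socket.if_nameindex():
        addr = ipv4_of(name)
        if addr and not ipaddress.ip_address(addr).is_loopback:
            result.append((name, addr))
    return result
```

Printing `candidate_interfaces()` on each node gives the names that could be used as GLOO_SOCKET_IFNAME; the interface carrying the address that the other nodes use to reach this machine is the likely choice.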

pytorch/data issue #1142