There is a high chance that your nodes are set to localhost for the connection. You might set GLOO_SOCKET_IFNAME to make multi-node training work (for example, GLOO_SOCKET_IFNAME=eth0).
Many thanks! Adding an extra line export GLOO_SOCKET_IFNAME=eth0 in my launching script fixed my issue.
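For anyone who wants to apply the same fix from Python rather than the launching script, here is a minimal sketch; eth0 is only an example interface name, and it assumes torchrun-style environment variables (MASTER_ADDR, RANK, etc.) are already set:

```python
import os

import torch.distributed as dist

# Gloo reads GLOO_SOCKET_IFNAME when the process group is created, so set it
# before init_process_group() (and before DataLoader2 starts its distributed
# reading service).
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")  # example interface name

dist.init_process_group(backend="gloo")  # env:// rendezvous via torchrun variables
```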
Hi, how can I determine what value my GLOO_SOCKET_IFNAME environment variable should be set to?
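On Linux, ip addr lists each interface together with its IP addresses; GLOO_SOCKET_IFNAME should name the interface that carries the IP your nodes use to reach each other (not the loopback). The sketch below does the same lookup programmatically; it assumes the third-party psutil package is available and that you pass the IP of another node in your cluster:

```python
import socket

import psutil  # third-party: pip install psutil (assumed available on the nodes)


def guess_gloo_iface(peer_ip: str) -> str:
    """Return the interface name this machine would use to reach peer_ip."""
    # A UDP "connect" sends no packets; it only asks the kernel which local
    # address it would route from when talking to peer_ip.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((peer_ip, 1))
        local_ip = s.getsockname()[0]

    # Match that local IP back to an interface name.
    for ifname, addrs in psutil.net_if_addrs().items():
        if any(a.family == socket.AF_INET and a.address == local_ip for a in addrs):
            return ifname
    raise RuntimeError(f"no interface owns {local_ip}")


# Hypothetical IP of another node in the cluster; prints e.g. "eth0" or "ib0".
print(guess_gloo_iface("10.0.0.2"))
```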
🐛 Describe the bug
Hi, we are training on data in the webdataset format with torchdata. Everything works fine on a single-node machine. We then moved to a multi-node cluster and hit the following error. I manually cleaned up some of the errors from the other ranks, so please tell me if I missed anything important.
My datapipes:
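(For context only, since the exact pipeline is not reproduced here: a minimal sketch of a webdataset-style torchdata 0.6 pipeline of the kind described above. The shard path, file mask, and worker count are hypothetical, not the reporter's actual code.)

```python
import torchdata.datapipes.iter as dp
from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,
)

# Hypothetical shard location; webdataset samples are stored as .tar shards.
datapipe = (
    dp.FileLister("/data/shards", masks="*.tar")
    .shuffle()             # shuffle shard order
    .sharding_filter()     # split shards across ranks and workers
    .open_files(mode="b")
    .load_from_tar()
    .webdataset()          # regroup tar members into per-sample dicts
)

# DistributedReadingService coordinates ranks via torch.distributed (a gloo
# group under the hood), which is where GLOO_SOCKET_IFNAME matters on
# multi-node runs.
reading_service = SequentialReadingService(
    DistributedReadingService(),
    MultiProcessingReadingService(num_workers=4),
)
loader = DataLoader2(datapipe, reading_service=reading_service)
```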
I tried:
- torchdata on a single node (1x8 A100): works
- the old DataLoader without datapipes on the 2x8 A100 cluster: works
- torchdata on the 2x8 A100 cluster: TCP error
Versions
Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==2.0.0
[pip3] torchdata==0.6.0
[pip3] torchmetrics==0.7.3
[pip3] torchscale==0.2.0
[pip3] torchvision==0.15.0
[pip3] triton==2.0.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.7.0 hd8887f6_10 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libblas 3.9.0 12_linux64_mkl conda-forge
[conda] liblapack 3.9.0 12_linux64_mkl conda-forge
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.23.5 py38h14f4228_0
[conda] numpy-base 1.23.5 py38h31eccc5_0
[conda] pytorch 2.0.0 py3.8_cuda11.7_cudnn8.5.0_0 pytorch
[conda] pytorch-cuda 11.7 h778d358_3 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchdata 0.6.0 py38 pytorch
[conda] torchscale 0.2.0 pypi_0 pypi
[conda] torchtriton 2.0.0 py38 pytorch