pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:370 : Failed to meet rendezvous 'test_accuracy': Trying to connect an http1.x server (14) #3126

Closed. ZhongYFeng closed this issue 3 years ago.

ZhongYFeng commented 3 years ago

❓ Questions and Help

When I run /pytorch/torch-xla/test/test_train_mp_mnist.py, I get an exception. What is the cause of this problem?

python test_train_mp_mnist.py --num_cores=3

Training Device=xla:0/0 Step=0 Loss=2.39107 Rate=27.37 GlobalRate=27.37 Time=11:05:33
Training Device=xla:0/1 Step=0 Loss=2.39420 Rate=28.01 GlobalRate=28.01 Time=11:05:33
Training Device=xla:0/2 Step=0 Loss=2.42805 Rate=27.62 GlobalRate=27.62 Time=11:05:33
Training Device=xla:0/1 Step=20 Loss=1.86752 Rate=1280.14 GlobalRate=465.08 Time=11:05:34
Training Device=xla:0/0 Step=20 Loss=1.99102 Rate=1254.40 GlobalRate=454.68 Time=11:05:34
Training Device=xla:0/2 Step=20 Loss=1.95427 Rate=1274.89 GlobalRate=459.45 Time=11:05:34
Training Device=xla:0/1 Step=40 Loss=1.54216 Rate=6766.20 GlobalRate=871.00 Time=11:05:34
Training Device=xla:0/0 Step=40 Loss=1.51056 Rate=6721.72 GlobalRate=852.12 Time=11:05:34
Training Device=xla:0/2 Step=40 Loss=1.47586 Rate=6665.45 GlobalRate=860.33 Time=11:05:34
Training Device=xla:0/1 Step=60 Loss=1.13522 Rate=8958.13 GlobalRate=1245.11 Time=11:05:35
Training Device=xla:0/0 Step=60 Loss=1.07370 Rate=8975.93 GlobalRate=1219.42 Time=11:05:35
Training Device=xla:0/2 Step=60 Loss=1.15932 Rate=8869.88 GlobalRate=1230.08 Time=11:05:35
Training Device=xla:0/1 Step=80 Loss=0.82879 Rate=9760.77 GlobalRate=1590.29 Time=11:05:35
Training Device=xla:0/0 Step=80 Loss=0.82269 Rate=9856.13 GlobalRate=1559.52 Time=11:05:35
Training Device=xla:0/2 Step=80 Loss=0.82739 Rate=9663.45 GlobalRate=1571.21 Time=11:05:35
Training Device=xla:0/0 Step=100 Loss=0.66265 Rate=10221.77 GlobalRate=1875.58 Time=11:05:35
Training Device=xla:0/1 Step=100 Loss=0.73500 Rate=9654.47 GlobalRate=1904.90 Time=11:05:35
Training Device=xla:0/2 Step=100 Loss=0.72223 Rate=9613.25 GlobalRate=1882.91 Time=11:05:35
Training Device=xla:0/0 Step=120 Loss=0.46007 Rate=10375.11 GlobalRate=2170.06 Time=11:05:35
Training Device=xla:0/1 Step=120 Loss=0.57450 Rate=9590.68 GlobalRate=2195.38 Time=11:05:35
Training Device=xla:0/2 Step=120 Loss=0.50122 Rate=10086.75 GlobalRate=2177.71 Time=11:05:35
Training Device=xla:0/0 Step=140 Loss=0.40821 Rate=10433.10 GlobalRate=2445.00 Time=11:05:36
Training Device=xla:0/1 Step=140 Loss=0.56258 Rate=9817.09 GlobalRate=2468.39 Time=11:05:36
Training Device=xla:0/2 Step=140 Loss=0.46462 Rate=10155.16 GlobalRate=2451.17 Time=11:05:36
Epoch 1 train end 11:05:37
mesh_service_address: bms-v100-p00526411-0010:42921
2021-09-14 11:05:40.341023: I tensorflow/compiler/xla/xla_client/mesh_service.cc:318] Waiting to connect to client mesh master (300 seconds) bms-v100-p00526411-0010:42921
mesh_service_address: bms-v100-p00526411-0010:42921
2021-09-14 11:05:40.463757: I tensorflow/compiler/xla/xla_client/mesh_service.cc:318] Waiting to connect to client mesh master (300 seconds) bms-v100-p00526411-0010:42921
Exception in device=GPU:0: tensorflow/compiler/xla/xla_client/mesh_service.cc:370 : Failed to meet rendezvous 'test_accuracy': Trying to connect an http1.x server (14)
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 334, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/root/miniconda3/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 328, in _start_fn
    fn(gindex, *args)
  File "/mnt/zyf/pytorch/torch-xla/test/test_train_mp_mnist.py", line 179, in _mp_fn
    accuracy = train_mnist(flags)
  File "/mnt/zyf/pytorch/torch-xla/test/test_train_mp_mnist.py", line 160, in train_mnist
    accuracy = test_loop_fn(test_device_loader)
  File "/mnt/zyf/pytorch/torch-xla/test/test_train_mp_mnist.py", line 149, in test_loop_fn
    accuracy = xm.mesh_reduce('test_accuracy', accuracy, np.mean)
  File "/root/miniconda3/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 923, in mesh_reduce
    xdata = rendezvous(tag, bio.getvalue())
  File "/root/miniconda3/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 875, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:370 : Failed to meet rendezvous 'test_accuracy': Trying to connect an http1.x server (14)

Environment

Collecting environment information...
PyTorch version: 1.10.0a0+gitdca97b4
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.9

Python version: 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 18:21:58) [GCC 7.2.0] (64-bit runtime)
Python platform: Linux-4.4.0-21-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: NVIDIA Tesla V100-SXM2-32GB
GPU 1: NVIDIA Tesla V100-SXM2-32GB
GPU 2: NVIDIA Tesla V100-SXM2-32GB
GPU 3: NVIDIA Tesla V100-SXM2-32GB

Nvidia driver version: 465.19.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.15.4
[pip3] torch==1.10.0a0+gitdca97b4
[pip3] torch-xla==1.10
[pip3] torchvision==0.11.0a0+e1f22ed
[conda] mkl-include 2021.3.0
[conda] numpy 1.15.4
[conda] torch 1.10.0a0+gitdca97b4
[conda] torch-xla 1.10
[conda] torchvision 0.11.0a0+e1f22ed

JackCaoG commented 3 years ago

Hi, we don't really support --num_cores=3; on a v3-8 or v2-8 we only support running with 1 core or 8 cores. Can you use the updated config and retry? A sketch of the retry is below.
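A hedged sketch of that retry, assuming the test script forwards --num_cores to xmp.spawn in the usual way (this minimal launcher is illustrative, not the actual test file):

```python
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index, flags=None):
    # per-process train/eval loop (omitted)
    ...

if __name__ == '__main__':
    # Use a supported process count (1 or 8), e.g. the equivalent of
    # `python test_train_mp_mnist.py --num_cores=8` instead of --num_cores=3.
    xmp.spawn(_mp_fn, nprocs=8)
```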

ZhongYFeng commented 3 years ago

The problem was caused by a proxy set in the environment; it was solved after turning off the proxy.
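For readers hitting the same error, a minimal sketch of that workaround, assuming the rendezvous fails because the gRPC client picks up HTTP proxy settings from the environment and routes the mesh-service connection through the proxy (the variable names below are the conventional proxy variables, not taken from this issue):

```python
import os

# Turn off the proxy for the training process and its spawned workers by clearing
# the common proxy variables (alternatively, add the mesh master host to no_proxy),
# so the gRPC rendezvous connects to the mesh service directly.
for var in ('http_proxy', 'https_proxy', 'HTTP_PROXY', 'HTTPS_PROXY'):
    os.environ.pop(var, None)
```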