occlum / occlum

Occlum is a memory-safe, multi-process library OS for Intel SGX
https://occlum.io/

[BUG] RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error #1604

Open Grace-byte912 opened 1 month ago

Grace-byte912 commented 1 month ago

Describe the bug

I hit this error when running a PyTorch RPC program inside Occlum.

root@tee-node35:/home/llm/RpcLLM/occlum_instance# occlum exec /bin/python3 /src/infer.py --rank 0

<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
work0 process init!
[E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
Traceback (most recent call last):
  File "/src/infer.py", line 262, in <module>
    rpc.init_rpc("worker0",
  File "/opt/python-occlum/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 200, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/opt/python-occlum/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 233, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/opt/python-occlum/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 104, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/opt/python-occlum/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 324, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/opt/python-occlum/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 112, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
Error: 256

# Expected behavior

These two RPC processes should be able to connect successfully.

# Environment

- HW: SGX2
- OS: Ubuntu 20.04
- Occlum version: the docker image occlum/occlum:latest-ubuntu20.04

qzheng527 commented 3 weeks ago

@Grace-byte912 Please enable Occlum trace or debug log for further investigation.
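
For anyone following along, here is a rough sketch of one way to capture such a log. It assumes the log level is controlled by the `OCCLUM_LOG_LEVEL` environment variable (values `off`, `error`, `warn`, `info`, `debug`, `trace`) as described in the Occlum README. The helper names and the log file name `occlum_trace.log` are made up for illustration. With the `occlum start` / `occlum exec` workflow the variable may need to be set when the instance is started rather than on `exec`; that part is an assumption worth checking against the docs.

```python
import os
import shutil
import subprocess

# Log levels as listed in the Occlum README (assumed complete).
ALLOWED_LEVELS = {"off", "error", "warn", "info", "debug", "trace"}


def with_occlum_log_level(level="trace", base_env=None):
    """Return a copy of the environment with OCCLUM_LOG_LEVEL set."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported log level: {level!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["OCCLUM_LOG_LEVEL"] = level
    return env


def run_and_capture(cmd, level="trace", log_path="occlum_trace.log"):
    """Run cmd with the log level set; write stdout and stderr to log_path."""
    env = with_occlum_log_level(level)
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode


if __name__ == "__main__":
    # Command taken from the report above; run from the occlum_instance directory.
    cmd = ["occlum", "exec", "/bin/python3", "/src/infer.py", "--rank", "0"]
    if shutil.which("occlum") is None:
        print("occlum not found on PATH; nothing was run")
    else:
        print("exit code:", run_and_capture(cmd))
```

The resulting file can then be attached to the issue. Setting the same variable directly on the shell command line should be equivalent; the wrapper only makes it easier to keep stdout and stderr together in one file.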