<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
work0 process init!
[E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
Traceback (most recent call last):
File "/src/infer.py", line 262, in <module>
rpc.init_rpc("worker0",
File "/opt/python-occlum/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 200, in init_rpc
_init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
File "/opt/python-occlum/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 233, in _init_rpc_backend
rpc_agent = backend_registry.init_backend(
File "/opt/python-occlum/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 104, in init_backend
return backend.value.init_backend_handler(*args, **kwargs)
File "/opt/python-occlum/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 324, in _tensorpipe_init_backend_handler
group = _init_process_group(store, rank, world_size)
File "/opt/python-occlum/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 112, in _init_process_group
group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
Error: 256
# Expected behavior
The two RPC processes should be able to connect to each other successfully.
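For reference, a minimal sketch of the kind of two-process setup meant here. This is not the actual `/src/infer.py`: the helper names `init_method_url` and `run_worker`, the address `127.0.0.1`, and the port `29500` are assumptions chosen for illustration. Only the `rpc.init_rpc` call, the worker name pattern, and the `--rank` argument come from the report.

```python
def init_method_url(host: str, port: int) -> str:
    """Build the TCP rendezvous URL that both ranks must share."""
    return f"tcp://{host}:{port}"


def run_worker(rank: int, host: str = "127.0.0.1", port: int = 29500) -> None:
    """Join a two-process RPC group as worker<rank> (hypothetical helper)."""
    # Imported here so the URL helper above can be used without PyTorch installed.
    import torch.distributed.rpc as rpc

    options = rpc.TensorPipeRpcBackendOptions(
        init_method=init_method_url(host, port),  # assumed address and port
    )
    # init_rpc blocks until all world_size processes have joined;
    # the Gloo connectFullMesh step in the traceback runs inside this call.
    rpc.init_rpc(
        f"worker{rank}",
        rank=rank,
        world_size=2,
        rpc_backend_options=options,
    )
    print(f"worker{rank} connected")
    # Wait for the peer to finish, then tear down the RPC agent.
    rpc.shutdown()
```

With a setup like this, one process would call `run_worker(0)` and the other `run_worker(1)`, and both calls are expected to return without raising.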
# Environment
- HW: SGX2
- OS: Ubuntu 20.04
- Occlum version: Docker image occlum/occlum:latest-ubuntu20.04
# Describe the bug
I hit this problem when running an RPC program built on PyTorch inside Occlum.
root@tee-node35:/home/llm/RpcLLM/occlum_instance# occlum exec /bin/python3 /src/infer.py --rank 0