pytorch / tensorpipe

A tensor-aware point-to-point communication primitive for machine learning

init_rpc fails on the latest build due to tensorpipe #413

Open swd543 opened 3 years ago

swd543 commented 3 years ago

I am trying to run an RPC init example on a cluster, with one machine and two workers. On running the code, both processes throw the same error:

```
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    rpc.init_rpc("worker0", rank=0, world_size=2)
  File "/home/wi743774/.conda/envs/itorch/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 203, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/home/wi743774/.conda/envs/itorch/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 243, in _init_rpc_backend
    rpc_backend_options=rpc_backend_options,
  File "/home/wi743774/.conda/envs/itorch/lib/python3.7/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/home/wi743774/.conda/envs/itorch/lib/python3.7/site-packages/torch/distributed/rpc/backend_registry.py", line 313, in _tensorpipe_init_backend_handler
    api._init_rpc_states(agent)
  File "/home/wi743774/.conda/envs/itorch/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 117, in _init_rpc_states
    _set_and_start_rpc_agent(agent)
RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
```

I narrowed it down by building the latest version of tensorpipe on the cluster and running its test suite. The Ibv (InfiniBand verbs) tests specifically fail:

```
[----------] 11 tests from Ibv/TransportTest
[ RUN      ] Ibv/TransportTest.Context_Basics/0
unknown file: Failure
C++ exception with description "In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument" thrown in the test body.
[  FAILED  ] Ibv/TransportTest.Context_Basics/0, where GetParam() = 0xb1ae88 (3 ms)
[ RUN      ] Ibv/TransportTest.Context_DomainDescriptor/0
unknown file: Failure
C++ exception with description "In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument" thrown in the test body.
[  FAILED  ] Ibv/TransportTest.Context_DomainDescriptor/0, where GetParam() = 0xb1ae88 (0 ms)
[ RUN      ] Ibv/TransportTest.Connection_Initialization/0
terminate called after throwing an instance of 'std::system_error'
  what():  In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
zsh: abort (core dumped)  ./tensorpipe/test/tensorpipe_test
```

Specifically, it is this call in tensorpipe/common/ibv.h that throws:

```cpp
inline IbvSharedReceiveQueue createIbvSharedReceiveQueue(
    const IbvLib& ibvLib,
    IbvProtectionDomain& pd,
    IbvLib::srq_init_attr& initAttr) {
  return IbvSharedReceiveQueue(
      TP_CHECK_IBV_PTR(ibvLib.create_srq(pd.get(), &initAttr)),  // <-- throws here
      IbvSharedReceiveQueueDeleter{&ibvLib});
}
```

### What I tried

The master address resolves as follows:

```
Non-authoritative answer:
Name:    login18.hpc.itc.uk
Address: 134.61.193.185
```

- Tried all the interfaces: eth0, eth1, and lo
- Tried MASTER_ADDR as both localhost and the resolved IP address
- Tried running a single-node, single-process world; same error

### Environment
- pytorch 1.9.1
- cuda 11.2
- gcc 11.2
- CentOS Linux release 7.9.2009 (Core)
- python 3.7

### Test code

Process 1
```python
import torch
from torch.distributed import rpc
from os import environ

environ['MASTER_ADDR'] = 'localhost'
environ['MASTER_PORT'] = '29567'
environ['CUDA_VISIBLE_DEVICES'] = '0'
environ['TP_SOCKET_IFNAME']='lo'

rpc.init_rpc("worker0", rank=0, world_size=2)
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))
print(ret)
rpc.shutdown()
```

Process 2

```python
import torch.distributed.rpc as rpc
from os import environ

environ['MASTER_ADDR'] = 'localhost'
environ['MASTER_PORT'] = '29567'
environ['CUDA_VISIBLE_DEVICES'] = '1'
environ['TP_SOCKET_IFNAME']='lo'

rpc.init_rpc("worker1", rank=1, world_size=2)
rpc.shutdown()
```

Please let me know if I am doing something stupendously erroneous, or how I can further debug this issue. Since I am running on a managed cluster, I am not able to change system packages or use sudo. I see from your CircleCI tests that none of them hit this error, so I assume it is very likely an issue specific to my cluster.

lw commented 3 years ago

Are you running on AWS? What does ibstat output?

swd543 commented 3 years ago

No, I am not running on AWS, but on a cluster managed by my university. ibstat returns the following:

```
CA 'hfi1_0'
        CA type:
        Number of ports: 1
        Firmware version: 1.27.0
        Hardware version: 11
        Node GUID: 0x001175090105651e
        System image GUID: 0x001175090105651e
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 1068
                LMC: 0
                SM lid: 27
                Capability mask: 0x00490020
                Port GUID: 0x001175090105651e
                Link layer: InfiniBand
```

lw commented 3 years ago

Your interface's name, hfi, suggests an "Intel® Omni-Path Host Fabric Interface Adapter". This is literally the first time I hear about such a device. It seems to suffer from the same issue that affects the EFA devices on AWS: it claims it can be used as an InfiniBand device, and this "tricks" TensorPipe into trying to use it as one, but it doesn't support some of the features that TensorPipe requires.

You can find a similar issue in https://github.com/pytorch/pytorch/issues/65022, where we also suggest a workaround until we fix this autodetection logic.

swd543 commented 3 years ago

@lw Thanks, your workaround worked. I changed my code to the following:

```python
environ['TP_SOCKET_IFNAME']='lo'

rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
        init_method="file:///tmp/b.txt",
        _transports=["uv"],
))
```

Both processes use the same file. This really is weird, as the interface should not claim to be an InfiniBand device while being unable to support all the features. Maybe it is just too old? Is it possible to bypass the interface altogether and use loopback for speed? I do not need multiple nodes.

lw commented 3 years ago

To be fair, it's an optional feature that TensorPipe uses, hence it's "acceptable" not to support it, though all the actual InfiniBand devices we had seen did support it.

If all your processes are on the same node, you can use `_transports=["shm", "uv"]`, which adds another backend optimized for same-machine communication.

swd543 commented 3 years ago

Thank you! If you are planning to support this device soon, I would be very glad to help debug. Maybe that would also solve this issue on AWS machines.