Open swd543 opened 3 years ago
Are you running on AWS? What does ibstat output?
No, I am not running on AWS, but on a cluster managed by my university. ibstat returns the following -
CA 'hfi1_0'
CA type:
Number of ports: 1
Firmware version: 1.27.0
Hardware version: 11
Node GUID: 0x001175090105651e
System image GUID: 0x001175090105651e
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 1068
LMC: 0
SM lid: 27
Capability mask: 0x00490020
Port GUID: 0x001175090105651e
Link layer: InfiniBand
Your interface's name, hfi, suggests it is an Intel® Omni-Path Host Fabric Interface adapter. This is the first time I have heard of such a device. It seems to suffer from the same issue that affects the EFA devices on AWS: it claims it can be used as an InfiniBand device, which "tricks" TensorPipe into trying to use it as such, but it doesn't support some of the features that TensorPipe requires.
You can find a similar issue in https://github.com/pytorch/pytorch/issues/65022, where we also suggest a workaround until we fix this autodetection logic.
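As a rough illustration of how one might check for this situation by hand, the sketch below parses ibstat output and flags adapters whose name starts with hfi, as in the output above. The helper names parse_ca_names and looks_like_omni_path are hypothetical, and matching on the hfi prefix is an assumption based only on this thread. It is not how TensorPipe detects devices.

```python
import re
from typing import List

# Abbreviated sample in the shape of the ibstat output posted above.
SAMPLE_IBSTAT = """\
CA 'hfi1_0'
    CA type:
    Number of ports: 1
    Port 1:
        State: Active
        Link layer: InfiniBand
"""


def parse_ca_names(ibstat_output: str) -> List[str]:
    """Return adapter names from lines of the form: CA 'name'."""
    return re.findall(r"^\s*CA '([^']+)'", ibstat_output, flags=re.MULTILINE)


def looks_like_omni_path(ca_name: str) -> bool:
    # Assumption: Omni-Path HFI adapters are listed with an "hfi" prefix.
    return ca_name.startswith("hfi")


names = parse_ca_names(SAMPLE_IBSTAT)
print(names)
print([looks_like_omni_path(name) for name in names])
```

If such a check reports an hfi adapter, restricting the transports as described in the linked issue is the workaround to try.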
@lw Thanks, your workaround worked. I changed my code to the following -
from os import environ
import torch.distributed.rpc as rpc

environ['TP_SOCKET_IFNAME'] = 'lo'
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
    init_method="file:///tmp/b.txt",
    _transports=["uv"],
))
Both processes use the same file. This really is weird, as the interface should not claim to be an InfiniBand device if it cannot support all the features. Maybe it is just too old? Is it possible to bypass the interface altogether and use loopback for speed? I do not need multiple nodes.
To be fair, it's an optional feature that TensorPipe is using, hence it's "acceptable" to not support it, though all the actual InfiniBand devices we saw did support it.
If all your processes are on the same node, you can use _transports=["shm", "uv"], which will add another backend optimized for same-machine communication.
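For reference, here is a minimal sketch of how the two suggestions in this thread could be combined for a single-node, two-worker setup. The helper names backend_kwargs, run_worker and launch are hypothetical, and the temporary rendezvous file is an assumption (the snippet posted earlier used /tmp/b.txt). Treat it as a starting point, not a verified fix for this cluster.

```python
import os
import tempfile


def backend_kwargs(init_file: str, same_node: bool = True) -> dict:
    """Build keyword arguments for rpc.TensorPipeRpcBackendOptions.

    "uv" is the TCP transport from the workaround; "shm" is added
    only when all processes run on the same machine.
    """
    transports = ["shm", "uv"] if same_node else ["uv"]
    return {"init_method": f"file://{init_file}", "_transports": transports}


def run_worker(rank: int, world_size: int, init_file: str) -> None:
    # Imported here so backend_kwargs can be inspected without PyTorch.
    import torch.distributed.rpc as rpc

    # Pin the TensorPipe socket interface to loopback, as in the workaround.
    os.environ["TP_SOCKET_IFNAME"] = "lo"
    options = rpc.TensorPipeRpcBackendOptions(**backend_kwargs(init_file))
    rpc.init_rpc(
        f"worker{rank}",
        rank=rank,
        world_size=world_size,
        rpc_backend_options=options,
    )
    rpc.shutdown()  # waits until every worker is done


def launch(world_size: int = 2) -> None:
    import torch.multiprocessing as mp

    # The rendezvous file must not exist yet, so use a fresh directory.
    init_file = os.path.join(tempfile.mkdtemp(), "rpc_init")
    # mp.spawn passes the process index as the first argument (the rank).
    mp.spawn(run_worker, args=(world_size, init_file), nprocs=world_size)
```

Call launch() from under an `if __name__ == "__main__":` guard, since torch.multiprocessing.spawn starts fresh interpreter processes that re-import the main module.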
Thank you! If you are planning to support this device soon, I would be very glad to help with debugging. Maybe this would also solve the issue on AWS machines.
I am trying to run an RPC init example on a fair cluster with one machine and two workers. When I run the code, both processes throw the same error -
I narrowed it down by building the latest version of TensorPipe on the cluster and running its tests. Specifically, the Ibv tests fail.
Specifically this line
What I tried
Setting GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME, tests still fail
nslookup $(hostname) - Non-authoritative answer: Name: login18.hpc.itc.uk Address: 134.61.193.185
Process 2
Please let me know if I am doing something stupendously erroneous, or how to debug this issue further. Since I am running on a managed cluster, I am not able to change system packages or use sudo. I see from your CircleCI tests that none of them have this error, so I assume it is very likely an issue specific to my cluster.