pytorch / tensorpipe

A tensor-aware point-to-point communication primitive for machine learning

"errorCouldn't get list of InfiniBand devices: ibv_get_device_list: Unknown error -38" #306

Closed jeremysalwen closed 3 years ago

jeremysalwen commented 3 years ago

With PyTorch 1.7.1, I was able to successfully initialize the RPC context.

With the PyTorch nightly (1.9.0.dev20210223), initialization fails with:

  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/hearthstone/training/pytorch/worker/distributed/worker_pool.py", line 54, in __init__
    rpc.init_rpc(INFERENCE_PROCESS_NAME, rank=0, world_size=num_workers+1)
  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 194, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 230, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 278, in _tensorpipe_init_backend_handler
    api._init_rpc_states(agent)
  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 116, in _init_rpc_states
    _set_and_start_rpc_agent(agent)
RuntimeError: In create at /pytorch/third_party/tensorpipe/tensorpipe/transport/ibv/context_impl.cc:55 "errorCouldn't get list of InfiniBand devices: ibv_get_device_list: Unknown error -38"

CUDA is working correctly otherwise; for example, I can run

>>> torch.tensor([1,2], device=torch.device('cuda'))

successfully.

lw commented 3 years ago

Thanks for reporting this. This is a known issue, which we've already fixed in https://github.com/pytorch/tensorpipe/commit/0f7673ba421928490deeb35a35a01605d3d3273a. We haven't yet updated TensorPipe's submodule in PyTorch, which is why you're still seeing it in the nightlies. We expect to be able to do so by the end of the week.

Note that a "workaround" would be to update the version of libibverbs on your machine, if that's an option. Anything from v25 onwards should work.
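To check whether an installed libibverbs already meets that threshold, something like the sketch below might help. It only compares a version string against the v25 cutoff mentioned above. The function names are hypothetical, and how you obtain the version string (typically from your distribution's package manager) is an assumption left to the reader.

```python
import re

# Threshold taken from the comment above: libibverbs v25 or later should work.
MIN_WORKING_MAJOR = 25


def libibverbs_major(version: str) -> int:
    """Return the leading major number of a version string.

    Accepts forms such as '25.0', 'v28.0-1' or '1.2.1' (hypothetical inputs;
    the exact format depends on how your distribution reports the version).
    """
    match = re.match(r"\s*v?(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return int(match.group(1))


def meets_workaround_version(version: str) -> bool:
    """True if the given libibverbs version is at or above the v25 cutoff."""
    return libibverbs_major(version) >= MIN_WORKING_MAJOR


if __name__ == "__main__":
    # Example version strings, not taken from any real machine.
    for candidate in ("17.1", "25.0", "v28.0-1"):
        print(candidate, meets_workaround_version(candidate))
```

If the check returns False and upgrading libibverbs is not an option, waiting for a nightly that includes the TensorPipe fix is the other route described above.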