Thanks for releasing this library - it looks awesome! I have the same issue mentioned at https://github.com/twitter/torch-ipc/issues/26 when I run allgpu-allreduce, but it occurs even without other jobs running:
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ th allgpu-allreduce.lua
Found 8 GPUs, forking children...
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU7
INFO: torch-ipc: CUDA IPC not possible between GPU7 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU6
INFO: torch-ipc: CUDA IPC not possible between GPU6 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU3
INFO: torch-ipc: CUDA IPC enabled between GPU3 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU2
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU5
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU4
INFO: torch-ipc: CUDA IPC not possible between GPU5 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU4 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU2 and GPU0
/home/t-jouesa/torch_installs/torch_rbgk40/install/bin/luajit: ...orch_rbgk40/install/share/lua/5.1/ipc/DiscoveredTree.lua:15: ERROR: (/home/t-jouesa/code/torch-ipc/src/cliser.c, 336): (9, Bad file descriptor)
stack traceback:
[C]: in function 'client'
...orch_rbgk40/install/share/lua/5.1/ipc/DiscoveredTree.lua:15: in function 'LocalhostTree'
allgpu-allreduce.lua:39: in main chunk
[C]: in function 'dofile'
...rch_rbgk40/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
/home/t-jouesa/torch_installs/torch_rbgk40/install/bin/luajit: ...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:36: ERROR: (/home/t-jouesa/code/torch-ipc/src/cliser.c, 446): (server timed out waiting for clients to connect)
stack traceback:
[C]: in function 'clients'
...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:36: in function 'initialServer'
...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:136: in function 'LocalhostTree'
allgpu-allreduce.lua:39: in main chunk
[C]: in function 'dofile'
...rch_rbgk40/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
I'm able to run examples/allreduce.lua with the CUDA option, and it doesn't hang.
The second issue is that, in both cases, not all of the GPUs allow CUDA IPC. This appears similar to https://github.com/twitter/torch-ipc/issues/17, but I don't seem to have APC enabled, based on running commands, like