twitter-archive / torch-ipc

A set of primitives for parallel computation in Torch
Apache License 2.0

Issues running examples #40

Closed juesato closed 7 years ago

juesato commented 7 years ago

Hi,

Thanks for releasing this library - it looks awesome! I hit the same issue mentioned at https://github.com/twitter/torch-ipc/issues/26 when I run allgpu-allreduce, but it occurs even without other jobs running.

t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ th allgpu-allreduce.lua 
Found 8 GPUs, forking children...   
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU7
INFO: torch-ipc: CUDA IPC not possible between GPU7 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU6
INFO: torch-ipc: CUDA IPC not possible between GPU6 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU3
INFO: torch-ipc: CUDA IPC enabled between GPU3 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU2
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU5
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU4
INFO: torch-ipc: CUDA IPC not possible between GPU5 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU4 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU2 and GPU0
/home/t-jouesa/torch_installs/torch_rbgk40/install/bin/luajit: ...orch_rbgk40/install/share/lua/5.1/ipc/DiscoveredTree.lua:15: ERROR: (/home/t-jouesa/code/torch-ipc/src/cliser.c, 336): (9, Bad file descriptor)

stack traceback:
    [C]: in function 'client'
    ...orch_rbgk40/install/share/lua/5.1/ipc/DiscoveredTree.lua:15: in function 'LocalhostTree'
    allgpu-allreduce.lua:39: in main chunk
    [C]: in function 'dofile'
    ...rch_rbgk40/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
/home/t-jouesa/torch_installs/torch_rbgk40/install/bin/luajit: ...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:36: ERROR: (/home/t-jouesa/code/torch-ipc/src/cliser.c, 446): (server timed out waiting for clients to connect)

stack traceback:
    [C]: in function 'clients'
    ...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:36: in function 'initialServer'
    ...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:136: in function 'LocalhostTree'
    allgpu-allreduce.lua:39: in main chunk
    [C]: in function 'dofile'
    ...rch_rbgk40/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

I'm able to run examples/allreduce.lua with the CUDA option, and it doesn't hang.

The second issue is that in both cases, not all of the GPUs allow CUDA IPC. This appears similar to https://github.com/twitter/torch-ipc/issues/17, but I don't seem to have ACS enabled, judging by commands like these, which print nothing:

t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 07:00.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 08:08.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 08:10.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 0b:00.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 0c:08.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 0c:10.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 21:00.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 22:01.0 -vvvv|grep -i acs
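
The empty output above can be ambiguous: when lspci runs without root it may report capabilities as access denied, so no ACS lines would appear either way. As a rough sketch, the check could be made explicit by parsing the `ACSCtl:` line that `lspci -vvv` prints for devices with Access Control Services. The helper name `acs_flags_enabled` and the sample line are illustrative assumptions, not output from this machine.

```python
import re

# lspci -vvv prints the ACS control flags on a line such as:
#   ACSCtl: SrcValid+ TransBlk- ReqRedir+ ...
# A trailing '+' means the flag is set, '-' means it is clear.
ACSCTL = re.compile(r"ACSCtl:\s*(.*)")


def acs_flags_enabled(lspci_text):
    """Return the names of ACSCtl flags marked '+' in lspci -vvv output."""
    enabled = []
    for line in lspci_text.splitlines():
        match = ACSCTL.search(line)
        if match:
            enabled += [f[:-1] for f in match.group(1).split() if f.endswith("+")]
    return enabled


# Illustrative sample line (an assumption, not taken from the machine above).
sample = ("\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ CmpltRedir+ "
          "UpstreamFwd+ EgressCtrl- DirectTrans-")

print(acs_flags_enabled(sample))  # flags that are switched on
print(acs_flags_enabled(""))      # empty input, as in the transcript above
```

An empty list means only that no enabled ACS flag was found in the text given. It does not prove ACS is off unless the lspci output was captured with enough privileges to list capabilities.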
juesato commented 7 years ago

Sorry, my bad. The messages make total sense here: GPUs 0-3 are connected to each other, and GPUs 4-7 are connected to each other, but the two groups are not connected to one another. Closing.
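
For anyone reading a similar log, here is a small sketch of how the "CUDA IPC enabled" lines could be turned into connectivity groups. It is not part of torch-ipc; the function name `ipc_groups` is a hypothetical helper, and the log excerpt only covers the pairs involving GPU0 that were printed before the crash, so the remaining GPUs show up as singletons rather than as their real grouping.

```python
import re
from collections import defaultdict

# Matches the "enabled" INFO lines printed in the log above.
ENABLED = re.compile(r"CUDA IPC enabled between GPU(\d+) and GPU(\d+)")


def ipc_groups(log_text, num_gpus):
    """Group GPU indices that are linked by 'CUDA IPC enabled' lines."""
    parent = list(range(num_gpus))  # union-find forest, one node per GPU

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in ENABLED.findall(log_text):
        ra, rb = find(int(a)), find(int(b))
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = defaultdict(list)
    for gpu in range(num_gpus):
        groups[find(gpu)].append(gpu)
    return sorted(groups.values())


# Excerpt of the log from this issue (only pairs involving GPU0 were printed).
log = """\
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU7
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU3
INFO: torch-ipc: CUDA IPC enabled between GPU3 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU2
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU4
INFO: torch-ipc: CUDA IPC enabled between GPU2 and GPU0
"""

print(ipc_groups(log, 8))  # [[0, 2, 3], [1], [4], [5], [6], [7]]
```

With a complete log covering every GPU pair, the same grouping would be expected to show the two sets described above.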