twitter-archive / torch-ipc

A set of primitives for parallel computation in Torch
Apache License 2.0

Sometimes, torch-ipc cannot start successfully #26

Open chienlinhuang1116 opened 8 years ago

chienlinhuang1116 commented 8 years ago

Hi,

I want to train on 6 GPUs, which starts 6 luajit jobs. However, sometimes the system only starts 5 of them. For now, I restart the training whenever this happens. Do you have any idea what might cause this?

Thank you, Chien-Lin

chienlinhuang1116 commented 8 years ago

Hi,

We found that the cause is the following error: "/ipc/DiscoveredTree.lua:15: ERROR: (/home/chienh/big/twitter/torch-ipc/src/cliser.c, 318): (9, Bad file descriptor)".

This error only happens when the server is busy with other jobs. Do you have any idea why?

Thank you, Chien-Lin