Closed: Sunnydreamrain closed this issue 7 years ago.
Did you install NCCL before compiling libgpuarray and pygpu? If not, reinstall them from scratch, including calling cmake, and check the log to make sure it detects NCCL.
Make sure to use CUDA and not OpenCL.
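For reference, a rebuild along those lines might look like the sketch below. The install prefixes are assumptions (NCCL under /usr/local/nccl here); adjust them to wherever NCCL actually lives on your system. CMAKE_LIBRARY_PATH and CMAKE_INCLUDE_PATH are standard CMake variables for extending the library/header search paths.

```shell
# Rebuild libgpuarray from scratch so cmake re-detects NCCL.
# /usr/local/nccl is an assumed prefix; change it to match your install.
cd libgpuarray
rm -rf build && mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local \
         -DCMAKE_LIBRARY_PATH=/usr/local/nccl/lib \
         -DCMAKE_INCLUDE_PATH=/usr/local/nccl/include
# Inspect cmake's output here: it should report that NCCL was found.
# If it was not, the resulting library has no collective operations,
# which is exactly the "No collective ops available" warning above.
make && sudo make install

# Then rebuild the Python bindings against the fresh library.
cd .. && python setup.py build && python setup.py install
```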
On Sun, Sep 25, 2016 at 4:44 AM, Sunnydreamrain notifications@github.com wrote:
Hi all,
I tried to run the new version of Platoon. It gives the following error.
WARNING! Failed to register in a local GPU comm world. Reason: No collective ops available, API error. Is a collectives library installed?
I think it is because of this line: self._local_id = gpucoll.GpuCommCliqueId(context=self.gpuctx)
I have installed pygpu and nccl already.
Any idea?
Thanks a lot.
I reinstalled NCCL, libgpuarray and pygpu. It works now. Thanks a lot.
Hi, I reinstalled nccl, libgpuarray, pygpu, and platoon one by one, and it is still not fixed. Can anybody give me some advice?
I can't remember all the details on this, but I think it is related to the system or Python search paths (LD_LIBRARY_PATH or PYTHONPATH). When you install each one of them, make sure the others can detect it, or add the necessary paths when you run your program.
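To make that path check concrete, here is a small diagnostic sketch (the helper name find_library and the checked library names are my own, not part of Platoon) that scans a colon-separated path list for shared libraries such as libnccl:

```python
import os

def find_library(name, search_path):
    """Return the directories from a colon-separated path list that
    contain a file whose name starts with `name` (e.g. libnccl.so.1)."""
    hits = []
    for d in search_path.split(":"):
        if not d or not os.path.isdir(d):
            continue
        if any(f.startswith(name) for f in os.listdir(d)):
            hits.append(d)
    return hits

# Example: check the current environment for the libraries Platoon needs.
for lib in ("libnccl", "libgpuarray"):
    dirs = find_library(lib, os.environ.get("LD_LIBRARY_PATH", ""))
    print(lib, "->", dirs if dirs else "NOT FOUND on LD_LIBRARY_PATH")
```

Running this in the same environment as the worker quickly shows whether the dynamic loader can actually see NCCL and libgpuarray.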
Thank you for your reply @Sunnydreamrain! I added /usr/local/nccl/lib to LD_LIBRARY_PATH, and the PYTHONPATH is correct. Then I installed libgpuarray and pygpu, but it still doesn't work.
BTW, in a single-node scenario there is no such error, while the multi-node scenario fails.
Could you check whether the path variables are indeed correctly set on each node? Also (just as a reminder), is NCCL installed on both nodes?
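A quick way to run that check on each node is a tiny shell helper like the one below (path_has is a hypothetical name; run it locally, or via ssh on each remote node):

```shell
# Check whether a directory appears in a colon-separated path variable.
path_has() {
  case ":$2:" in
    *":$1:"*) echo "yes" ;;
    *)        echo "no"  ;;
  esac
}

path_has /usr/local/nccl/lib "/usr/lib:/usr/local/nccl/lib"   # prints "yes"
path_has /usr/local/nccl/lib "/usr/lib"                       # prints "no"
```

On a real node you would call it as path_has /usr/local/nccl/lib "$LD_LIBRARY_PATH", e.g. through ssh node1 '…', to confirm the variable is set in the environment the remote workers actually inherit.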
Shanbo, I have an idea. Maybe line 128 (https://github.com/cshanbo/platoon/blob/fix/multi-node/scripts/platoon-launcher#L128) from platoon-launcher is at fault. Please comment it out and check if the path variables are OK. I believe that it messes up the path variables for the other nodes. You could probably keep the THEANO_FLAGS variable, though.
command += shlex.split(" -x " + theano_flags)
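For context on what that line does: shlex.split turns the string into the two mpirun arguments -x and THEANO_FLAGS=..., and OpenMPI's -x flag exports only the variables you list to the remote processes. A minimal sketch (the flag value and the surrounding command are hypothetical):

```python
import shlex

# Hypothetical THEANO_FLAGS value, as platoon-launcher might pass it along.
theano_flags = "THEANO_FLAGS=device=cuda0,floatX=float32"

command = ["mpirun", "-np", "2"]
command += shlex.split(" -x " + theano_flags)
print(command)
# ['mpirun', '-np', '2', '-x', 'THEANO_FLAGS=device=cuda0,floatX=float32']
```

Because only the listed variable is forwarded, remote workers may end up without the PATH/LD_LIBRARY_PATH they need, which is why this line is a plausible culprit for the multi-node failure.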
Hi Christos,
I commented out that line and kept THEANO_FLAGS, then ran the single-node version, and an exception occurred:
/usr/bin/ld: cannot find -lcudnn
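That ld error means the linker cannot locate libcudnn.so in its search path. A sketch of the usual fixes follows; /usr/local/cudnn is an assumed prefix, and dnn.library_path / dnn.include_path are Theano config flags for pointing at a non-standard cuDNN location:

```shell
# Assumed cuDNN location; replace with your actual install prefix.
export LIBRARY_PATH=/usr/local/cudnn/lib64:$LIBRARY_PATH       # link-time search
export LD_LIBRARY_PATH=/usr/local/cudnn/lib64:$LD_LIBRARY_PATH # run-time search
export CPATH=/usr/local/cudnn/include:$CPATH                   # header search

# Or tell Theano about it explicitly:
export THEANO_FLAGS="$THEANO_FLAGS,dnn.library_path=/usr/local/cudnn/lib64,dnn.include_path=/usr/local/cudnn/include"
```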
Then I did some tests and kept THEANO_FLAGS, PATH, LD_LIBRARY_PATH and PLATOON_TEST_WORKER_NUM to make the single-node version work, while the multi-node version still raises the same exception.
The example/lstm works fine on each node separately, in a single-node scenario.
Thank you. Shanbo
The single-node scenario should not reach that code; are you sure you were running in a single-node scenario? I want to check something regarding environment variables in OpenMPI, and I will fix this. Could you please check the comments I made at tsirif/platoon#3, so I can merge and take your changes into account too?
Thank you, Christos
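On the OpenMPI side, the relevant behavior is that only variables named with -x are exported to remote ranks. A hypothetical manual launch that forwards everything a remote worker is likely to need (hosts and script name are placeholders):

```shell
# -x VAR (no "=value") forwards the variable's current local value.
mpirun -np 2 -host node1,node2 \
       -x THEANO_FLAGS -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH \
       python worker.py
```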
Hi,
I thought I ran a single-node experiment, but it was in fact a multi-node one, because I used .platoonrc with platoon-launcher without setting the --multi argument.