mila-iqia / platoon

Multi-GPU mini-framework for Theano
MIT License

WARNING! Failed to register in a local GPU comm world. Reason: No collective ops available, API error. Is a collectives library installed? #72

Closed Sunnydreamrain closed 7 years ago

Sunnydreamrain commented 7 years ago

Hi all,

I tried to run the new version of Platoon. It gives the following error.

WARNING! Failed to register in a local GPU comm world. Reason: No collective ops available, API error. Is a collectives library installed?

I think it is caused by this line: self._local_id = gpucoll.GpuCommCliqueId(context=self.gpuctx)

I have installed pygpu and nccl already.

Any idea?

Thanks a lot.

nouiz commented 7 years ago

Did you install nccl before compiling libgpuarray and pygpu? If not, rebuild both from scratch, including re-running cmake. Check the build log to make sure it detects nccl.
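A quick sanity check after rebuilding can be sketched in Python. This is only a minimal sketch: the helper names `check_collectives` and `find_nccl` are made up for illustration, and a successful import is necessary but not sufficient, since the warning in this thread is raised later, when a `GpuCommCliqueId` is created on a GPU context.

```python
import ctypes.util
import importlib


def check_collectives():
    """Return (ok, message) saying whether pygpu.collectives can be imported."""
    try:
        importlib.import_module("pygpu.collectives")
    except Exception as exc:  # missing package, or a load error in the C extension
        return False, "pygpu.collectives failed to import: %r" % (exc,)
    return True, "pygpu.collectives imported"


def find_nccl():
    """Return the name of libnccl as seen by the loader, or None if not found."""
    return ctypes.util.find_library("nccl")


if __name__ == "__main__":
    ok, msg = check_collectives()
    print(msg)
    print("libnccl seen by the loader:", find_nccl())
```

If `find_nccl()` prints `None`, the library is probably not on the default search path, which would be consistent with libgpuarray not detecting it at build time.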

Make sure to use CUDA and not OpenCL.

Sunnydreamrain commented 7 years ago

I reinstalled nccl, libgpuarray and pygpu. It works now. Thanks a lot.

cshanbo commented 7 years ago

Hi, I reinstalled nccl, libgpuarray, pygpu and platoon one by one, and it is still not fixed. Can anybody give me some advice?

Sunnydreamrain commented 7 years ago

I can't remember all the details, but I think it is related to the system or Python search paths (LD_LIBRARY_PATH or PYTHONPATH). When you install each component, make sure the others can detect it, or add the necessary paths when you run your program.

cshanbo commented 7 years ago

Thank you for your reply, @Sunnydreamrain! I added /usr/local/nccl/lib to LD_LIBRARY_PATH, and PYTHONPATH is correct. Then I installed libgpuarray and pygpu, but it still doesn't work.
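To confirm that LD_LIBRARY_PATH really points at a directory holding the NCCL shared library, something like the following could be run on each machine. It is a sketch under assumptions: the helper name `dirs_with_library` is hypothetical, and the file name prefix `libnccl.so` is assumed to match how NCCL was installed.

```python
import glob
import os


def dirs_with_library(lib_prefix, search_path):
    """Return the directories of a path-separated list that contain lib_prefix*."""
    hits = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries produced by leading, trailing or doubled separators.
        if directory and glob.glob(os.path.join(directory, lib_prefix + "*")):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print("LD_LIBRARY_PATH =", path or "(unset)")
    print("directories containing libnccl:", dirs_with_library("libnccl.so", path))
```

An empty list means none of the listed directories contains a matching file, so the variable is either unset in that shell or points somewhere else.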

cshanbo commented 7 years ago

BTW, the error does not occur in a single-node scenario, only in a multi-node scenario.

tsirif commented 7 years ago

Could you check whether the path variables are indeed set correctly on each node? Also (just as a reminder), is NCCL installed on both nodes?

tsirif commented 7 years ago

Shanbo, I have an idea. Maybe line 128 of platoon-launcher (https://github.com/cshanbo/platoon/blob/fix/multi-node/scripts/platoon-launcher#L128) is at fault. Please comment it out and check whether the path variables are OK. I believe that it messes up the path variables for the other nodes. You could probably still keep the THEANO_FLAGS variable, though.

command += shlex.split(" -x " + theano_flags)
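For context, Open MPI's mpirun uses `-x NAME` to export an environment variable to the remote nodes, one variable per `-x` option. The sketch below shows how a launcher might forward only selected variables. It is an illustration, not the platoon-launcher code: `export_args` is a hypothetical helper and `python worker.py` is a placeholder command.

```python
import os
import shlex


def export_args(names, env=None):
    """Build mpirun '-x NAME' arguments for the variables that are set in env."""
    env = os.environ if env is None else env
    args = []
    for name in names:
        if name in env:
            args += ["-x", name]  # Open MPI accepts one variable per -x option
    return args


# Variable names are the ones mentioned in this thread; the worker command
# is a placeholder and not part of platoon.
forwarded = ["THEANO_FLAGS", "PATH", "LD_LIBRARY_PATH"]
command = ["mpirun"] + export_args(forwarded) + shlex.split("python worker.py")
print(" ".join(shlex.quote(part) for part in command))
```

Variables that are unset on the launching machine are skipped, so the printed command shows exactly what would be exported to the other nodes.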

cshanbo commented 7 years ago

Hi Christos,

I commented out that line and kept THEANO_FLAGS, then ran the single-node version, and got the error /usr/bin/ld: cannot find -lcudnn. After some tests, I kept THEANO_FLAGS, PATH, LD_LIBRARY_PATH and PLATOON_TEST_WORKER_NUM, which made the single-node version work, while the multi-node version still raises the same exception.

example/lstm works fine on each node separately in a single-node scenario.

Thank you. Shanbo

tsirif commented 7 years ago

I commented that line and kept THEANO_FLAGS, then ran the single-node version, then an exception /usr/bin/ld: cannot find -lcudnn occurred.

A single-node run should not reach that code. Are you sure that you were running in a single-node scenario? I want to check something concerning environment variables in Open MPI, and then I will fix this. Could you please check the comments I made at tsirif/platoon#3, so that I can merge it and take your changes into account too?

Thank you, Christos

cshanbo commented 7 years ago

Hi, I thought I ran a single-node experiment, but it was in fact a multi-node experiment, because I used .platoonrc with platoon-launcher without setting the --multi argument.