Closed illumidas-agn closed 4 years ago
Hey @illumidas-agn, have you taken a look at the Conda install guide here?
If you're still having issues after going through that, feel free to provide log output showing where things are breaking.
Hi @tgaddair, I'm going to go through that and let you know how it goes.
I'm trying to run this from a server where I am not the sudo user. Are there any alternatives using pip or conda install?
Where are you running into permissions issues? @davidrpugh, do you have some thoughts on this?
If you don't need to use conda, you can also opt to install everything through pip in a virtual environment. But the important thing is you'll need to make sure the CUDA devtools are available when building in NCCL support. If this is an issue, you may want to see if you can run in a containerized environment.
I would need sudo to install the CUDA Toolkit, and sadly I'm not an admin, so I can't install it the conventional way. I got it running in my local virtual environment, but when I transferred it to the server I ran into issues. I tried the entire day to get Horovod and CUDA to run.
I can ask the admin of the server to install it if there's no quick fix for this.
Either that or setting up Docker/Singularity would probably be the easiest way, yes. It's certainly possible to install locally (see: https://stackoverflow.com/questions/39379792/install-cuda-without-root), but managing the correct environment variables will likely be a challenge.
I see. In that case I'll contact the server admin and get back to you if everything works. Thank you for your help.
@illumidas-agn you should be able to install Horovod using Conda without root privileges. You will need to use the cudatoolkit-dev=10.1 package from the Conda Forge channel. The environment file below should work (you will still need the other files referenced in the Conda install guide).
name: null
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - cmake=3.16
  - cudatoolkit-dev=10.1
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - matplotlib=3.2
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.5
  - nodejs=13
  - pip=20.1
  - pip:
    - mxnet-cu101mkl==1.6.* # makes sure installed prior to horovod
    - -r file:requirements.txt
  - python=3.7
  - pytorch=1.5
  - tensorboard=2.1
  - tensorflow-gpu=2.1
  - torchvision=0.6
Note that I have bumped a lot of version numbers from what is in the current guide. @tgaddair I will test this and then update the install guide accordingly. Perhaps the guide should explicitly indicate that you will need to use the cudatoolkit-dev approach if you don't have permissions to install the CUDA Toolkit as root.
The project I'm trying to run requires Python 3.6, but I will see whether 3.7 works. Will keep you updated.
@illumidas-agn Then just change the Python version to 3.6. Shouldn't impact the build.
I see, I just saw it listed in the requirements that you posted above.
@illumidas-agn I pin the version numbers in my environment file to the most recent versions of the various dependencies for which I am able to get a successful build. Other combinations of version numbers may also work just fine. Also note that if you only need TensorFlow then you can probably get by with the following environment file which should build more quickly.
name: null
channels:
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit-dev=10.1
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - matplotlib=3.2
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.5
  - nodejs=13
  - pip=20.1
  - pip:
    - -r file:requirements.txt
  - python=3.7 # python=3.6 should also work!
  - tensorboard=2.1
  - tensorflow-gpu=2.1
In the above environment file I have dropped the PyTorch and MXNet dependencies.
Perfect, currently installing all the packages as we speak
Installed all the packages, still getting errors when I use this command:
HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod[tensorflow]
Error at the end of log:
" Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/errors
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
"
Did you activate the Conda environment prior to running pip?
Yes, running it in the environment (Should be denoted by (envName) xxx@xxx)
After activating the Conda environment run the command conda list and share the output.
_libgcc_mutex 0.1 main
_tflow_select 2.1.0 gpu
absl-py 0.9.0 py36_0
astor 0.8.0 py36_0
astor 0.8.1
atari-py 0.2.6
backcall 0.2.0 py_0
binutils 2.33.1 h53a641e_8 conda-forge
binutils_impl_linux-64 2.33.1 he1b5a44_7 conda-forge
binutils_linux-64 2.33.1 h9595d00_17 conda-forge
blas 1.0 mkl
bleach 3.1.5 py_0
bleach 1.5.0
bokeh 2.1.1
c-ares 1.15.0 h7b6447c_1001
c-compiler 1.1.1 h516909a_0 conda-forge
ca-certificates 2020.6.24 0
cachetools 4.1.0 py_1
certifi 2020.6.20 py36_0
cffi 1.14.0 py36he30daa8_1
chardet 3.0.4 py36_1003
click 7.1.2 py_0
cloog 0.18.0 0
cloudpickle 1.3.0
cudatoolkit 10.0.130 0
cudatoolkit-dev 10.1.243 h516909a_3 conda-forge
cudnn 7.6.5 cuda10.0_0
cupti 10.0.130 0
cxx-compiler 1.1.1 hc9558a2_0 conda-forge
cycler 0.10.0 py36_0
dbus 1.13.16 hb2f20db_0
decorator 4.4.2 py_0
defusedxml 0.6.0 py_0
entrypoints 0.3 py36_0
expat 2.2.9 he6710b0_2
fontconfig 2.13.0 h9420a91_0
freetype 2.10.2 h5ab3b9f_0
future 0.18.2
gast 0.2.2
gmp 6.1.2 h6c8ec71_1
google-auth 1.17.2 py_0
google-auth-oauthlib 0.4.1 py_2
google-pasta 0.2.0 py_0
grpcio 1.27.2 py36hf8bcb03_0
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb31296c_0
gxx_impl_linux-64 7.3.0 hdf63c60_5 conda-forge
gxx_linux-64 7.3.0 h553295d_17 conda-forge
gym 0.17.2
hdf5 1.10.6 hb1b8bf9_0
html5lib 0.9999999
idna 2.10 py_0
importlib-metadata 1.7.0 py36_0
importlib_metadata 1.7.0 0
intel-openmp 2020.1 217
ipykernel 5.3.3 py36h5ca1d4c_0
ipython 7.16.1 py36h5ca1d4c_0
ipython_genutils 0.2.0 py36_0
isl 0.12.2 0
jedi 0.17.1 py36_0
jinja2 2.11.2 py_0
jpeg 9b h024ee3a_2
json5 0.9.5 py_0
jsonschema 3.2.0 py36_0
jupyter_client 6.1.6 py_0
jupyter_core 4.6.3 py36_0
jupyterlab 2.1.5 py_0
jupyterlab_server 1.2.0 py_0
keras-applications 1.0.8 py_1
keras-preprocessing 1.1.0 py_1
kiwisolver 1.2.0 py36hfd86e86_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc 7.2.0 h69d50b8_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libprotobuf 3.12.3 hd408876_0
libsodium 1.0.18 h7b6447c_0
libstdcxx-ng 9.1.0 hdf63c60_0
libuuid 1.0.3 h1bed415_2
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 he19cac6_1
markdown 3.1.1 py36_0
markupsafe 1.1.1 py36h7b6447c_0
matplotlib 3.2.2 0
matplotlib-base 3.2.2 py36hef1b27d_0
mistune 0.8.4 py36h7b6447c_0
mkl 2020.1 217
mkl-service 2.3.0 py36he904b0f_0
mkl_fft 1.1.0 py36h23d657b_0
mkl_random 1.1.1 py36h0573a6f_0
mpc 1.0.3 hec55b23_5
mpfr 3.1.5 h11a74b3_2
mpi4py 3.0.3
nbformat 5.0.7 py_0
nccl 1.3.5 cuda10.0_0
ncurses 6.2 he6710b0_1
nodejs 10.13.0 he6710b0_0
notebook 6.0.3 py36_0
numpy 1.18.5 py36ha1c710e_0
numpy-base 1.18.5 py36hde5b4d6_0
oauthlib 3.1.0 py_0
opencv-python 4.3.0.36
opt_einsum 3.1.0 py_0
packaging 20.4 py_0
pandoc 2.10 0
pandocfilters 1.4.2 py36_1
parso 0.7.0 py_0
pcre 8.44 he6710b0_0
pexpect 4.8.0 py36_0
pickleshare 0.7.5 py36_0
Pillow 7.2.0
prometheus_client 0.8.0 py_0
prompt-toolkit 3.0.5 py_0
protobuf 3.12.3 py36he6710b0_0
psutil 5.7.2
pyasn1 0.4.8 py_0
pyasn1-modules 0.2.7 py_0
pycparser 2.20 py_2
pyglet 1.5.0
pyjwt 1.7.1 py36_0
pyopenssl 19.1.0 py_1
pyparsing 2.4.7 py_0
pyqt 5.9.2 py36h05f1152_2
pyrsistent 0.16.0 py36h7b6447c_0
pysocks 1.7.1 py36_0
python 3.6.10 h7579374_2
python-dateutil 2.8.1 py_0
python_abi 3.6 1_cp36m conda-forge
PyYAML 5.3.1
qt 5.9.7 h5867ecd_1
readline 8.0 h7b6447c_0
requests 2.24.0 py_0
requests-oauthlib 1.3.0 py_0
rsa 4.0 py_0
scipy 1.5.0 py36h0b6359f_0
send2trash 1.5.0 py36_0
setuptools 49.2.0 py36_0
sip 4.19.8 py36hf484d3e_0
six 1.15.0 py_0
sqlite 3.32.3 h62c20be_0
tensorboard 2.2.1 pyh532a8cf_0
tensorboard 1.15.0
tensorflow 1.15.2
tensorflow-base 2.0.0 gpu_py36h0ec5d1f_0
tensorflow-estimator 1.15.1
tensorflow-gpu 2.0.0 h0d30ee6_0
tensorflow-tensorboard 1.5.1
terminado 0.8.3 py36_0
testpath 0.4.4 py_0
tk 8.6.10 hbc83047_0
tornado 6.0.4 py36h7b6447c_1
traitlets 4.3.3 py36_0
typing-extensions 3.7.4.2
wcwidth 0.2.5 py_0
webencodings 0.5.1 py36_1
werkzeug 0.16.1 py_0
wheel 0.34.2 py36_0
wrapt 1.12.1 py36h7b6447c_1
xz 5.2.5 h7b6447c_0
zeromq 4.3.2 he6710b0_2
zipp 3.1.0 py_0
zlib 1.2.11 h7b6447c_3
What version of TensorFlow are you trying to use? What changes did you make to the environment file I sketched above? You seem to have two different versions of TensorFlow installed 2.0 and 1.15; as well as different versions of the various CUDA toolkit libraries.
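One way to spot this kind of mix-up without reading the whole listing by eye is to group the conda list rows by package name. The sketch below is only illustrative: find_duplicate_packages is a hypothetical helper, and it assumes each row starts with the package name followed by the version, as in the output above.

```python
from collections import defaultdict


def find_duplicate_packages(conda_list_text):
    """Map each package name that is listed more than once to its versions."""
    versions = defaultdict(list)
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Skip blank lines and the '#' header lines that conda list prints.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        name, version = fields[0].lower(), fields[1]
        versions[name].append(version)
    return {name: found for name, found in versions.items() if len(found) > 1}


# A few rows copied from the listing above.
sample = """\
tensorboard 2.2.1 pyh532a8cf_0
tensorboard 1.15.0
tensorflow 1.15.2
tensorflow-gpu 2.0.0 h0d30ee6_0
cudatoolkit 10.0.130 0
cudatoolkit-dev 10.1.243 h516909a_3 conda-forge
"""
duplicates = find_duplicate_packages(sample)
print(duplicates)
```

This only catches identical names listed twice. Conflicts between differently named packages (tensorflow 1.15.2 next to tensorflow-gpu 2.0.0, or cudatoolkit 10.0 next to cudatoolkit-dev 10.1) still have to be checked by hand.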
OK. Well, for sure, to use TensorFlow 1.15 you will need to make changes to the environment file that I suggested above. For one, I think you will need an older version of cudatoolkit-dev and an older version of nccl. I think the following environment file should work.
name: null
channels:
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit-dev=10.0
  - cudnn=7.6
  - cupti=10.0
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.4
  - nodejs=13
  - pip=20.1
  - python=3.6
  - tensorboard=1.15
  - tensorflow-gpu=1.15
I forget whether you need to install Keras separately with TensorFlow 1.15 or not.
You will also need to set the following environment variables slightly differently than what is mentioned in the user guide, given that you are using the cudatoolkit-dev approach.
$ export ENV_PREFIX=$PWD/env
$ export HOROVOD_CUDA_HOME=$ENV_PREFIX
$ export HOROVOD_NCCL_HOME=$ENV_PREFIX
$ export HOROVOD_GPU_OPERATIONS=NCCL
Next, create the Conda environment and try building Horovod using the following commands.
conda env create --prefix $ENV_PREFIX --file environment.yml --force
conda activate $ENV_PREFIX
pip install --no-cache-dir horovod==0.19.*
Don't bother explicitly setting HOROVOD_WITH_TENSORFLOW=1 and specifying horovod[tensorflow]. Just install Horovod and let Horovod determine which bindings to build (given only TF is installed it should only build TensorFlow).
It managed to install almost every package except the last one:
ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::cudatoolkit-dev-10.0-2'.
LinkError: post-link script failed for package conda-forge::cudatoolkit-dev-10.0-2
running your command again with -v will provide additional information
location of failed script: /home/cogs5/ge/go-explore-master/env/bin/.cudatoolkit-dev-post-link.sh
I used conda list on my local machine, where go-explore works, and it appears that the CUDA Toolkit isn't used? This is all very confusing...
_libgcc_mutex 0.1 main
absl-py 0.9.0 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
astor 0.8.1 pypi_0 pypi
atari-py 0.2.6 pypi_0 pypi
baselines 0.1.6 dev_0
ca-certificates 2020.6.24 0
certifi 2020.6.20 py36_0
cffi 1.14.0 pypi_0 pypi
click 7.1.2 pypi_0 pypi
cloudpickle 1.2.2 pypi_0 pypi
cycler 0.10.0 py36_0
dataclasses 0.7 pypi_0 pypi
dbus 1.13.16 hb2f20db_0
decorator 4.4.2 pypi_0 pypi
expat 2.2.9 he6710b0_2
fontconfig 2.13.0 h9420a91_0
freetype 2.10.2 h5ab3b9f_0
future 0.18.2 pypi_0 pypi
gast 0.2.2 pypi_0 pypi
glib 2.65.0 h3eb4bd4_0
google-pasta 0.2.0 pypi_0 pypi
grpcio 1.30.0 pypi_0 pypi
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb31296c_0
gym 0.15.7 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
horovod 0.19.5 pypi_0 pypi
icu 58.2 he6710b0_3
imageio 2.9.0 py_0
importlib-metadata 1.7.0 pypi_0 pypi
intel-openmp 2020.1 217
joblib 0.16.0 pypi_0 pypi
jpeg 9b h024ee3a_2
keras-applications 1.0.8 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.2.0 py36hfd86e86_0
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
libuuid 1.0.3 h1bed415_2
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 he19cac6_1
loky 2.8.0 pypi_0 pypi
lz4-c 1.9.2 he6710b0_0
mako 1.1.3 pypi_0 pypi
markdown 3.2.2 pypi_0 pypi
markupsafe 1.1.1 pypi_0 pypi
matplotlib 3.2.2 0
matplotlib-base 3.2.2 py36hef1b27d_0
mkl 2020.1 217
mkl-service 2.3.0 py36he904b0f_0
mkl_fft 1.1.0 py36h23d657b_0
mkl_random 1.1.1 py36h0573a6f_0
mpi 1.0 mpich
mpi4py 3.0.3 py36h028fd6f_0
mpich 3.3.2 hc856adb_0
ncurses 6.2 he6710b0_1
numpy 1.18.5 py36ha1c710e_0
numpy-base 1.18.5 py36hde5b4d6_0
nvcc_linux-64 11.0 h4962215_6 nvidia
olefile 0.46 py36_0
opencv-python 4.3.0.36 pypi_0 pypi
openssl 1.1.1g h7b6447c_0
opt-einsum 3.3.0 pypi_0 pypi
pcre 8.44 he6710b0_0
pillow 7.2.0 py36hb39fc2d_0
pip 20.1.1 py36_1
protobuf 3.12.2 pypi_0 pypi
psutil 5.7.2 pypi_0 pypi
pycparser 2.20 pypi_0 pypi
pyglet 1.5.0 pypi_0 pypi
pyparsing 2.4.7 py_0
pyqt 5.9.2 py36h05f1152_2
python 3.6.10 h7579374_2
python-dateutil 2.8.1 py_0
pytools 2020.3.1 pypi_0 pypi
pyyaml 5.3.1 pypi_0 pypi
qt 5.9.7 h5867ecd_1
readline 8.0 h7b6447c_0
scipy 1.5.1 pypi_0 pypi
setuptools 49.2.0 py36_0
sip 4.19.8 py36hf484d3e_0
six 1.15.0 py_0
sqlite 3.32.3 h62c20be_0
tensorboard 1.15.0 pypi_0 pypi
tensorflow 1.15.0 pypi_0 pypi
tensorflow-estimator 1.15.1 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
tk 8.6.10 hbc83047_0
tornado 6.0.4 py36h7b6447c_1
tqdm 4.48.0 pypi_0 pypi
werkzeug 1.0.1 pypi_0 pypi
wheel 0.34.2 py36_0
wrapt 1.12.1 pypi_0 pypi
xz 5.2.5 h7b6447c_0
zipp 3.1.0 pypi_0 pypi
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h0b5b093_0
Looks like TensorFlow is still being installed via pip from pypi and not from Conda channels as expected. Are you installing TF by including tensorflow-gpu=1.15 as a dependency in your environment file? Or are you installing TF via pip?
Getting the cudatoolkit-dev package to work properly is a bit tedious and error prone, which is why the Conda guide recommends that people install the full CUDA Toolkit themselves and then use the nvcc_linux-64 package to configure the environment, but I guess the standard install of the CUDA Toolkit requires root.
There are several builds of cudatoolkit-dev available on conda-forge. From your error I can see that you got the second one, but you needed the third.
cudatoolkit-dev 10.0 1 conda-forge
cudatoolkit-dev 10.0 2 conda-forge
cudatoolkit-dev 10.0 py36_0 conda-forge
Replace the cudatoolkit-dev=10.0 with cudatoolkit-dev=10.0=py36_0 to make sure that the specific build number is picked up during install. Try again and let me know if that helps.
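To make the name=version=build form concrete, here is a minimal sketch that splits such a spec into its three parts. parse_conda_spec is a hypothetical helper written for illustration, not part of conda, and it deliberately handles only the simple single-"=" form used in this thread (no "==", ">=" or channel prefixes).

```python
def parse_conda_spec(spec):
    """Split 'name=version=build' into parts; missing parts become None."""
    parts = spec.split("=")
    # Reject empty fields (e.g. from '==') and anything with more than 3 parts.
    if len(parts) > 3 or not all(parts):
        raise ValueError(f"unsupported spec: {spec!r}")
    name = parts[0]
    version = parts[1] if len(parts) > 1 else None
    build = parts[2] if len(parts) > 2 else None
    return name, version, build


print(parse_conda_spec("cudatoolkit-dev=10.0"))
print(parse_conda_spec("cudatoolkit-dev=10.0=py36_0"))
```

With only name and version given, the build field is left open and the solver is free to pick any of the available builds, which is how the unwanted build got selected here.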
@illumidas-agn I have created (and tested) a Horovod build for TensorFlow 1.15 using cudatoolkit-dev=10.0=py36_0 with support for JupyterLab. All of the required config files can be found here. In particular see the bin/create-conda-env.sh script which automates the environment creation process. Follow the instructions carefully and let me know how you get on!
@tgaddair When I built the environment for Horovod 0.19.5 with HOROVOD_GPU_OPERATIONS=NCCL set and then ran horovodrun --check-build, it appeared that NCCL support was not built. When I built the environment using HOROVOD_GPU_ALLREDUCE=NCCL and HOROVOD_GPU_BROADCAST=NCCL and then ran horovodrun --check-build, it appeared that NCCL support was built.
Hopefully NCCL support was actually built with HOROVOD_GPU_OPERATIONS=NCCL and this is just a bug in the horovodrun --check-build command.
Hey @davidrpugh, unfortunately HOROVOD_GPU_OPERATIONS was added recently to master and has not been released yet.
I recommend consulting the stable docs (for the latest release) when not building from source:
https://horovod.readthedocs.io/en/stable/summary_include.html#install
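The difference between the two sets of variables can be sketched as a small helper that prepares the build environment. This is an illustration under assumptions: horovod_build_env and its use_gpu_operations flag are hypothetical names, and only the variable names themselves come from this thread.

```python
import os


def horovod_build_env(env_prefix, use_gpu_operations=False, base_env=None):
    """Return a copy of the environment with Horovod build variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # With the cudatoolkit-dev approach, CUDA and NCCL live in the Conda prefix.
    env["HOROVOD_CUDA_HOME"] = env_prefix
    env["HOROVOD_NCCL_HOME"] = env_prefix
    if use_gpu_operations:
        # Per the reply above, only understood when building from master.
        env["HOROVOD_GPU_OPERATIONS"] = "NCCL"
    else:
        # Variables reported in this thread to work with the 0.19.5 release.
        env["HOROVOD_GPU_ALLREDUCE"] = "NCCL"
        env["HOROVOD_GPU_BROADCAST"] = "NCCL"
    return env


# Placeholder prefix; an empty base_env keeps the printed output short.
env = horovod_build_env("/path/to/project/env", base_env={})
for key in sorted(env):
    print(f"{key}={env[key]}")
```

The returned dictionary could then be passed as the env argument of subprocess.run when invoking pip install --no-cache-dir horovod==0.19.* from inside the activated environment.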
I used pip, yes, in order to get the specific version.
I'm not sure what you mean by installing the specific CUDA version. Do I just use conda install cudatoolkit-dev=10.0=py36_0? Or do I need to edit the environment.yml file to include this when I build an environment?
EDIT: Got it working below
I apologize for the hassle, I'm brand new to Anaconda and the ML environment.
I'm currently following the sh script that you linked to try to create the environment. I'll let you know how that goes.
Followed the scripts in the directory you showed me and overwrote the environment.yml file according to what was in that directory.
Currently it's unable to find python3.6:
"ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::astor-0.8.1-pyh9f0ad1d_0'. FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'") Attempting to roll back.
Rolling back transaction: done
FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'") "
No worries @illumidas-agn! I am very experienced using Conda (+Pip) to build Python data science and machine learning environments and building the Horovod environment is probably the most difficult build that I have yet encountered. You are definitely jumping into the deep end!
There are three config files and you need ALL three for the environment build to work properly: environment.yml, requirements.txt, and postBuild. The bin/create-conda-env.sh script shows how these three files are connected. I would advise copying all three config files and the environment build script into your project directory, then deleting any existing environment and running the build script to re-create the environment from scratch. You should not be using conda install or pip install to install individual packages one at a time.
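A quick pre-flight check before running the build script might look like the sketch below. The three file names are the ones listed above; missing_config_files is a hypothetical helper, and it assumes the files sit directly in the project directory.

```python
from pathlib import Path

# The three config files named in this thread.
REQUIRED_FILES = ("environment.yml", "requirements.txt", "postBuild")


def missing_config_files(project_dir):
    """List the required config files that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Check the current working directory.
missing = missing_config_files(".")
print("missing:", missing if missing else "none")
```

An empty result only says the files are present, not that their contents match the tested versions.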
I've downloaded the files you recommended above and put them onto the server. I've removed the environments (except one that I need for another project) and I'm still running into the same Python error:
(base) cogs5@sci-gpu:~/ge/go-explore-master$ ./create-conda-env.sh
Solving environment: done
Downloading and Extracting Packages
libgfortran-ng 7.5.0: ####################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: failed
ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::astor-0.8.1-pyh9f0ad1d_0'.
FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")
Attempting to roll back.
Rolling back transaction: done
FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")
What version of conda are you using? If you don’t know, then conda info will tell you.
Try running conda clean --all, which will clean out cruft from all these failed builds, and then try running the build script again. There are a few more aggressive options that we can try if this doesn’t work.
https://docs.conda.io/projects/conda/en/latest/commands/clean.html
I ran conda clean and cleared out the cruft.
It's giving me the same astor error as previously posted.
conda info yields the following information:
conda version : 4.4.10
conda-build version : 3.4.1
python version : 3.6.4.final.0
I should have asked this question at the very beginning. That conda version is at least two years old. You can try to update in place, but I think the best approach would be to uninstall and reinstall.
Are you using conda as part of the Anaconda Python distribution or the Miniconda Python distribution?
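Checking the version up front can be automated with a few lines. The sketch below parses the "conda version" line from conda info output and compares it with a cut-off. parse_conda_version and MINIMUM are hypothetical names, and the 4.6 threshold is an assumption on my part (as far as I know, that is the release that introduced conda init), so adjust it as needed.

```python
import re


def parse_conda_version(conda_info_text):
    """Extract the conda version from `conda info` output as a tuple of ints."""
    match = re.search(r"conda version\s*:\s*(\d+(?:\.\d+)*)", conda_info_text)
    if match is None:
        raise ValueError("no 'conda version' line found")
    return tuple(int(part) for part in match.group(1).split("."))


# Assumed cut-off, chosen for illustration only.
MINIMUM = (4, 6)

# Lines as reported earlier in this thread.
info = """\
conda version : 4.4.10
conda-build version : 3.4.1
python version : 3.6.4.final.0
"""
version = parse_conda_version(info)
print(version, "OK" if version >= MINIMUM else "too old")
```

Comparing integer tuples avoids the string-comparison trap where "4.10" would sort before "4.4".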
I'm not sure, this version was left on the server some time ago. I'll contact the admin again and post here once it's been updated. I'll retry all the steps and go from there.
I've received a response from the admin of the server stating that:
This machine is in desperate need of an upgrade, there isn't a version of CUDA 10 available for it. You might be better running it on your assigned machine instead as that is much more up to date in comparison.
There is a selection of cuda toolkit versions on sci-gpu, however these are old as no one has requested an additional version in some time.
ll /media/data/cuda
To select the different versions you need to change the following environment variables in your bashrc file.
export CUDA_HOME="/media/data/cuda/cuda-8.0-cudnn5"
export PATH="/media/data/cuda/cuda-8.0-cudnn5/bin:$PATH"
export LD_LIBRARY_PATH="/media/data/cuda/cuda-8.0-cudnn5/lib64:/usr/lib:/usr/openwin/lib:/usr/dt/lib:/X11.6/lib:/X11.5/lib:/uva/lib:/gnu/lib:/opt/libgpuarray/lib:/usr/lib64"
I'm assuming that Horovod isn't able to run on CUDA this old, is it?
You are already installing CUDA 10.0 via Conda so none of the above about changing environment variables is relevant. You will still need GPU drivers that are capable of supporting CUDA 10.0. What NVIDIA drivers are installed on the server?
I'm not sure what kind are installed. How do I best check this? conda info?
@illumidas-agn you should be able to run nvidia-smi, which will generate output including the driver version and system CUDA version.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:04:00.0 Off |                    0 |
| N/A   43C    P0    59W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:05:00.0 Off |                    0 |
| N/A   35C    P0    73W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40c          Off  | 00000000:81:00.0 Off |                    0 |
| 23%   31C    P0    64W / 235W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   42C    P0    69W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:85:00.0 Off |                    0 |
| N/A   34C    P0    83W / 149W |      0MiB / 11441MiB |     49%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
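The two values that matter here can be pulled out of the header line with a pair of regular expressions. This is a minimal sketch: parse_nvidia_smi_header and as_tuple are hypothetical helpers, and the (10, 0) comparison simply reflects the CUDA 10.0 toolkit being installed through Conda in this thread. As I understand it, the "CUDA Version" field reports the newest CUDA version the installed driver supports, which is worth confirming against NVIDIA's documentation.

```python
import re


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version) strings from nvidia-smi output."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    if driver is None or cuda is None:
        raise ValueError("header line not found")
    return driver.group(1), cuda.group(1)


def as_tuple(version):
    """Turn a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


# Header line taken from the output above.
header = "| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |"
driver, cuda = parse_nvidia_smi_header(header)
print(driver, cuda, as_tuple(cuda) >= (10, 0))
```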
@illumidas-agn I don't see anything wrong with those driver versions. Seems like the default CUDA installed on the server is 10.1. Can you run the command which conda? This will give the path to the conda binary and will probably indicate whether it was installed via Anaconda or Miniconda.
You need to update your Conda to the most recent version, likely by uninstalling Anaconda/Miniconda and then reinstalling the most recent version. Once this is done, re-run the environment creation script as you were doing before and it should work.
Here is the output from which conda: /opt/anaconda3/bin/conda
Sadly the admin can't update conda right now, which really hinders this. I've been running the algorithm locally up until this point regardless.
Looks like you have Anaconda3 distribution installed by the sysadmin. You can install Miniconda yourself in your user home using scripts I put together in this repo.
https://github.com/kaust-rccl/ibex-miniconda-install
After installation the output of which conda should be something like ~/miniconda3/bin/conda.
Ok that worked!
/home/cogs5/miniconda3/bin/conda
What would be the next step from here?
EDIT: I ran the create-conda-env.sh script and it worked! However, I can't find the environment now. I named it 'explore', but it doesn't appear when I list the environments.
In the original create-env script I had to change conda to source in order to activate the environment, but the wrong name comes up.
The environment creation script creates the environment in a subdirectory called env of the "project directory" (i.e., the directory in which you ran the environment creation script). In the original script you had to change conda activate to source activate because you were using a very old version of Conda. Please make sure to use conda activate from now on.
Sorry for the delay in my response. I'm still getting this huge error when I run the script:
#
# To activate this environment, use
#
# $ conda activate /home/cogs5/ge/go-explore-master/env
#
# To deactivate an active environment, use
#
# $ conda deactivate
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
Errored, use --debug for full output:
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.
Errored, use --debug for full output:
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 1171, in _node_check
proc = Process(['node', 'node-version-check.js'], cwd=HERE, quiet=True)
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/process.py", line 73, in __init__
self.proc = self._create_process(cwd=cwd, env=env)
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/process.py", line 131, in _create_process
cmd[0] = which(cmd[0], kwargs.get('env'))
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/jlpmapp.py", line 59, in which
raise ValueError(msg)
ValueError: Please install nodejs 5+ and npm before continuing installation. nodejs may be installed using conda or directly from the nodejs website.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/anaconda3/bin/jupyter-lab", line 11, in <module>
sys.exit(main())
File "/opt/anaconda3/lib/python3.6/site-packages/jupyter_core/application.py", line 266, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/opt/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/opt/anaconda3/lib/python3.6/site-packages/notebook/notebookapp.py", line 1571, in start
super(NotebookApp, self).start()
File "/opt/anaconda3/lib/python3.6/site-packages/jupyter_core/application.py", line 255, in start
self.subapp.start()
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/labapp.py", line 64, in start
command=command, logger=self.log)
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 239, in build
_node_check()
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 1175, in _node_check
raise ValueError(msg)
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.
@illumidas-agn You need to configure your shell to use the conda activate command. This should have been done for you if you ran the installer script from the repo I linked above. Regardless, assuming you are using bash, you need to run the following commands:
conda init bash
source ~/.bashrc # avoids having to restart terminal to load changes made by conda init
Once you have initialized conda, re-run the environment build script. I suspect the other errors are coming from the fact that the environment has not been activated properly before the jupyter lab command is run in the postBuild script.
Keep going! I think we are almost there...
Still getting the same error, I don't think that worked.
ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::astor-0.8.1-pyh9f0ad1d_0'.
FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")
Attempting to roll back.
Rolling back transaction: done
FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If your shell is Bash or a Bourne variant, enable conda for the current user with
$ echo ". /opt/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
or, for all users, enable conda with
$ sudo ln -s /opt/anaconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh
The options above will permanently enable the 'conda' command, but they do NOT
put conda's base (root) environment on PATH. To do so, run
$ conda activate
in your terminal, or to put the base environment on PATH permanently, run
$ echo "conda activate" >> ~/.bashrc
Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
your ~/.bashrc file. You should manually remove the line that looks like
export PATH="/opt/anaconda3/bin:$PATH"
^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^
Errored, use --debug for full output:
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.
Errored, use --debug for full output:
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 1171, in _node_check
proc = Process(['node', 'node-version-check.js'], cwd=HERE, quiet=True)
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/process.py", line 73, in __init__
self.proc = self._create_process(cwd=cwd, env=env)
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/process.py", line 131, in _create_process
cmd[0] = which(cmd[0], kwargs.get('env'))
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/jlpmapp.py", line 59, in which
raise ValueError(msg)
ValueError: Please install nodejs 5+ and npm before continuing installation. nodejs may be installed using conda or directly from the nodejs website.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/anaconda3/bin/jupyter-lab", line 11, in <module>
sys.exit(main())
File "/opt/anaconda3/lib/python3.6/site-packages/jupyter_core/application.py", line 266, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/opt/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/opt/anaconda3/lib/python3.6/site-packages/notebook/notebookapp.py", line 1571, in start
super(NotebookApp, self).start()
File "/opt/anaconda3/lib/python3.6/site-packages/jupyter_core/application.py", line 255, in start
self.subapp.start()
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/labapp.py", line 64, in start
command=command, logger=self.log)
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 239, in build
_node_check()
File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 1175, in _node_check
raise ValueError(msg)
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.
The issues you are experiencing are caused by /opt/anaconda3 still being on your PATH somewhere. Can you please share the output of the following commands?
echo $SHELL
echo $PATH
cat ~/.bashrc
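A long PATH is easier to read with one entry per line, in lookup order. The following is a minimal sketch, assuming a POSIX shell with tr and awk available; list_path_entries is a hypothetical helper name, not part of conda.

```shell
# list_path_entries (hypothetical helper): print each entry of a
# colon-separated PATH string on its own line, numbered in lookup order,
# and mark the entries that mention anaconda3.
list_path_entries() {
    printf '%s\n' "$1" | tr ':' '\n' | awk '{
        flag = ($0 ~ /anaconda3/) ? " <-- anaconda3" : ""
        printf "%d %s%s\n", NR, $0, flag
    }'
}

# Inspect the PATH of the current shell.
list_path_entries "$PATH"
```

Entries nearer the top are searched first, so a marked anaconda3 entry above the miniconda3 one means the Anaconda binaries take precedence.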
cogs5@sci-gpu:~/ge/go-explore-master$ echo $SHELL
/bin/sh
cogs5@sci-gpu:~/ge/go-explore-master$ echo $PATH
/opt/anaconda3/bin:/home/cogs5/miniconda3/condabin:/opt/anaconda3/bin:/media/data/cuda/cuda-8.0-cudnn5/bin:/opt/torch/install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/jvm/java-8-oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/java-8-oracle/jre/bin
cogs5@sci-gpu:~/ge/go-explore-master$ cat ~/.bashrc
# ~/.bashrc: executed by bash(1) for non-login shells.
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples
# If not running interactively, don't do anything
case $- in
*i*) ;;
*) return;;
esac
# don't put duplicate lines or lines starting with space in the history.
# See bash(1) for more options
HISTCONTROL=ignoreboth
# append to the history file, don't overwrite it
shopt -s histappend
# for setting history length see HISTSIZE and HISTFILESIZE in bash(1)
HISTSIZE=1000
HISTFILESIZE=2000
# check the window size after each command and, if necessary,
# update the values of LINES and COLUMNS.
shopt -s checkwinsize
# If set, the pattern "**" used in a pathname expansion context will
# match all files and zero or more directories and subdirectories.
#shopt -s globstar
# make less more friendly for non-text input files, see lesspipe(1)
[ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"
# set variable identifying the chroot you work in (used in the prompt below)
if [ -z "${debian_chroot:-}" ] && [ -r /etc/debian_chroot ]; then
debian_chroot=$(cat /etc/debian_chroot)
fi
# set a fancy prompt (non-color, unless we know we "want" color)
case "$TERM" in
xterm-color) color_prompt=yes;;
esac
# uncomment for a colored prompt, if the terminal has the capability; turned
# off by default to not distract the user: the focus in a terminal window
# should be on the output of commands, not on the prompt
#force_color_prompt=yes
if [ -n "$force_color_prompt" ]; then
if [ -x /usr/bin/tput ] && tput setaf 1 >&/dev/null; then
# We have color support; assume it's compliant with Ecma-48
# (ISO/IEC-6429). (Lack of such support is extremely rare, and such
# a case would tend to support setf rather than setaf.)
color_prompt=yes
else
color_prompt=
fi
fi
if [ "$color_prompt" = yes ]; then
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
else
PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
fi
unset color_prompt force_color_prompt
# If this is an xterm set the title to user@host:dir
case "$TERM" in
xterm*|rxvt*)
PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u@\h: \w\a\]$PS1"
;;
*)
;;
esac
# enable color support of ls and also add handy aliases
if [ -x /usr/bin/dircolors ]; then
test -r ~/.dircolors && eval "$(dircolors -b ~/.dircolors)" || eval "$(dircolors -b)"
alias ls='ls --color=auto'
#alias dir='dir --color=auto'
#alias vdir='vdir --color=auto'
alias grep='grep --color=auto'
alias fgrep='fgrep --color=auto'
alias egrep='egrep --color=auto'
fi
# some more ls aliases
alias ll='ls -alF'
alias la='ls -A'
alias l='ls -CF'
# Add an "alert" alias for long running commands. Use like so:
# sleep 10; alert
alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'
# Alias definitions.
# You may want to put all your additions into a separate file like
# ~/.bash_aliases, instead of adding them here directly.
# See /usr/share/doc/bash-doc/examples in the bash-doc package.
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi
# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
if ! shopt -oq posix; then
if [ -f /usr/share/bash-completion/bash_completion ]; then
. /usr/share/bash-completion/bash_completion
elif [ -f /etc/bash_completion ]; then
. /etc/bash_completion
fi
fi
export PATH="/opt/anaconda3/bin:$PATH"
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/cogs5/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/home/cogs5/miniconda3/etc/profile.d/conda.sh" ]; then
. "/home/cogs5/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/home/cogs5/miniconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
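The ~/.bashrc above still contains the export PATH="/opt/anaconda3/bin:$PATH" line, just before the conda initialize block. The following is a minimal sketch for commenting that line out while keeping a backup, assuming a sed that accepts -i.bak; disable_anaconda_path is a hypothetical helper name. The PATH output above lists /opt/anaconda3/bin twice, so another startup file may add it as well; that is an assumption the thread does not confirm.

```shell
# disable_anaconda_path (hypothetical helper): comment out any line in the
# given startup file that prepends /opt/anaconda3/bin to PATH.
disable_anaconda_path() {
    rc_file="$1"
    # Show the matching lines, with line numbers, before changing anything.
    grep -n 'export PATH="/opt/anaconda3/bin' "$rc_file"
    # Prefix each matching line with "# "; the original file is kept
    # alongside it as "$rc_file.bak".
    sed -i.bak 's|^export PATH="/opt/anaconda3/bin|# &|' "$rc_file"
}

# Usage (not run here):
#   disable_anaconda_path ~/.bashrc
```

The change only takes effect in a new shell session, since the current shell has already exported the old PATH.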
Your question:
I looked through all the open questions. I'm currently trying to run go-explore (https://github.com/uber-research/go-explore/tree/master/policy_based), and I have only managed to get Horovod working once.
I need it built with TensorFlow support (horovod.tensorflow), but when I try to force the TensorFlow flag during installation I get a 10-page log dump, and it is hard to tell what it actually needs.
How do I get Horovod running?
I'm not sure what I'm doing wrong; I've tried everything else.
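A long build log is easier to work with when it is saved to a file and searched for the first failures. The following is a minimal sketch, assuming a POSIX shell with grep and head; first_build_errors and the file name horovod_build.log are hypothetical. HOROVOD_WITH_TENSORFLOW=1 is the Horovod build variable that makes the install fail if TensorFlow support cannot be built.

```shell
# first_build_errors (hypothetical helper): print the first few lines of a
# saved build log that mention "error", prefixed with their line numbers.
first_build_errors() {
    log_file="$1"
    max_lines="${2:-5}"
    grep -n -i 'error' "$log_file" | head -n "$max_lines"
}

# Usage (not run here):
#   HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod 2>&1 | tee horovod_build.log
#   first_build_errors horovod_build.log
```

The earliest error in the log is usually the most useful one to share, since later errors are often consequences of it.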