horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
http://horovod.ai

Trying to install Horovod from a fresh conda environment (with tensorflow) and nothing seems to work #2138

Closed illumidas-agn closed 4 years ago

illumidas-agn commented 4 years ago

Environment:

  1. Framework: (TensorFlow)
  2. Framework version:
  3. Horovod version: 0.19.5
  4. MPI version: -
  5. CUDA version: -
  6. NCCL version: -
  7. Python version: 3.6
  8. OS and version: Ubuntu
  9. GCC version: -

Your question: Please ask your question here.

I looked through all the open questions. I'm currently trying to run go-explore (https://github.com/uber-research/go-explore/tree/master/policy_based), and I have only managed to get Horovod working once, for whatever reason.

I need it built with TensorFlow support (i.e. horovod.tensorflow), and when I try to force the TensorFlow flag during installation I get a 10-page log dump that makes it hard to tell what is actually missing.

How do I get horovod running?

I'm not sure what I'm doing wrong; I've tried everything else.

tgaddair commented 4 years ago

Hey @illumidas-agn, have you taken a look at the Conda install guide here?

If you're still having issues after going through that, feel free to provide log output showing where things are breaking.

illumidas-agn commented 4 years ago

Hi @tgaddair, I'm going to go through that and let you know how it goes.

illumidas-agn commented 4 years ago

I'm trying to run this on a server where I don't have sudo access. Are there any alternatives using pip or conda install?

tgaddair commented 4 years ago

Where are you running into permissions issues? @davidrpugh, do you have some thoughts on this?

If you don't need to use conda, you can also install everything through pip in a virtual environment. The important thing is that the CUDA devtools must be available when building with NCCL support. If that is an issue, you may want to see if you can run in a containerized environment.

illumidas-agn commented 4 years ago

Installing the CUDA Toolkit requires sudo and sadly I'm not an admin, so I can't install it the conventional way. I was able to get it running in my local virtual environment, but when I transferred it to the server I ran into issues. I spent the entire day trying to get Horovod and CUDA to run.

I can ask the server admin to install it if there's no quick fix for this.

tgaddair commented 4 years ago

Either that or setting up Docker/Singularity would probably be the easiest way, yes. It's certainly possible to install locally (see: https://stackoverflow.com/questions/39379792/install-cuda-without-root), but managing the correct environment variables will likely be a challenge.

illumidas-agn commented 4 years ago

I see. In that case I'll contact the server admin and get back to you if everything works. Thank you for your help.

davidrpugh commented 4 years ago

@illumidas-agn you should be able to install Horovod using Conda without root privileges. You will need to use the cudatoolkit-dev=10.1 package from the conda-forge channel. The environment file below should work (you will still need the other files referenced in the Conda install guide).

name: null

channels:
  - pytorch
  - conda-forge
  - defaults

dependencies:
  - cmake=3.16
  - cudatoolkit-dev=10.1
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - matplotlib=3.2
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.5
  - nodejs=13
  - pip=20.1
  - pip:
    - mxnet-cu101mkl==1.6.* # makes sure installed prior to horovod
    - -r file:requirements.txt
  - python=3.7
  - pytorch=1.5
  - tensorboard=2.1
  - tensorflow-gpu=2.1
  - torchvision=0.6 

Note that I have bumped a lot of version numbers from what is in the current guide. @tgaddair I will test this and then update the install guide accordingly, perhaps with an explicit note that you need the cudatoolkit-dev approach if you don't have permission to install the CUDA Toolkit as root.

illumidas-agn commented 4 years ago

The project I'm trying to run requires Python 3.6, but I'll see whether 3.7 works and keep you updated.

davidrpugh commented 4 years ago

@illumidas-agn Then just change the Python version to 3.6. Shouldn't impact the build.

illumidas-agn commented 4 years ago

I see. I only asked because 3.7 is listed in the requirements you posted above.

davidrpugh commented 4 years ago

@illumidas-agn I pin the version numbers in my environment file to the most recent versions of the various dependencies for which I am able to get a successful build. Other combinations of version numbers may also work just fine. Also note that if you only need TensorFlow then you can probably get by with the following environment file which should build more quickly.

name: null

channels:
  - conda-forge
  - defaults

dependencies:
  - cudatoolkit-dev=10.1
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - matplotlib=3.2
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.5
  - nodejs=13
  - pip=20.1
  - pip:
    - -r file:requirements.txt
  - python=3.7 # python=3.6 should also work!
  - tensorboard=2.1
  - tensorflow-gpu=2.1

In the above environment file I have dropped the PyTorch and MXNet dependencies.

illumidas-agn commented 4 years ago

Perfect, currently installing all the packages as we speak

illumidas-agn commented 4 years ago

Installed all the packages, still getting errors when I use this command:

HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod[tensorflow]

Error at the end of log:

" Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

"

davidrpugh commented 4 years ago

Did you activate the Conda environment prior to running pip?

illumidas-agn commented 4 years ago

Yes, I'm running it inside the environment (the prompt shows (envName) xxx@xxx).

davidrpugh commented 4 years ago

After activating the Conda environment run the command conda list and share the output.

illumidas-agn commented 4 years ago

_libgcc_mutex 0.1 main
_tflow_select 2.1.0 gpu
absl-py 0.9.0 py36_0
astor 0.8.0 py36_0
astor 0.8.1
astunparse 1.6.3 py_0
atari-py 0.2.6
attrs 19.3.0 py_0
backcall 0.2.0 py_0
binutils 2.33.1 h53a641e_8 conda-forge
binutils_impl_linux-64 2.33.1 he1b5a44_7 conda-forge
binutils_linux-64 2.33.1 h9595d00_17 conda-forge
blas 1.0 mkl
bleach 3.1.5 py_0
bleach 1.5.0
blinker 1.4 py36_0
bokeh 2.1.1
brotlipy 0.7.0 py36h7b6447c_1000
c-ares 1.15.0 h7b6447c_1001
c-compiler 1.1.1 h516909a_0 conda-forge
ca-certificates 2020.6.24 0
cachetools 4.1.0 py_1
certifi 2020.6.20 py36_0
cffi 1.14.0 py36he30daa8_1
chardet 3.0.4 py36_1003
click 7.1.2 py_0
cloog 0.18.0 0
cloudpickle 1.3.0
cmake 3.18.0
cryptography 2.9.2 py36h1ba5d50_0
cudatoolkit 10.0.130 0
cudatoolkit-dev 10.1.243 h516909a_3 conda-forge
cudnn 7.6.5 cuda10.0_0
cupti 10.0.130 0
cxx-compiler 1.1.1 hc9558a2_0 conda-forge
cycler 0.10.0 py36_0
dbus 1.13.16 hb2f20db_0
decorator 4.4.2 py_0
defusedxml 0.6.0 py_0
entrypoints 0.3 py36_0
expat 2.2.9 he6710b0_2
fontconfig 2.13.0 h9420a91_0
freetype 2.10.2 h5ab3b9f_0
future 0.18.2
gast 0.2.2 py36_0
gast 0.2.2
gcc_impl_linux-64 7.3.0 hd420e75_5 conda-forge
gcc_linux-64 7.3.0 h553295d_17 conda-forge
glib 2.65.0 h3eb4bd4_0
gmp 6.1.2 h6c8ec71_1
google-auth 1.17.2 py_0
google-auth-oauthlib 0.4.1 py_2
google-pasta 0.2.0 py_0
grpcio 1.27.2 py36hf8bcb03_0
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb31296c_0
gxx_impl_linux-64 7.3.0 hdf63c60_5 conda-forge
gxx_linux-64 7.3.0 h553295d_17 conda-forge
gym 0.17.2
h5py 2.10.0 py36hd6299e0_1
hdf5 1.10.6 hb1b8bf9_0
html5lib 0.9999999
icu 58.2 he6710b0_3
idna 2.10 py_0
importlib-metadata 1.7.0 py36_0
importlib_metadata 1.7.0 0
intel-openmp 2020.1 217
ipykernel 5.3.3 py36h5ca1d4c_0
ipython 7.16.1 py36h5ca1d4c_0
ipython_genutils 0.2.0 py36_0
isl 0.12.2 0
jedi 0.17.1 py36_0
jinja2 2.11.2 py_0
jpeg 9b h024ee3a_2
json5 0.9.5 py_0
jsonschema 3.2.0 py36_0
jupyter_client 6.1.6 py_0
jupyter_core 4.6.3 py36_0
jupyterlab 2.1.5 py_0
jupyterlab_server 1.2.0 py_0
keras-applications 1.0.8 py_1
keras-preprocessing 1.1.0 py_1
kiwisolver 1.2.0 py36hfd86e86_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc 7.2.0 h69d50b8_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libprotobuf 3.12.3 hd408876_0
libsodium 1.0.18 h7b6447c_0
libstdcxx-ng 9.1.0 hdf63c60_0
libuuid 1.0.3 h1bed415_2
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 he19cac6_1
markdown 3.1.1 py36_0
markupsafe 1.1.1 py36h7b6447c_0
matplotlib 3.2.2 0
matplotlib-base 3.2.2 py36hef1b27d_0
mistune 0.8.4 py36h7b6447c_0
mkl 2020.1 217
mkl-service 2.3.0 py36he904b0f_0
mkl_fft 1.1.0 py36h23d657b_0
mkl_random 1.1.1 py36h0573a6f_0
mpc 1.0.3 hec55b23_5
mpfr 3.1.5 h11a74b3_2
mpi4py 3.0.3
nbconvert 5.6.1 py36_0
nbformat 5.0.7 py_0
nccl 1.3.5 cuda10.0_0
ncurses 6.2 he6710b0_1
nodejs 10.13.0 he6710b0_0
notebook 6.0.3 py36_0
numpy 1.18.5 py36ha1c710e_0
numpy-base 1.18.5 py36hde5b4d6_0
oauthlib 3.1.0 py_0
opencv-python 4.3.0.36
openssl 1.1.1g h7b6447c_0
opt_einsum 3.1.0 py_0
packaging 20.4 py_0
pandoc 2.10 0
pandocfilters 1.4.2 py36_1
parso 0.7.0 py_0
pcre 8.44 he6710b0_0
pexpect 4.8.0 py36_0
pickleshare 0.7.5 py36_0
Pillow 7.2.0
pip 20.1.1 py36_1
prometheus_client 0.8.0 py_0
prompt-toolkit 3.0.5 py_0
protobuf 3.12.3 py36he6710b0_0
psutil 5.7.2
ptyprocess 0.6.0 py36_0
pyasn1 0.4.8 py_0
pyasn1-modules 0.2.7 py_0
pycparser 2.20 py_2
pyglet 1.5.0
pygments 2.6.1 py_0
pyjwt 1.7.1 py36_0
pyopenssl 19.1.0 py_1
pyparsing 2.4.7 py_0
pyqt 5.9.2 py36h05f1152_2
pyrsistent 0.16.0 py36h7b6447c_0
pysocks 1.7.1 py36_0
python 3.6.10 h7579374_2
python-dateutil 2.8.1 py_0
python_abi 3.6 1_cp36m conda-forge
PyYAML 5.3.1
pyzmq 19.0.1 py36he6710b0_1
qt 5.9.7 h5867ecd_1
readline 8.0 h7b6447c_0
requests 2.24.0 py_0
requests-oauthlib 1.3.0 py_0
rsa 4.0 py_0
scipy 1.5.0 py36h0b6359f_0
send2trash 1.5.0 py36_0
setuptools 49.2.0 py36_0
sip 4.19.8 py36hf484d3e_0
six 1.15.0 py_0
sqlite 3.32.3 h62c20be_0
tensorboard 2.2.1 pyh532a8cf_0
tensorboard 1.15.0
tensorboard-plugin-wit 1.6.0 py_0
tensorflow 1.15.2
tensorflow 2.0.0 gpu_py36h6b29c10_0
tensorflow-base 2.0.0 gpu_py36h0ec5d1f_0
tensorflow-estimator 1.15.1
tensorflow-estimator 2.0.0 pyh2649769_0
tensorflow-gpu 2.0.0 h0d30ee6_0
tensorflow-tensorboard 1.5.1
termcolor 1.1.0 py36_1
terminado 0.8.3 py36_0
testpath 0.4.4 py_0
tk 8.6.10 hbc83047_0
tornado 6.0.4 py36h7b6447c_1
traitlets 4.3.3 py36_0
typing-extensions 3.7.4.2
urllib3 1.25.9 py_0
wcwidth 0.2.5 py_0
webencodings 0.5.1 py36_1
werkzeug 0.16.1 py_0
wheel 0.34.2 py36_0
wrapt 1.12.1 py36h7b6447c_1
xz 5.2.5 h7b6447c_0
zeromq 4.3.2 he6710b0_2
zipp 3.1.0 py_0
zlib 1.2.11 h7b6447c_3

davidrpugh commented 4 years ago

What version of TensorFlow are you trying to use? What changes did you make to the environment file I sketched above? You seem to have two different versions of TensorFlow installed (2.0 and 1.15), as well as different versions of the various CUDA toolkit libraries.
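A quick way to spot this kind of conflict is to scan the `conda list` output for names that appear with more than one version. The following is only a minimal sketch: the helper name `find_duplicate_packages` is hypothetical, and it assumes the listing has one package per line with the name in the first column and the version in the second.

```python
from collections import defaultdict


def find_duplicate_packages(conda_list_text):
    """Return {package_name: sorted versions} for names listed with more than one version."""
    versions = defaultdict(set)
    for line in conda_list_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# Name  Version ..." header comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, version = fields[0].lower(), fields[1]
        versions[name].add(version)
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


if __name__ == "__main__":
    # Sample rows adapted from the listing above (one package per line).
    sample = """\
tensorflow 1.15.2
tensorflow 2.0.0 gpu_py36h6b29c10_0
cudatoolkit 10.0.130 0
cudatoolkit-dev 10.1.243 h516909a_3 conda-forge
"""
    print(find_duplicate_packages(sample))
```

Note that this only catches identical names; a mismatch such as cudatoolkit 10.0 next to cudatoolkit-dev 10.1 still has to be checked by eye.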

illumidas-agn commented 4 years ago
davidrpugh commented 4 years ago

OK. Well for sure to use TensorFlow 1.15 you will need to make changes to the environment file that I suggested above. For one, I think you will need an older version of cudatoolkit-dev and an older version of nccl. I think the following environment file should work.

name: null

channels:
  - conda-forge
  - defaults

dependencies:
  - cudatoolkit-dev=10.0
  - cudnn=7.6
  - cupti=10.0
  - cxx-compiler=1.1
  - jupyterlab=2.1
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.4
  - nodejs=13
  - pip=20.1
  - python=3.6 
  - tensorboard=1.15
  - tensorflow-gpu=1.15

I forget whether you need to install Keras separately with TensorFlow 1.15 or not.

You will also need to set the following environment variables slightly differently than what is mentioned in the user guide, given that you are using the cudatoolkit-dev approach.

$ export ENV_PREFIX=$PWD/env
$ export HOROVOD_CUDA_HOME=$ENV_PREFIX
$ export HOROVOD_NCCL_HOME=$ENV_PREFIX
$ export HOROVOD_GPU_OPERATIONS=NCCL

Next, create the Conda environment and try building Horovod using the following commands.

conda env create --prefix $ENV_PREFIX --file environment.yml --force
conda activate $ENV_PREFIX
pip install --no-cache-dir horovod==0.19.*

Don't bother explicitly setting HOROVOD_WITH_TENSORFLOW=1 and specifying horovod[tensorflow]. Just install Horovod and let Horovod determine which bindings to build (given only TF is installed it should only build TensorFlow).

illumidas-agn commented 4 years ago

It managed to install almost every package except the last one:

ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::cudatoolkit-dev-10.0-2'.
LinkError: post-link script failed for package conda-forge::cudatoolkit-dev-10.0-2
running your command again with -v will provide additional information
location of failed script: /home/cogs5/ge/go-explore-master/env/bin/.cudatoolkit-dev-post-link.sh

illumidas-agn commented 4 years ago

I ran conda list on my local machine, where go-explore works, and it appears that the CUDA toolkit isn't used? This is all very confusing...

_libgcc_mutex 0.1 main
absl-py 0.9.0 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
astor 0.8.1 pypi_0 pypi
atari-py 0.2.6 pypi_0 pypi
baselines 0.1.6 dev_0
blas 1.0 mkl
ca-certificates 2020.6.24 0
certifi 2020.6.20 py36_0
cffi 1.14.0 pypi_0 pypi
click 7.1.2 pypi_0 pypi
cloudpickle 1.2.2 pypi_0 pypi
cycler 0.10.0 py36_0
dataclasses 0.7 pypi_0 pypi
dbus 1.13.16 hb2f20db_0
decorator 4.4.2 pypi_0 pypi
expat 2.2.9 he6710b0_2
fontconfig 2.13.0 h9420a91_0
freetype 2.10.2 h5ab3b9f_0
future 0.18.2 pypi_0 pypi
gast 0.2.2 pypi_0 pypi
glib 2.65.0 h3eb4bd4_0
google-pasta 0.2.0 pypi_0 pypi
grpcio 1.30.0 pypi_0 pypi
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb31296c_0
gym 0.15.7 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
horovod 0.19.5 pypi_0 pypi
icu 58.2 he6710b0_3
imageio 2.9.0 py_0
importlib-metadata 1.7.0 pypi_0 pypi
intel-openmp 2020.1 217
joblib 0.16.0 pypi_0 pypi
jpeg 9b h024ee3a_2
keras-applications 1.0.8 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.2.0 py36hfd86e86_0
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
libuuid 1.0.3 h1bed415_2
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 he19cac6_1
loky 2.8.0 pypi_0 pypi
lz4-c 1.9.2 he6710b0_0
mako 1.1.3 pypi_0 pypi
markdown 3.2.2 pypi_0 pypi
markupsafe 1.1.1 pypi_0 pypi
matplotlib 3.2.2 0
matplotlib-base 3.2.2 py36hef1b27d_0
mkl 2020.1 217
mkl-service 2.3.0 py36he904b0f_0
mkl_fft 1.1.0 py36h23d657b_0
mkl_random 1.1.1 py36h0573a6f_0
mpi 1.0 mpich
mpi4py 3.0.3 py36h028fd6f_0
mpich 3.3.2 hc856adb_0
ncurses 6.2 he6710b0_1
numpy 1.18.5 py36ha1c710e_0
numpy-base 1.18.5 py36hde5b4d6_0
nvcc_linux-64 11.0 h4962215_6 nvidia
olefile 0.46 py36_0
opencv-python 4.3.0.36 pypi_0 pypi
openssl 1.1.1g h7b6447c_0
opt-einsum 3.3.0 pypi_0 pypi
pcre 8.44 he6710b0_0
pillow 7.2.0 py36hb39fc2d_0
pip 20.1.1 py36_1
protobuf 3.12.2 pypi_0 pypi
psutil 5.7.2 pypi_0 pypi
pycparser 2.20 pypi_0 pypi
pyglet 1.5.0 pypi_0 pypi
pyparsing 2.4.7 py_0
pyqt 5.9.2 py36h05f1152_2
python 3.6.10 h7579374_2
python-dateutil 2.8.1 py_0
pytools 2020.3.1 pypi_0 pypi
pyyaml 5.3.1 pypi_0 pypi
qt 5.9.7 h5867ecd_1
readline 8.0 h7b6447c_0
scipy 1.5.1 pypi_0 pypi
setuptools 49.2.0 py36_0
sip 4.19.8 py36hf484d3e_0
six 1.15.0 py_0
sqlite 3.32.3 h62c20be_0
tensorboard 1.15.0 pypi_0 pypi
tensorflow 1.15.0 pypi_0 pypi
tensorflow-estimator 1.15.1 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
tk 8.6.10 hbc83047_0
tornado 6.0.4 py36h7b6447c_1
tqdm 4.48.0 pypi_0 pypi
werkzeug 1.0.1 pypi_0 pypi
wheel 0.34.2 py36_0
wrapt 1.12.1 pypi_0 pypi
xz 5.2.5 h7b6447c_0
zipp 3.1.0 pypi_0 pypi
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h0b5b093_0

davidrpugh commented 4 years ago

Looks like TensorFlow is still being installed via pip from pypi and not from Conda channels as expected. Are you installing TF by including tensorflow-gpu=1.15 as a dependency in your environment file? Or are you installing TF via pip?

Getting the cudatoolkit-dev package to work properly is a bit tedious and error-prone, which is why the Conda guide recommends that people install the full CUDA Toolkit themselves and then use the nvcc_linux-64 package to configure the environment, but I guess the standard install of the CUDA Toolkit requires root.

There are several builds of cudatoolkit-dev available on conda-forge. From your error I can see that you got the second one, but you needed the third.

cudatoolkit-dev                 10.0               1  conda-forge         
cudatoolkit-dev                 10.0               2  conda-forge         
cudatoolkit-dev                 10.0          py36_0  conda-forge

Replace the cudatoolkit-dev=10.0 with cudatoolkit-dev=10.0=py36_0 to make sure that the specific build number is picked up during install. Try again and let me know if that helps.

davidrpugh commented 4 years ago

@illumidas-agn I have created (and tested) a Horovod build for TensorFlow 1.15 using cudatoolkit-dev=10.0=py36_0 with support for JupyterLab. All of the required config files can be found here. In particular see the bin/create-conda-env.sh script which automates the environment creation process. Follow the instructions carefully and let me know how you get on!

davidrpugh commented 4 years ago

@tgaddair When I built the environment for Horovod 0.19.5 with HOROVOD_GPU_OPERATIONS=NCCL set and then ran horovodrun --check-build, it appeared that NCCL support was not built; when I built the environment with HOROVOD_GPU_ALLREDUCE=NCCL and HOROVOD_GPU_BROADCAST=NCCL and then ran horovodrun --check-build, it appeared that NCCL support was built.

Hopefully NCCL support was actually built with HOROVOD_GPU_OPERATIONS=NCCL and this is just a bug in the horovodrun --check-build command.

tgaddair commented 4 years ago

Hey @davidrpugh, unfortunately HOROVOD_GPU_OPERATIONS was added recently to master and has not been released yet.

I recommend consulting the stable docs (for the latest release) when not building from source:

https://horovod.readthedocs.io/en/stable/summary_include.html#install
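Putting the comments above together, a build of a released 0.19.x version against a Conda-provided CUDA toolkit would use the older per-operation flags rather than HOROVOD_GPU_OPERATIONS. The sketch below only assembles the environment for the pip build; the helper name `horovod_build_env` is hypothetical, and the `env` subdirectory is the prefix used earlier in this thread, not a requirement.

```python
import os


def horovod_build_env(env_prefix, base_env=None):
    """Return a copy of the environment with the Horovod 0.19.x build flags set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        # Point the build at the CUDA toolkit and NCCL inside the Conda prefix.
        "HOROVOD_CUDA_HOME": env_prefix,
        "HOROVOD_NCCL_HOME": env_prefix,
        # Per-operation flags understood by released 0.19.x builds.
        "HOROVOD_GPU_ALLREDUCE": "NCCL",
        "HOROVOD_GPU_BROADCAST": "NCCL",
    })
    return env


if __name__ == "__main__":
    prefix = os.path.join(os.getcwd(), "env")  # assumption: environment lives in ./env
    build_env = horovod_build_env(prefix)
    for key in sorted(k for k in build_env if k.startswith("HOROVOD_")):
        print(f"{key}={build_env[key]}")
```

The returned dictionary can be passed as `env=` to `subprocess.run` when invoking `pip install --no-cache-dir horovod==0.19.*` from a script; running `horovodrun --check-build` afterwards shows which features were compiled in.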

illumidas-agn commented 4 years ago

Yes, I installed TensorFlow via pip in order to get the specific version.

I'm not sure how to install the specific CUDA build. Do I just run conda install cudatoolkit-dev=10.0=py36_0? Or do I need to edit the environment.yml file to include it when I build the environment?

EDIT: Got it working below

I apologize for the hassle; I'm brand new to Anaconda and the ML environment.

I'm currently following the sh script that you linked to try to create the environment. I'll let you know how that goes.

illumidas-agn commented 4 years ago

I followed the scripts in the directory you showed me and overwrote the environment.yml file with the one from that directory.

Currently it's unable to find python3.6:

ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::astor-0.8.1-pyh9f0ad1d_0'.
FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")
Attempting to roll back.

Rolling back transaction: done

FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")

davidrpugh commented 4 years ago

No worries @illumidas-agn! I am very experienced using Conda (+Pip) to build Python data science and machine learning environments and building the Horovod environment is probably the most difficult build that I have yet encountered. You are definitely jumping into the deep end!

There are three config files and you need ALL three for the environment build to work properly: environment.yml, requirements.txt, and postBuild. The bin/create-conda-env.sh script shows how these three files are connected. I would advise copying over all three config files and the environment build script into your project directory and then delete any existing environment and run the build script to re-create the environment from scratch. You should not be using conda install or pip install to install individual packages one at a time.
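Before running the build script, it can help to confirm that all three files are actually present in the project directory. This is a minimal sketch: the file names come from the comment above, while the helper name `missing_config_files` is hypothetical.

```python
import tempfile
from pathlib import Path

# The three config files named above.
REQUIRED_FILES = ("environment.yml", "requirements.txt", "postBuild")


def missing_config_files(project_dir):
    """Return the required config files that are not present in project_dir."""
    project = Path(project_dir)
    return [name for name in REQUIRED_FILES if not (project / name).is_file()]


if __name__ == "__main__":
    # Demonstration on a throwaway directory holding only environment.yml.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "environment.yml").write_text("name: null\n")
        print("missing:", missing_config_files(tmp))
```

In practice you would call `missing_config_files(".")` from the project directory and only run the build script when the result is an empty list.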

illumidas-agn commented 4 years ago

I've downloaded the files you recommended above and put them on the server. I've removed the environments (except one that I need for another project) and I'm still running into the same Python error:

(base) cogs5@sci-gpu:~/ge/go-explore-master$ ./create-conda-env.sh
Solving environment: done

Downloading and Extracting Packages
libgfortran-ng 7.5.0: ####################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: failed
ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::astor-0.8.1-pyh9f0ad1d_0'.
FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")
Attempting to roll back.

Rolling back transaction: done

FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")

davidrpugh commented 4 years ago

What version of conda are you using? If you don’t know then conda info will tell you.

Try running conda clean --all which will clean out cruft from all these failed builds and try running the build script again. There are a few more aggressive options that we can try if this doesn’t work.

https://docs.conda.io/projects/conda/en/latest/commands/clean.html

illumidas-agn commented 4 years ago

I ran conda clean and cleared out the cruft.

It's giving me the same astor error as previously posted.

conda info yields the following information:
conda version : 4.4.10
conda-build version : 3.4.1
python version : 3.6.4.final.0

davidrpugh commented 4 years ago

I should have asked this question at the very beginning. That conda version is at least two years old. You can try to update in place, but I think the best approach would be to uninstall and reinstall.
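A version check like this can be automated before attempting a build. The sketch below parses the version out of `conda info` or `conda --version` text; the helper names are hypothetical, and the 4.6 threshold is an assumption based on that being the release series that introduced `conda init`, not an official Horovod requirement.

```python
import re

# Assumption: treat anything older than the first release with `conda init` as too old.
MIN_CONDA = (4, 6, 0)


def parse_conda_version(text):
    """Extract the first X.Y.Z version number from conda's output as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_too_old(text, minimum=MIN_CONDA):
    """True if the reported conda version is below the chosen minimum."""
    return parse_conda_version(text) < minimum


if __name__ == "__main__":
    reported = "conda version : 4.4.10"
    print(parse_conda_version(reported), "too old:", is_too_old(reported))
```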

Are you using the conda as part of the Anaconda Python distribution? Or Miniconda Python distribution?

illumidas-agn commented 4 years ago

I'm not sure; this version was left on the server some time ago. I'll contact the admin again and post here once it's been updated. Then I'll retry all the steps and go from there.

illumidas-agn commented 4 years ago

I've received a response from the admin of the server stating that:

This machine is in desperate need of an upgrade, there isn't a version of CUDA 10 available for it. You might be better running it on your assigned machine instead as that is much more up to date in comparison.

There is a selection of cuda toolkit versions on sci-gpu, however these are old as no one has requested an additional version in some time.

ll /media/data/cuda

To select the different versions you need to change the following environment variables in your bashrc file.

export CUDA_HOME="/media/data/cuda/cuda-8.0-cudnn5"

export PATH="/media/data/cuda/cuda-8.0-cudnn5/bin:$PATH"

export LD_LIBRARY_PATH="/media/data/cuda/cuda-8.0-cudnn5/lib64:/usr/lib:/usr/openwin/lib:/usr/dt/lib:/X11.6/lib:/X11.5/lib:/uva/lib:/gnu/lib:/opt/libgpuarray/lib:/usr/lib64"

I'm assuming Horovod isn't able to run on CUDA this old, is it?

davidrpugh commented 4 years ago

You are already installing CUDA 10.0 via Conda so none of the above about changing environment variables is relevant. You will still need GPU drivers that are capable of supporting CUDA 10.0. What NVIDIA drivers are installed on the server?

illumidas-agn commented 4 years ago

I'm not sure which drivers are installed. What's the best way to check this? conda info?

davidrpugh commented 4 years ago

@illumidas-agn you should be able to run nvidia-smi which will generate output including the driver versions and system CUDA version.
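If the output needs to be checked from a script, the two numbers can be pulled out of the banner line. This is a minimal sketch with a hypothetical helper name; it assumes a driver recent enough to print a "CUDA Version" field in the header, and as far as I know that field reports the highest CUDA version the driver supports rather than the toolkit that is installed.

```python
import re

# Matches the banner line, e.g. "Driver Version: 418.87.01    CUDA Version: 10.1".
HEADER_PATTERN = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_nvidia_smi_header(output):
    """Return (driver_version, cuda_version) from nvidia-smi text, or None if absent."""
    match = HEADER_PATTERN.search(output)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


if __name__ == "__main__":
    banner = "| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |"
    print(parse_nvidia_smi_header(banner))
```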

illumidas-agn commented 4 years ago

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:04:00.0 Off |                    0 |
| N/A   43C    P0    59W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:05:00.0 Off |                    0 |
| N/A   35C    P0    73W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40c          Off  | 00000000:81:00.0 Off |                    0 |
| 23%   31C    P0    64W / 235W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   42C    P0    69W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:85:00.0 Off |                    0 |
| N/A   34C    P0    83W / 149W |      0MiB / 11441MiB |     49%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

davidrpugh commented 4 years ago

@illumidas-agn I don't see anything wrong with those driver versions. Seems like the default CUDA installed on the server is 10.1. Can you run the command which conda? This will give the path to the conda binary and will probably indicate whether this was installed via Anaconda or Miniconda.

You need to update your Conda to the most recent version, likely by uninstalling Anaconda/Miniconda and then reinstalling the most recent version. Once this is done, re-run the environment creation script as you were doing before and it should work.
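The same check can be done from Python. This sketch guesses the distribution from the directory names in the path to the conda binary; the helper name is hypothetical, and it relies purely on the default install directory naming (anaconda3, miniconda3), so a custom install location would come back as unknown.

```python
import shutil
from pathlib import PurePosixPath


def classify_conda_install(conda_path):
    """Guess the distribution from directory names in the path to the conda binary."""
    for part in PurePosixPath(conda_path).parts:
        lowered = part.lower()
        if lowered.startswith("miniconda"):
            return "miniconda"
        if lowered.startswith("anaconda"):
            return "anaconda"
    return "unknown"


if __name__ == "__main__":
    found = shutil.which("conda")  # same lookup as `which conda`
    if found is None:
        print("conda is not on PATH")
    else:
        print(found, "->", classify_conda_install(found))
```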

illumidas-agn commented 4 years ago

Here is the output from which conda: /opt/anaconda3/bin/conda

Sadly the admin can't update conda right now, which really hinders this. I've been running the algorithm locally up until this point regardless.

davidrpugh commented 4 years ago

Looks like you have Anaconda3 distribution installed by the sysadmin. You can install Miniconda yourself in your user home using scripts I put together in this repo.

https://github.com/kaust-rccl/ibex-miniconda-install

After installation the output of which conda should be something like ~/miniconda3/bin/conda.

illumidas-agn commented 4 years ago

Ok that worked!

/home/cogs5/miniconda3/bin/conda

What would be the next step from here?

EDIT: I ran the create-conda-env.sh script and it worked! Although now I can't find the environment: I named it 'explore', but it doesn't show up when I list the environments.

In the original script I had to change conda activate to source activate in order to activate the environment, but the wrong name comes up.

davidrpugh commented 4 years ago

The environment creation script creates the environment in a subdirectory called env of the "project directory" (i.e., the directory in which you ran the environment creation script). In the original script you had to change conda activate to source activate because you were using a very old version of Conda. Please make sure to use conda activate from now on.

illumidas-agn commented 4 years ago

Sorry for the delay in my response. I'm still getting this huge error when I run the script:

#
# To activate this environment, use
#
#     $ conda activate /home/cogs5/ge/go-explore-master/env
#
# To deactivate an active environment, use
#
#     $ conda deactivate

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

Errored, use --debug for full output:
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.

Errored, use --debug for full output:
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 1171, in _node_check
    proc = Process(['node', 'node-version-check.js'], cwd=HERE, quiet=True)
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/process.py", line 73, in __init__
    self.proc = self._create_process(cwd=cwd, env=env)
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/process.py", line 131, in _create_process
    cmd[0] = which(cmd[0], kwargs.get('env'))
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/jlpmapp.py", line 59, in which
    raise ValueError(msg)
ValueError: Please install nodejs 5+ and npm before continuing installation. nodejs may be installed using conda or directly from the nodejs website.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/bin/jupyter-lab", line 11, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/opt/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/opt/anaconda3/lib/python3.6/site-packages/notebook/notebookapp.py", line 1571, in start
    super(NotebookApp, self).start()
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyter_core/application.py", line 255, in start
    self.subapp.start()
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/labapp.py", line 64, in start
    command=command, logger=self.log)
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 239, in build
    _node_check()
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 1175, in _node_check
    raise ValueError(msg)
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.

davidrpugh commented 4 years ago

@illumidas-agn You need to configure your shell to use the conda activate command. This should have been done for you if you ran the installer script from the repo I linked above. Regardless, assuming you are using bash, you need to run the following commands:

conda init bash
source ~/.bashrc # avoids having to restart the terminal to load changes made by conda init

Once you have initialized conda, re-run the environment build script. I suspect the other errors come from the environment not being activated properly before the jupyter lab command is run in the postBuild script.
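
If you want to confirm that conda init actually changed your startup file before re-running the build, a rough check is to look for the marker comment that conda init writes into ~/.bashrc. This is only a sketch: has_conda_init_block is a made-up helper name, and it assumes the marker text is the usual "# >>> conda initialize >>>" line.

```shell
# has_conda_init_block is a hypothetical helper, not part of conda.
# It succeeds when the given startup file contains the marker line
# that `conda init` is assumed to write at the top of its managed block.
has_conda_init_block() {
    grep -qF '>>> conda initialize >>>' "$1" 2>/dev/null
}

if has_conda_init_block "$HOME/.bashrc"; then
    echo "conda init block found in ~/.bashrc"
else
    echo "no conda init block in ~/.bashrc; run 'conda init bash' first"
fi
```

Finding the block only shows that the file was edited. The current shell still needs to source the file (or be restarted) before conda activate works.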

Keep going! I think we are almost there...

illumidas-agn commented 4 years ago

Still getting the same error, I don't think that worked.

ERROR conda.core.link:_execute(481): An error occurred while installing package 'conda-forge::astor-0.8.1-pyh9f0ad1d_0'.
FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")
Attempting to roll back.

Rolling back transaction: done

FileNotFoundError(2, "No such file or directory: '/home/cogs5/ge/go-explore-master/env/bin/python3.6'")

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If your shell is Bash or a Bourne variant, enable conda for the current user with

    $ echo ". /opt/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc

or, for all users, enable conda with

    $ sudo ln -s /opt/anaconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh

The options above will permanently enable the 'conda' command, but they do NOT
put conda's base (root) environment on PATH.  To do so, run

    $ conda activate

in your terminal, or to put the base environment on PATH permanently, run

    $ echo "conda activate" >> ~/.bashrc

Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
your ~/.bashrc file.  You should manually remove the line that looks like

    export PATH="/opt/anaconda3/bin:$PATH"

^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^

Errored, use --debug for full output:
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.

Errored, use --debug for full output:
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 1171, in _node_check
    proc = Process(['node', 'node-version-check.js'], cwd=HERE, quiet=True)
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/process.py", line 73, in __init__
    self.proc = self._create_process(cwd=cwd, env=env)
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/process.py", line 131, in _create_process
    cmd[0] = which(cmd[0], kwargs.get('env'))
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/jlpmapp.py", line 59, in which
    raise ValueError(msg)
ValueError: Please install nodejs 5+ and npm before continuing installation. nodejs may be installed using conda or directly from the nodejs website.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/bin/jupyter-lab", line 11, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/opt/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/opt/anaconda3/lib/python3.6/site-packages/notebook/notebookapp.py", line 1571, in start
    super(NotebookApp, self).start()
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyter_core/application.py", line 255, in start
    self.subapp.start()
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/labapp.py", line 64, in start
    command=command, logger=self.log)
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 239, in build
    _node_check()
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterlab/commands.py", line 1175, in _node_check
    raise ValueError(msg)
ValueError: Please install nodejs 5+ and npm before continuing. nodejs may be installed using conda or directly from the nodejs website.
davidrpugh commented 4 years ago

The issues you are experiencing are caused by anaconda3 still being on your PATH somewhere. Can you please share the output of the following commands?

echo $SHELL
echo $PATH
cat ~/.bashrc
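
Since PATH can be long and hard to read on one line, it may also help to split it up and pull out just the anaconda3 entries. The snippet below is a minimal sketch: path_entries and entries_matching are helper names made up for illustration, and "anaconda3" is simply the substring to look for based on the tracebacks above.

```shell
# Hypothetical helpers for inspecting PATH; not part of conda.

# Print each entry of a colon-separated path list on its own line,
# in lookup order. Defaults to the current PATH when no argument is given.
path_entries() {
    printf '%s\n' "${1:-$PATH}" | tr ':' '\n'
}

# Print only the entries of the path list ($2) that contain the
# fixed substring ($1).
entries_matching() {
    path_entries "$2" | grep -F -- "$1"
}

entries_matching anaconda3 "$PATH" || echo "no anaconda3 entries on PATH"

# Show which conda and python the shell would resolve, if any.
command -v conda || echo "conda not found"
command -v python || echo "python not found"
```

Entries nearer the top of the list win, so an /opt/anaconda3/bin entry ahead of the miniconda3 one would explain why the jupyter-lab from /opt/anaconda3 is being picked up.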
illumidas-agn commented 4 years ago
cogs5@sci-gpu:~/ge/go-explore-master$ echo $SHELL
/bin/sh

cogs5@sci-gpu:~/ge/go-explore-master$ echo $PATH
/opt/anaconda3/bin:/home/cogs5/miniconda3/condabin:/opt/anaconda3/bin:/media/data/cuda/cuda-8.0-cudnn5/bin:/opt/torch/install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/jvm/java-8-oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/java-8-oracle/jre/bin

cogs5@sci-gpu:~/ge/go-explore-master$ cat ~/.bashrc
# ~/.bashrc: executed by bash(1) for non-login shells.
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples

# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac

# don't put duplicate lines or lines starting with space in the history.
# See bash(1) for more options
HISTCONTROL=ignoreboth

# append to the history file, don't overwrite it
shopt -s histappend

# for setting history length see HISTSIZE and HISTFILESIZE in bash(1)
HISTSIZE=1000
HISTFILESIZE=2000

# check the window size after each command and, if necessary,
# update the values of LINES and COLUMNS.
shopt -s checkwinsize

# If set, the pattern "**" used in a pathname expansion context will
# match all files and zero or more directories and subdirectories.
#shopt -s globstar

# make less more friendly for non-text input files, see lesspipe(1)
[ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"

# set variable identifying the chroot you work in (used in the prompt below)
if [ -z "${debian_chroot:-}" ] && [ -r /etc/debian_chroot ]; then
    debian_chroot=$(cat /etc/debian_chroot)
fi

# set a fancy prompt (non-color, unless we know we "want" color)
case "$TERM" in
    xterm-color) color_prompt=yes;;
esac

# uncomment for a colored prompt, if the terminal has the capability; turned
# off by default to not distract the user: the focus in a terminal window
# should be on the output of commands, not on the prompt
#force_color_prompt=yes

if [ -n "$force_color_prompt" ]; then
    if [ -x /usr/bin/tput ] && tput setaf 1 >&/dev/null; then
    # We have color support; assume it's compliant with Ecma-48
    # (ISO/IEC-6429). (Lack of such support is extremely rare, and such
    # a case would tend to support setf rather than setaf.)
    color_prompt=yes
    else
    color_prompt=
    fi
fi

if [ "$color_prompt" = yes ]; then
    PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
else
    PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
fi
unset color_prompt force_color_prompt

# If this is an xterm set the title to user@host:dir
case "$TERM" in
xterm*|rxvt*)
    PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u@\h: \w\a\]$PS1"
    ;;
*)
    ;;
esac

# enable color support of ls and also add handy aliases
if [ -x /usr/bin/dircolors ]; then
    test -r ~/.dircolors && eval "$(dircolors -b ~/.dircolors)" || eval "$(dircolors -b)"
    alias ls='ls --color=auto'
    #alias dir='dir --color=auto'
    #alias vdir='vdir --color=auto'

    alias grep='grep --color=auto'
    alias fgrep='fgrep --color=auto'
    alias egrep='egrep --color=auto'
fi

# some more ls aliases
alias ll='ls -alF'
alias la='ls -A'
alias l='ls -CF'

# Add an "alert" alias for long running commands.  Use like so:
#   sleep 10; alert
alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'

# Alias definitions.
# You may want to put all your additions into a separate file like
# ~/.bash_aliases, instead of adding them here directly.
# See /usr/share/doc/bash-doc/examples in the bash-doc package.

if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi

# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
if ! shopt -oq posix; then
  if [ -f /usr/share/bash-completion/bash_completion ]; then
    . /usr/share/bash-completion/bash_completion
  elif [ -f /etc/bash_completion ]; then
    . /etc/bash_completion
  fi
fi

export PATH="/opt/anaconda3/bin:$PATH"

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/cogs5/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/cogs5/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/cogs5/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/cogs5/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<