jade-hpc-gpu / jade-hpc-gpu.github.io

Joint Academic Data Science Endeavour (JADE) is the largest GPU facility in the UK supporting world-leading research in machine learning (and this is the repo that powers its website)
http://www.jade.ac.uk/

how to use virtual env with tensorflow-gpu for batch job #83

Open agniszczotka opened 5 years ago

agniszczotka commented 5 years ago

Software Request

How do I configure my own virtual environment with tensorflow-gpu to run batch jobs on JADE?

I have created my conda environment and installed tensorflow-gpu in the environment. How can I ensure that a submitted job runs with my virtual environment? How do I configure the CUDA paths when using my own virtual environment? My project requirements are:

What is the best way to set up my working environment on the JADE infrastructure?
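One way to check that a submitted job really runs inside the intended environment is to have the job print which interpreter and conda environment it sees. The following is a minimal sketch, not a JADE-specific tool: the helper name `environment_report` is hypothetical, and it assumes that activating the conda environment sets `CONDA_DEFAULT_ENV` and `CONDA_PREFIX` and that library paths are exposed through `LD_LIBRARY_PATH`.

```python
import os
import sys


def environment_report(environ=None):
    """Collect the facts that show which Python environment a job is using."""
    environ = os.environ if environ is None else environ
    library_path = environ.get("LD_LIBRARY_PATH", "")
    return {
        # Interpreter actually running this script
        "executable": sys.executable,
        # Root of the active Python installation or environment
        "prefix": sys.prefix,
        # Set by conda on activation (assumption: the env was activated in the job)
        "conda_env": environ.get("CONDA_DEFAULT_ENV", "<not set>"),
        "conda_prefix": environ.get("CONDA_PREFIX", "<not set>"),
        # Directories the dynamic loader searches first for shared libraries
        "ld_library_path": [p for p in library_path.split(os.pathsep) if p],
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this as the first step of a batch job and comparing the output with the same script run in an interactive shell should show whether the environment was activated in the job.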

twinkarma commented 5 years ago

Anaconda was working with previous TF versions, but something seems to have gone wrong with v1.12.

My TF installation (installed on the login node):

module load cuda/9.0
module load python3/anaconda

conda create -n mytensorflow python=3.6
source activate mytensorflow
pip install tensorflow-gpu

The sbatch script:

#!/bin/bash

# set the number of nodes
#SBATCH --nodes=1

# set number of GPUs
#SBATCH --gres=gpu:1

# select a partition
#SBATCH --partition=devel

module load cuda/9.0
module load python3/anaconda

source activate mytensorflow
python testtf.py

The testtf.py, just a very basic tf test:

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
a = tf.constant(10)
b = tf.constant(32)
print(sess.run(a + b))

I'm getting this error:

    CUDA-9.0 loaded

    Python anaconda is now loaded in your environment.

Traceback (most recent call last):
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: /jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so: symbol cublasSetMathMode, version libcublas.so.9.0 not defined in file libcublas.so.9.0 with link time reference

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "hell.py", line 1, in <module>
    import tensorflow as tf
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: /jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so: symbol cublasSetMathMode, version libcublas.so.9.0 not defined in file libcublas.so.9.0 with link time reference

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

twinkarma commented 5 years ago

It turns out that when not using module load cuda/9.0, the code example works. Can we look into why this is? @LiamATOS
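The ImportError above says that the libcublas.so.9.0 found by the dynamic loader does not define cublasSetMathMode. A minimal sketch for checking this directly, with and without the cuda/9.0 module loaded, is shown here. The library and symbol names are taken from the error message; the helper `exports_symbol` is hypothetical, and the check only tells you what the loader resolves in the current shell or job.

```python
import ctypes


def exports_symbol(library, symbol):
    """Return True or False depending on whether `library` exports `symbol`.

    Returns None if the dynamic loader cannot open the library at all.
    """
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        return None
    # ctypes raises AttributeError for a missing symbol, which hasattr turns into False
    return hasattr(handle, symbol)


if __name__ == "__main__":
    print(exports_symbol("libcublas.so.9.0", "cublasSetMathMode"))
```

A result of False would match the error in the traceback, while None would mean that no libcublas.so.9.0 is visible to the loader in that environment.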

agniszczotka commented 5 years ago

CUDA 9.2 requires new NVIDIA drivers (version 390):
https://www.pugetsystems.com/labs/hpc/Install-TensorFlow-with-GPU-Support-the-Easy-Way-on-Ubuntu-18-04-without-installing-CUDA-1170/
https://www.pugetsystems.com/labs/hpc/How-to-install-CUDA-9-2-on-Ubuntu-18-04-1184/

agniszczotka commented 5 years ago

Do not run module load cuda/9.0 when using Anaconda. It causes a conflict because Anaconda has its own copy of CUDA internally.

agniszczotka commented 5 years ago

Can you add the module libs/cudnn/7.3.1.20/binary-cuda-9.0.176?

Quoting @twinkarma: "It turns out that when not using module load cuda/9.0, the code example works. Can we look into why this is? @LiamATOS"

It does not work when you use convolutions, which run cuDNN > 7.1.

f90 commented 5 years ago

I also noticed in my application that when I use my own pip virtual env together with the cuda/9.0 module, importing tensorflow gives:

ImportError: /jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so: symbol cublasSetMathMode, version libcublas.so.9.0 not defined in file libcublas.so.9.0 with link time reference

I noticed that on my own computer I use a slightly different CUDA 9.0 version, not 9.0.69 but rather 9.0.176. Maybe that's why it breaks with TensorFlow 1.8 as well as 1.12?
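To see which patch release of libcublas a given environment would pick up, one option is to list the copies found along LD_LIBRARY_PATH and resolve their symlinks. This is a minimal sketch with a hypothetical helper `find_library_copies`. It assumes the CUDA module exposes its libraries through LD_LIBRARY_PATH, which is only one of the places the dynamic loader searches.

```python
import glob
import os


def find_library_copies(pattern, search_path):
    """Return (found_file, resolved_target) pairs for files matching `pattern`.

    Directories are scanned in the order they appear in `search_path`,
    which mirrors the order the dynamic loader walks LD_LIBRARY_PATH.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        for match in sorted(glob.glob(os.path.join(directory, pattern))):
            # realpath follows symlinks such as libcublas.so.9.0 -> libcublas.so.9.0.176
            hits.append((match, os.path.realpath(match)))
    return hits


if __name__ == "__main__":
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for found, target in find_library_copies("libcublas.so*", search_path):
        print(f"{found} -> {target}")
```

The resolved target name usually carries the full version number, so a 9.0.69 versus 9.0.176 mismatch would be visible in the output.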

llyhec commented 5 years ago

I am having the same problem. Everything was working a few days ago, however now I am unable to run my jobs using the same code.

Loading my usual environment and modules:

module load python3/anaconda
source activate testcon1
module load keras/2.1.4

gives me the following error:

WARNING: python3/3.6.3 cannot be loaded due to a conflict.
HINT: Might try "module unload python3" first.
        GCC 5.5.0 environment now loaded
        CUDA-8.0 loaded

So, I unloaded python3, and loaded Keras which loads CUDA and tensorflow:

module unload python3
module load keras/2.1.4

   Utility programs for GCC loaded
        readline, ncurses, mercurial, Tcl-Tk, Xvfb, X11 libs, etc.

        Python 3.6.3 is now loaded in your environment.

        GCC 5.5.0 environment now loaded

        CUDA-8.0 loaded

        Keras-2.1.4, Tensorflow-1.4.1 with Python3 and CUDA loaded.
        Check your $HOME/.tensorflowrc file is OK.

and tried running my code and the above test code. I get the following error:

 File "test.py", line 1, in <module>
    import tensorflow as tf
  File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import *
  File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 72, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/jmain01/apps/python3/3.6.3/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/jmain01/apps/python3/3.6.3/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/install_sources#common_installation_problems

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

I tried various things including creating a new environment and reinstalled tensorflow-gpu, but I get the same problem.

Any ideas how I could solve this please?
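Note that libcuda.so.1 is installed by the NVIDIA driver rather than by the CUDA toolkit, so this error differs from the libcublas one earlier in the thread. One possible cause, not confirmed in this thread, is importing tensorflow on a node where the driver library is not visible. A minimal sketch for testing that on the current node follows; the helper name `driver_library_available` is hypothetical.

```python
import ctypes


def driver_library_available(name="libcuda.so.1"):
    """Return True if the dynamic loader can open the NVIDIA driver library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Same condition that makes the tensorflow import fail
        return False
    return True


if __name__ == "__main__":
    print("libcuda.so.1 loadable:", driver_library_available())
```

Running it both on the login node and inside a batch job would show whether the failure depends on where the code runs.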

twinkarma commented 5 years ago

If you're choosing to use anaconda, I'd recommend you create a virtual environment and install your own version of tensorflow and keras through pip install or conda install. Don't load the keras or tensorflow modules if you're planning to do this.

The keras module will load its own python environment that's different and conflicts with anaconda.
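A quick way to spot this kind of conflict is to check where tensorflow and keras would be imported from, without importing them. The sketch below uses hypothetical helper names and assumes the conda environment is the active interpreter, so that `sys.prefix` points at it.

```python
import importlib.util
import os
import sys


def package_origin(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def is_inside(path, prefix):
    """Return True if `path` lies under the directory `prefix`."""
    path, prefix = os.path.realpath(path), os.path.realpath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


if __name__ == "__main__":
    for name in ("tensorflow", "keras"):
        origin = package_origin(name)
        if not origin:
            print(f"{name}: not importable")
        else:
            print(f"{name}: {origin} (inside env: {is_inside(origin, sys.prefix)})")
```

If a package resolves to a path such as /jmain01/apps/... instead of the environment under ~/.conda/envs, a system module is shadowing the version installed in the environment.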

JP-MRPhys commented 4 years ago

Hi all, I am also facing the same error: I can't link the cuda/9.0 module and am therefore unable to use tensorflow version 1.12.0. I can confirm that this works for tensorflow version 2.1.0 using cuda/10.1. I would appreciate any pointers.

  File "tftest.py", line 1, in <module>
    import tensorflow as tf
  File "/jmain01/home/JAD029/txl04/jxp48-txl04/.conda/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/jmain01/home/JAD029/txl04/jxp48-txl04/.conda/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/jmain01/home/JAD029/txl04/jxp48-txl04/.conda/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/jmain01/home/JAD029/txl04/jxp48-txl04/.conda/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/jmain01/home/JAD029/txl04/jxp48-txl04/.conda/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/jmain01/home/JAD029/txl04/jxp48-txl04/.conda/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/jmain01/home/JAD029/txl04/jxp48-txl04/.conda/envs/tensorflow-gpu/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/jmain01/home/JAD029/txl04/jxp48-txl04/.conda/envs/tensorflow-gpu/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

twinkarma commented 4 years ago

Hi @JP-MRPhys, would you be able to share your bash script?