agniszczotka opened this issue 5 years ago
Anaconda worked with previous TensorFlow versions, but something seems to have gone wrong with v1.12.
My TensorFlow installation (done on the login node):
module load cuda/9.0
module load python3/anaconda
conda create -n mytensorflow python=3.6
source activate mytensorflow
pip install tensorflow-gpu
The sbatch script:
#!/bin/bash
# set the number of nodes
#SBATCH --nodes=1
# set number of GPUs
#SBATCH --gres=gpu:1
# Select a partition
#SBATCH --partition=devel
module load cuda/9.0
module load python3/anaconda
source activate mytensorflow
python testtf.py
The testtf.py, just a very basic TF test:
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
a = tf.constant(10)
b = tf.constant(32)
print(sess.run(a + b))
I'm getting this error:
CUDA-9.0 loaded
Python anaconda is now loaded in your environment.
Traceback (most recent call last):
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: /jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so: symbol cublasSetMathMode, version libcublas.so.9.0 not defined in file libcublas.so.9.0 with link time reference
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "hell.py", line 1, in <module>
import tensorflow as tf
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "/jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: /jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so: symbol cublasSetMathMode, version libcublas.so.9.0 not defined in file libcublas.so.9.0 with link time reference
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/errors
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
It turns out that the code example works when module load cuda/9.0 is not used. Can we look into why this is? @LiamATOS
You should not run module load cuda/9.0 together with Anaconda. It causes a conflict because Anaconda provides its own CUDA libraries internally.
Can you add the module libs/cudnn/7.3.1.20/binary-cuda-9.0.176?
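One plausible mechanism for this conflict, assuming the cuda/9.0 module prepends its lib directory to LD_LIBRARY_PATH (typical for environment modules, but not confirmed in this thread), is that the module's libcublas.so.9.0 shadows another copy that TensorFlow can actually use. The sketch below lists, in search order, every LD_LIBRARY_PATH directory that contains a given library file. Among those directories the first hit is the one the dynamic loader takes, although an RPATH embedded in the requesting library can take precedence, and copies found outside LD_LIBRARY_PATH are not reported.

```python
import os
from typing import List, Optional


def find_on_library_path(libname: str, ld_library_path: Optional[str] = None) -> List[str]:
    """Return full paths to `libname` in every LD_LIBRARY_PATH directory, in search order."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in ld_library_path.split(os.pathsep):
        # Empty entries are skipped in this sketch.
        if not directory:
            continue
        candidate = os.path.join(directory, libname)
        if os.path.exists(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    matches = find_on_library_path("libcublas.so.9.0")
    if matches:
        print("First libcublas.so.9.0 on LD_LIBRARY_PATH:", matches[0])
        for extra in matches[1:]:
            print("  also present (shadowed):", extra)
    else:
        print("libcublas.so.9.0 not found on LD_LIBRARY_PATH")
```

Running it once with and once without module load cuda/9.0 in the job script would show whether the module changes which copy comes first.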
Regarding the observation that the example works without module load cuda/9.0: it does not work when you use convolutions, which run on cuDNN > 7.1.
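Since convolutions reportedly need cuDNN newer than 7.1, it helps to check which cuDNN a given module or environment actually provides. A minimal sketch, assuming a cuDNN 7.x-style cudnn.h in which CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are #defined directly (later releases moved them to cudnn_version.h); the header path is up to you, since the thread does not say where the libs/cudnn module installs its include files.

```python
import re
import sys
from typing import Optional, Tuple

# cuDNN 7.x headers define these three macros in cudnn.h.
_MACROS = ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL")


def parse_cudnn_version(header_text: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from the #define lines of a cudnn.h header."""
    numbers = []
    for macro in _MACROS:
        match = re.search(r"#define\s+" + macro + r"\s+(\d+)", header_text)
        if match is None:
            return None  # Macro not found: not a cuDNN 7.x-style header.
        numbers.append(int(match.group(1)))
    return numbers[0], numbers[1], numbers[2]


def newer_than(version: Tuple[int, int, int], minimum: Tuple[int, int]) -> bool:
    """Compare (major, minor) numerically, so 7.10 ranks above 7.2, unlike a string compare."""
    return version[:2] > minimum


if __name__ == "__main__":
    # Usage: python check_cudnn.py /path/to/include/cudnn.h (path depends on your module)
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as header:
            version = parse_cudnn_version(header.read())
        print("cuDNN version:", version)
        if version is not None:
            print("newer than 7.1:", newer_than(version, (7, 1)))
```

The comparison treats "> 7.1" as major.minor strictly above 7.1, so any 7.1.x patch release counts as too old.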
I also noticed in my application that, when I use my own pip virtual env together with the cuda/9.0 module, importing tensorflow fails with:
ImportError: /jmain01/home/JAD009/txk06/txk31-txk06/.conda/envs/testten/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so: symbol cublasSetMathMode, version libcublas.so.9.0 not defined in file libcublas.so.9.0 with link time reference
I noticed that on my own computer I use a slightly different CUDA 9.0 version: 9.0.176 rather than 9.0.69. Maybe that's why it breaks with TensorFlow 1.8 as well as 1.12?
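The ImportError names one specific symbol, cublasSetMathMode, so one way to compare the two CUDA 9.0 builds is to ask each libcublas whether it exports that symbol at all. This ctypes sketch only performs an unversioned dlsym() lookup; it cannot see the symbol-version tag ("version libcublas.so.9.0") that the loader complains about, so a True result does not prove the library will satisfy TensorFlow. The library path is whatever you pass in; none is assumed from the cluster.

```python
import ctypes
import sys
from typing import Optional


def exports_symbol(library: str, symbol: str) -> Optional[bool]:
    """True/False if `library` loads and does/does not export `symbol`; None if it cannot be loaded."""
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        # Missing file, wrong architecture, or unresolved dependencies.
        return None
    # Attribute access does a dlsym() lookup; an undefined symbol raises AttributeError.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Pass the full path of the libcublas to inspect (e.g. the one under the
    # cuda/9.0 module); with no argument the normal loader search is used.
    path = sys.argv[1] if len(sys.argv) > 1 else "libcublas.so.9.0"
    print(path, "exports cublasSetMathMode:", exports_symbol(path, "cublasSetMathMode"))
```

To see the version tags themselves, objdump -T on each library lists its dynamic symbols together with their version names.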
I am having the same problem. Everything was working a few days ago; however, now I am unable to run my jobs using the same code.
Loading my usual environment and modules:
module load python3/anaconda
source activate testcon1
module load keras/2.1.4
gives me the following error:
WARNING: python3/3.6.3 cannot be loaded due to a conflict.
HINT: Might try "module unload python3" first.
GCC 5.5.0 environment now loaded
CUDA-8.0 loaded
So I unloaded python3 and loaded the Keras module, which also loads CUDA and TensorFlow:
module unload python3
module load keras/2.1.4
Utility programs for GCC loaded
readline, ncurses, mercurial, Tcl-Tk, Xvfb, X11 libs, etc.
Python 3.6.3 is now loaded in your environment.
GCC 5.5.0 environment now loaded
CUDA-8.0 loaded
Keras-2.1.4, Tensorflow-1.4.1 with Python3 and CUDA loaded.
Check your $HOME/.tensorflowrc file is OK.
I then tried running my code and the test code above, and got the following error:
File "test.py", line 1, in <module>
import tensorflow as tf
File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import *
File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 72, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/jmain01/apps/python3/tensorflow/1.4.1/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/jmain01/apps/python3/3.6.3/lib/python3.6/imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "/jmain01/apps/python3/3.6.3/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/install_sources#common_installation_problems
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
I tried various things, including creating a new environment and reinstalling tensorflow-gpu, but I get the same problem.
Any ideas on how I could solve this?
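libcuda.so.1 is the CUDA driver library: it is installed with the NVIDIA driver, not by a CUDA toolkit module or by pip. This error therefore often means the code ran on a host without the driver, such as a login node, rather than on a GPU node; that diagnosis is an assumption, not something confirmed in this thread. A minimal check to run at the start of a job, before importing TensorFlow:

```python
import ctypes
import socket


def driver_library_loadable(name: str = "libcuda.so.1") -> bool:
    """Return True if the NVIDIA driver library can be dlopen()ed on this host."""
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    host = socket.gethostname()
    if driver_library_loadable():
        print(f"{host}: libcuda.so.1 found; the NVIDIA driver is visible")
    else:
        print(f"{host}: libcuda.so.1 not loadable; is this a GPU node with the driver installed?")
```

Printing the hostname alongside the result makes it easy to tell from the job log whether the script ran on the node you expected.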
If you're choosing to use Anaconda, I'd recommend creating a virtual environment and installing your own versions of TensorFlow and Keras with pip install or conda install. Don't load the keras or tensorflow modules if you plan to do this.
The keras module loads its own Python environment, which is different from, and conflicts with, Anaconda.
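Following that advice, a quick way to confirm that tensorflow comes from your own environment rather than from a module (in the traceback above, the module's copy lives under /jmain01/apps/python3/tensorflow/1.4.1) is to compare where the package would be imported from with sys.prefix. importlib.util.find_spec locates a top-level package without executing it, so this check still works when importing TensorFlow itself fails. Treating "inside sys.prefix" as "from my environment" is an assumption of this sketch.

```python
import importlib.util
import os
import sys
from typing import Optional


def package_origin(name: str) -> Optional[str]:
    """Path that `import name` would load from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


def is_under(path: str, prefix: str) -> bool:
    """True if `path` lies inside directory `prefix` (after resolving symlinks)."""
    path = os.path.realpath(path)
    prefix = os.path.realpath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


if __name__ == "__main__":
    print("python:", sys.executable)
    origin = package_origin("tensorflow")
    if origin is None:
        print("tensorflow is not installed in this environment")
    elif is_under(origin, sys.prefix):
        print("tensorflow comes from this environment:", origin)
    else:
        print("tensorflow comes from OUTSIDE", sys.prefix, "->", origin)
```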
Hi all, I am also facing the same error: I can't link the cuda/9.0 module and am therefore unable to use TensorFlow 1.12.0. I can confirm that TensorFlow 2.1.0 works with cuda/10.1. I would appreciate any pointers.
File "tftest.py", line 1, in
Hi @JP-MRPhys, would you be able to share your bash script?
How to configure my own virtual environment with tensorflow-gpu to run batch jobs on Jade?
I have created my conda environment and installed tensorflow-gpu in it. How can I ensure that a submitted job runs with my virtual environment? How do I configure the CUDA paths when using my own virtual environment? My project requirements are:
What is the best way to set up my working environment on the JADE infrastructure?
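On ensuring that a batch job really runs inside your own environment, one option is a fail-fast guard at the top of the job's Python script, so the job log shows immediately if source activate did not take effect. This sketch assumes the environment's directory name matches the conda env name (true for envs created with conda create -n); CONDA_DEFAULT_ENV is only printed for information, not relied on.

```python
import os
import sys


def check_environment(expected_env: str) -> None:
    """Exit with a clear message unless this interpreter belongs to `expected_env`."""
    env_dir = os.path.basename(os.path.normpath(sys.prefix))
    active = os.environ.get("CONDA_DEFAULT_ENV", "")
    if env_dir != expected_env:
        sys.exit(
            f"Expected conda env '{expected_env}', but python is {sys.executable} "
            f"(sys.prefix={sys.prefix}, CONDA_DEFAULT_ENV={active or 'unset'})"
        )
    print(f"Running in '{expected_env}': {sys.executable}")
```

For example, calling check_environment("mytensorflow") as the first line of testtf.py, before import tensorflow, turns a wrong interpreter into a readable message instead of a shared-library error.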