efajardo closed 6 years ago
This is a very fragile Singularity image. I'm scared to change much in it; it was difficult to get working in the first place.
I would feel much more comfortable if you were able to test this on an OSG GPU entry point.
@djw8605 right now the image is busted and it works for no one, so this PR won't leave you in a worse situation than the one we are in now. You can try it yourself:
singularity shell --bind /usr/lib64/nvidia:/host-libs /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:latest/
Singularity: Invoking an interactive shell within container...
cuser1@cgpu-1:~$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import *
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 52, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 41, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory
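The traceback above bottoms out in the dynamic loader failing to open libcudnn.so.6. A minimal sketch for confirming that from inside the container without importing tensorflow at all, using only the standard library; the helper name can_load is made up for illustration and is not part of the image or of TensorFlow:

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open library_name."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be opened.
        return False
    return True


if __name__ == "__main__":
    # libcudnn.so.6 is the name taken from the ImportError above.
    name = "libcudnn.so.6"
    print(name, "loadable" if can_load(name) else "NOT loadable")
```

If this prints "NOT loadable" inside the container, the library is not on the loader's search path there, which would match the ImportError.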
But I tried it directly on a GPU node at UCSD and it worked with a simple TensorFlow matrix multiplication:
singularity exec docker://efajardo/osgvo-tensorflow-gpu python tf_matmul.py
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-05 21:07:04.461815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1 2 3
2017-10-05 21:07:04.461838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y Y Y Y
2017-10-05 21:07:04.461847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1: Y Y Y Y
2017-10-05 21:07:04.461855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 2: Y Y Y Y
2017-10-05 21:07:04.461862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 3: Y Y Y Y
2017-10-05 21:07:04.461878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0)
2017-10-05 21:07:04.461894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:85:00.0)
2017-10-05 21:07:04.461903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:88:00.0)
2017-10-05 21:07:04.461911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:89:00.0)
result of matrix multiplication
===============================
[[ 1.00000000e+00 0.00000000e+00]
[ -4.76837158e-07 1.00000024e+00]]
===============================
@djw8605, I need some help. Something goes wrong in the conversion from Docker to Singularity.
The Docker container works fine:
singularity exec docker://opensciencegrid/tensorflow-gpu python tf_matmul.py
2017-10-05 23:12:19.590915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:84:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-05 23:12:19.883190: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x1b4e890 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-05 23:12:19.884666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:85:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-05 23:12:20.179617: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x2ef3840 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-05 23:12:20.180529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 2 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:88:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-05 23:12:20.481700: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x2ef78a0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-05 23:12:20.482656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 3 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:89:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-05 23:12:20.487009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1 2 3
2017-10-05 23:12:20.487034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y Y Y Y
2017-10-05 23:12:20.487042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1: Y Y Y Y
2017-10-05 23:12:20.487049: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 2: Y Y Y Y
2017-10-05 23:12:20.487057: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 3: Y Y Y Y
2017-10-05 23:12:20.487092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0)
2017-10-05 23:12:20.487105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:85:00.0)
2017-10-05 23:12:20.487184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:88:00.0)
2017-10-05 23:12:20.487192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:89:00.0)
result of matrix multiplication
===============================
[[ 1.00000000e+00 0.00000000e+00]
[ -4.76837158e-07 1.00000024e+00]]
===============================
But the singularity image does not:
singularity exec /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu\:latest/ python tf_matmul.py
Traceback (most recent call last):
File "tf_matmul.py", line 3, in <module>
import tensorflow as tf
ImportError: No module named tensorflow
Although technically this should not be a problem, should it?
It looks like tensorflow is installed as a python3 package, so you have to replace your call to python with python3.
Don't know why it changed. Possibly it changed upstream?
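A rough way to see which interpreter in the image actually has the package, sketched with the standard library only. The helper name can_import is hypothetical, and the interpreter names python and python3 are simply the two mentioned in this thread:

```python
import shutil
import subprocess
import sys


def can_import(interpreter, module):
    """Return True if `interpreter -c "import module"` exits successfully."""
    path = shutil.which(interpreter)
    if path is None:
        # Interpreter is not on PATH (or is not executable).
        return False
    result = subprocess.run(
        [path, "-c", "import " + module],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Check both names from this thread, plus whichever interpreter runs this script.
    for interp in ("python", "python3", sys.executable):
        print(interp, "->", can_import(interp, "tensorflow"))
```

Run inside the container (for example via singularity exec), this would show at a glance whether only python3 can import tensorflow.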
Ahh I see. Thanks a lot. :satisfied: :clap:
The TensorFlow documentation requires cuDNN v6.0:
Otherwise, you get this error. This was discussed in this issue
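To see which cuDNN versions are actually visible in an image, something like the following sketch could be run inside it. The helper name find_libraries is made up for illustration; /host-libs is the bind target used in the singularity shell command in this thread, and the other directories are taken from LD_LIBRARY_PATH, which is an assumption about how the image locates its libraries:

```python
import glob
import os


def find_libraries(pattern, directories):
    """Return paths matching pattern in each directory, in directory order."""
    matches = []
    for directory in directories:
        matches.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return matches


if __name__ == "__main__":
    # Assumed search locations: the bind target from this thread plus LD_LIBRARY_PATH.
    env_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    for path in find_libraries("libcudnn.so*", ["/host-libs"] + env_dirs):
        print(path)
```

No output would mean no cuDNN library is present in those directories; a libcudnn.so.5 without a libcudnn.so.6 would be consistent with the error.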
@djw8605 can you please review and merge?