opensciencegrid / osgvo-tensorflow-gpu

OSGVO's TensorFlow image, GPU flavor

Change the starting Docker image since TensorFlow requires cuDNN v6.0 #1

Closed efajardo closed 6 years ago

efajardo commented 6 years ago

The TensorFlow documentation lists cuDNN v6.0 as a requirement.

Otherwise, you get the error below (this was discussed in this issue):

cuser1@cgpu-1:~$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 52, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 41, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory
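
The failure above comes from the dynamic loader, not from Python itself, so it can be checked without importing TensorFlow at all. A minimal sketch, assuming only the standard-library `ctypes` module; the helper name `shared_library_loads` is hypothetical and not part of the image:

```python
import ctypes


def shared_library_loads(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be opened.
        return False
    return True


if __name__ == "__main__":
    # libcudnn.so.6 is the soname reported missing in the traceback above.
    print("libcudnn.so.6 loadable:", shared_library_loads("libcudnn.so.6"))
```

Running this inside the container should report whether the cuDNN 6 library is visible on the loader's search path before any TensorFlow import is attempted.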

@djw8605 can you please review and merge?

djw8605 commented 6 years ago

This is a very fragile Singularity image. I'm scared to change much in it; it was difficult to get working in the first place.

I would feel much more comfortable if you were able to test this on an OSG GPU entry point.

efajardo commented 6 years ago

@djw8605 right now the image is busted and works for no one, so this PR won't leave you in a worse situation than the one we are in now. You can try it yourself:

singularity shell --bind /usr/lib64/nvidia:/host-libs /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:latest/
Singularity: Invoking an interactive shell within container...

cuser1@cgpu-1:~$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 52, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 41, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory

But I tried my image directly on a GPU node at UCSD and it worked with a simple TensorFlow matrix multiplication:

singularity exec docker://efajardo/osgvo-tensorflow-gpu python tf_matmul.py
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-05 21:07:04.461815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1 2 3 
2017-10-05 21:07:04.461838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y Y Y Y 
2017-10-05 21:07:04.461847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1:   Y Y Y Y 
2017-10-05 21:07:04.461855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 2:   Y Y Y Y 
2017-10-05 21:07:04.461862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 3:   Y Y Y Y 
2017-10-05 21:07:04.461878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0)
2017-10-05 21:07:04.461894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:85:00.0)
2017-10-05 21:07:04.461903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:88:00.0)
2017-10-05 21:07:04.461911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:89:00.0)
result of matrix multiplication
===============================
[[  1.00000000e+00   0.00000000e+00]
 [ -4.76837158e-07   1.00000024e+00]]
===============================

efajardo commented 6 years ago

@djw8605, I need some help. Something goes wrong in the conversion from Docker to Singularity.

The Docker container works fine:

singularity exec docker://opensciencegrid/tensorflow-gpu python tf_matmul.py
2017-10-05 23:12:19.590915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:84:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-05 23:12:19.883190: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x1b4e890 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-05 23:12:19.884666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties: 
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:85:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-05 23:12:20.179617: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x2ef3840 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-05 23:12:20.180529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 2 with properties: 
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:88:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-05 23:12:20.481700: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x2ef78a0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-05 23:12:20.482656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 3 with properties: 
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:89:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-05 23:12:20.487009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1 2 3 
2017-10-05 23:12:20.487034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y Y Y Y 
2017-10-05 23:12:20.487042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1:   Y Y Y Y 
2017-10-05 23:12:20.487049: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 2:   Y Y Y Y 
2017-10-05 23:12:20.487057: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 3:   Y Y Y Y 
2017-10-05 23:12:20.487092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0)
2017-10-05 23:12:20.487105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:85:00.0)
2017-10-05 23:12:20.487184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:88:00.0)
2017-10-05 23:12:20.487192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:89:00.0)
result of matrix multiplication
===============================
[[  1.00000000e+00   0.00000000e+00]
 [ -4.76837158e-07   1.00000024e+00]]
===============================

But the Singularity image does not:

singularity exec /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu\:latest/ python tf_matmul.py
Traceback (most recent call last):
  File "tf_matmul.py", line 3, in <module>
    import tensorflow as tf
ImportError: No module named tensorflow

Although technically this should not be a problem, should it?

djw8605 commented 6 years ago

It looks like TensorFlow is installed as a python3 package, so you have to replace your call to python with python3.

Don't know why it changed. Possibly it changed upstream?
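
One way to confirm which interpreter can see the package is to ask each one to import it in a subprocess. A minimal sketch, assuming the interpreters are named `python` and `python3` on the container's PATH; the helper name `can_import` is hypothetical:

```python
import subprocess
import sys


def can_import(interpreter, module):
    """Return True if `interpreter -c "import <module>"` exits with status 0."""
    try:
        result = subprocess.run(
            [interpreter, "-c", "import " + module],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The interpreter itself could not be started (e.g. not on PATH).
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Also check the interpreter running this script, for comparison.
    for interpreter in ("python", "python3", sys.executable):
        print(interpreter, "->", can_import(interpreter, "tensorflow"))
```

If only the `python3` line reports True, the script should be invoked as `python3 tf_matmul.py` rather than `python tf_matmul.py`.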

efajardo commented 6 years ago

Ahh I see. Thanks a lot. :satisfied: :clap: