Cannot run TensorFlow on GPU - RHEL 6

Ettrai commented 7 years ago

Hello, I have been trying for days to take advantage of the GPU in one of the machines I have access to. Since I have no root access, I had to compile everything from source. I tried both the latest stable release and the current master branch, but I had no luck running TensorFlow on the GPU.

My setup is the following:

Red Hat EL 6.8 (no root access)
Python 2.7.8
virtualenv 13.1.0
devtoolset-4 (GCC 5.3.1)
Bazel 0.4.3 (built from source)
GeForce GTX680 (compute capability 3.0)
Cuda Toolkit 8.0
cuDNN 5.1

I had to modify a few configuration files so that the configure script could complete successfully:

diff --git a/configure b/configure
index a8e7bb773..002094aba 100755
--- a/configure
+++ b/configure
@@ -39,7 +39,7 @@ function bazel_clean_and_fetch() {
   # bazel clean --expunge currently doesn't work on Windows
   # TODO(pcloudy): Re-enable it after bazel clean --expunge is fixed.
   if ! is_windows; then
-    bazel clean --expunge
+    bazel clean --expunge_async
   fi
   bazel fetch "//tensorflow/... -//tensorflow/examples/android/..."
 }

diff --git a/tensorflow/core/platform/default/build_config.bzl b/tensorflow/core/platform/default/build_config.bzl
index ebf835d11..824471640 100644
--- a/tensorflow/core/platform/default/build_config.bzl
+++ b/tensorflow/core/platform/default/build_config.bzl
@@ -8,7 +8,7 @@ load("//tensorflow:tensorflow.bzl", "if_not_mobile")
 WITH_GCP_SUPPORT = False
 WITH_HDFS_SUPPORT = False
 WITH_XLA_SUPPORT = False
-WITH_JEMALLOC = True
+WITH_JEMALLOC = False

 # Appends a suffix to a list of deps.
 def tf_deps(deps, suffix):

diff --git a/tensorflow/tensorflow.bzl b/tensorflow/tensorflow.bzl
index 7fa7e4a91..ef41f5cd9 100644
--- a/tensorflow/tensorflow.bzl
+++ b/tensorflow/tensorflow.bzl
@@ -714,7 +714,8 @@ def tf_custom_op_library(name, srcs=[], gpu_srcs=[], deps=[]):
   )

 def tf_extension_linkopts():
-  return []  # No extension link opts
+  #return []  # No extension link opts
+  return ["-lrt"] 

 def tf_extension_copts():
   return []  # No extension c opts

diff --git a/third_party/gpus/crosstool/CROSSTOOL.tpl b/third_party/gpus/crosstool/CROSSTOOL.tpl
index b77a45c32..e1fb068a2 100644
--- a/third_party/gpus/crosstool/CROSSTOOL.tpl
+++ b/third_party/gpus/crosstool/CROSSTOOL.tpl
@@ -56,6 +56,8 @@ toolchain {
   cxx_flag: "-std=c++11"
   linker_flag: "-Wl,-no-as-needed"
   linker_flag: "-lstdc++"
+  linker_flag: "-lm"
+  linker_flag: "-lrt"
   linker_flag: "-B/usr/bin/"

When I ask Bazel to build TensorFlow, I run into a strange problem. If I use --config=cuda8.0, the build completes, but the GPU is neither detected nor used.
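
One way to confirm whether a finished build sees the GPU is to list the devices TensorFlow registers. The following is a minimal sketch, not something taken from this thread: it assumes the `device_lib` module under `tensorflow.python.client`, which is an internal helper rather than a documented public API, and `gpu_device_names` and `list_gpus` are hypothetical names chosen for illustration.

```python
def gpu_device_names(devices):
    """Return the names of all devices whose type is GPU."""
    return [d.name for d in devices if d.device_type == "GPU"]


def list_gpus():
    """Ask TensorFlow which GPUs it registered (needs TensorFlow installed)."""
    # Assumption: device_lib is an internal TensorFlow module, not a stable API.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())
```

Calling `list_gpus()` inside the virtualenv that holds the freshly built package and getting an empty list back would match the symptom of a build that completes without GPU support.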

If I use --config=cuda, the build fails with the following error:

ERROR: /home/emt1627/.cache/bazel/_bazel_emt1627/aeec3eab67314b40e280b02ed0028dfc/external/nasm/BUILD:8:1: undeclared inclusion(s) in rule '@nasm//:nasm':
this rule is missing dependency declarations for the following files included by 'external/nasm/regvals.c':
  '/opt/rh/devtoolset-4/root/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include/stddef.h'
  '/opt/rh/devtoolset-4/root/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include/stdarg.h'
  '/opt/rh/devtoolset-4/root/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include/stdint.h'.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.

If I run it again, I get a similar error:

ERROR: /home/emt1627/.cache/bazel/_bazel_emt1627/aeec3eab67314b40e280b02ed0028dfc/external/nasm/BUILD:8:1: undeclared inclusion(s) in rule '@nasm//:nasm':
this rule is missing dependency declarations for the following files included by 'external/nasm/iflag.c':
  '/opt/rh/devtoolset-4/root/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include/stdint.h'
  '/opt/rh/devtoolset-4/root/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include/stddef.h'
  '/opt/rh/devtoolset-4/root/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include/stdarg.h'.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.

I have also tried different combinations of the Cuda Toolkit and the cuDNN library, but none of them got me any closer to a solution.

gunan commented 7 years ago

We do not have official support for RHEL (however, I think I have a solution below). Bazel is complaining that the "nasm" package, which TensorFlow depends on, is missing explicit build dependencies for the listed headers. The problem looks unrelated to CUDA or TensorFlow. Maybe it is related to this build file, but for core system libraries and headers, Bazel should not be asking for explicit dependencies.

A quick search shows me that this error looks very similar to this one: https://github.com/tensorflow/tensorflow/issues/3431#issuecomment-234131699

So in your case, let's try this. Right after this line: https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/crosstool/CROSSTOOL.tpl#L124 could you try adding: `cxx_builtin_include_directory: "/opt/rh/devtoolset-4/root/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include"`

Then run configure and build again. Please let me know if it works or not.
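
The directories that GCC treats as built-in can also be read from the compiler itself rather than typed by hand. The sketch below is only an illustration under a few assumptions: the devtoolset `gcc` is the first one on `PATH`, its verbose output uses the usual "search starts here" markers, and the helper names are hypothetical.

```python
import subprocess


def parse_include_dirs(verbose_output):
    """Extract directories listed between gcc's <...> include-search markers."""
    dirs, in_list = [], False
    for line in verbose_output.splitlines():
        if line.startswith("#include <...> search starts here:"):
            in_list = True
        elif line.startswith("End of search list."):
            in_list = False
        elif in_list and line.strip():
            dirs.append(line.strip())
    return dirs


def crosstool_lines(dirs):
    """Format directories as CROSSTOOL cxx_builtin_include_directory entries."""
    return ['  cxx_builtin_include_directory: "%s"' % d for d in dirs]


def gcc_include_dirs(gcc="gcc"):
    """Preprocess empty C++ input verbosely and parse the search list on stderr."""
    proc = subprocess.Popen(
        [gcc, "-E", "-x", "c++", "-", "-v"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True)
    _, err = proc.communicate("")
    return parse_include_dirs(err)
```

Printing `"\n".join(crosstool_lines(gcc_include_dirs()))` would give candidate lines to paste into CROSSTOOL.tpl; whether every listed directory is actually needed still has to be checked against the build errors.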

Ettrai commented 7 years ago

@gunan thank you for your reply! I did as you suggested and that error has not shown up anymore.

I now have a new error that I have just started to troubleshoot:

ERROR: /home/emt1627/.cache/bazel/_bazel_emt1627/aeec3eab67314b40e280b02ed0028dfc/external/nasm/BUILD:8:1: Linking of rule '@nasm//:nasm' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command
  (cd /home/emt1627/.cache/bazel/_bazel_emt1627/aeec3eab67314b40e280b02ed0028dfc/execroot/tensorflow-nightly && \
  exec env - \
    LD_LIBRARY_PATH=/home/emt1627/opt/cudnn-8.0-linux-x64-v5.1:/usr/lib64/nvidia:/home/emt1627/opt/cuda-8.0/lib64:/opt/rh/devtoolset-4/root/usr/lib64:/opt/rh/devtoolset-4/root/usr/lib:/opt/rh/python27/root/usr/lib64 \
    PATH=/home/emt1627/virtualenv/tensorflow-nightly-GPU/bin:/home/emt1627/opt/cuda-8.0/bin:/home/emt1627/opt/git-2.11/bin:/home/emt1627/opt/htop-2.0.2/bin:/home/emt1627/opt/jdk1.8.0_112/bin:/home/emt1627/opt/bazel-0.4.3-dist/output:/sbin:/usr/sbin:/usr/local/sbin:/opt/rh/devtoolset-4/root/usr/bin:/opt/rh/rh-java-common/root/usr/bin:/opt/rh/python27/root/usr/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin \
  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -o bazel-out/host/bin/external/nasm/nasm -Wl,-no-as-needed -B/usr/bin/ -pie -Wl,-z,relro,-z,now -no-canonical-prefixes -pass-exit-codes '-Wl,--build-id=md5' '-Wl,--hash-style=gnu' -Wl,-S -Wl,--gc-sections -Wl,@bazel-out/host/bin/external/nasm/nasm-2.params): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
/usr/bin/ld: unrecognized option '-plugin'
/usr/bin/ld: use the --help option for usage information
collect2: error: ld returned 1 exit status
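
The link command above passes -B/usr/bin/, so the linker that rejects '-plugin' is the one in /usr/bin. A rough way to compare linkers is to ask each one for its help text and look for the option. This is a sketch under assumptions: it relies on GNU ld listing `-plugin` in its `--help` output when the option is supported, and the function names are hypothetical.

```python
import subprocess


def mentions_plugin_option(help_text):
    """Return True if any line of the linker help text starts with -plugin."""
    return any(line.strip().startswith("-plugin")
               for line in help_text.splitlines())


def linker_supports_plugin(ld_path):
    """Run `<ld_path> --help` and look for the -plugin option."""
    out = subprocess.check_output([ld_path, "--help"], universal_newlines=True)
    return mentions_plugin_option(out)
```

Comparing `linker_supports_plugin("/usr/bin/ld")` with the result for the `ld` under `/opt/rh/devtoolset-4/root/usr/bin` (assuming devtoolset ships one there) would show whether pointing the toolchain at a different linker is worth trying.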

gunan commented 7 years ago

Not sure what is going on there. Maybe this helps? http://stackoverflow.com/questions/24890865/usr-bin-ld-unrecognized-option-plugin-error

Ettrai commented 7 years ago

I ended up using this solution: https://github.com/bazelbuild/bazel/issues/361

I managed to compile the nightly version of TensorFlow, and I am now rebuilding r0.12.1 to try some of the bundled examples (e.g. models/images/mnist/convolutional.py).

I tried to run those with the nightly version, but I ran into this issue: https://github.com/tensorflow/models/issues/857

As soon as I manage to run that example properly I will post my diff.

Ettrai commented 7 years ago

I managed to build TensorFlow 0.12.1 with GPU support on the following configuration:

Red Hat EL 6.8 (no root access)
Python 2.7.8
virtualenv 13.1.0
devtoolset-4 (GCC 5.3.1)
Bazel 0.4.3 (built from source)
GeForce GTX680 (compute capability 3.0)
Cuda Toolkit 8.0
cuDNN 5.1

This is my final diff:

diff --git a/configure b/configure
index 3fc0b5909..33e73b8d0 100755
--- a/configure
+++ b/configure
@@ -22,7 +22,7 @@ function bazel_clean_and_fetch() {
   # bazel clean --expunge currently doesn't work on Windows
   # TODO(pcloudy): Re-enable it after bazel clean --expunge is fixed.
   if ! is_windows; then
-    bazel clean --expunge
+    bazel clean --expunge_async
   fi
   bazel fetch //tensorflow/...
 }

diff --git a/tensorflow/tensorflow.bzl b/tensorflow/tensorflow.bzl
index d78cb7b57..42bf7c8b6 100644
--- a/tensorflow/tensorflow.bzl
+++ b/tensorflow/tensorflow.bzl
@@ -792,7 +792,7 @@ def tf_custom_op_library(name, srcs=[], gpu_srcs=[], deps=[]):
   )

 def tf_extension_linkopts():
-  return []  # No extension link opts
+  return ["-lrt"]

 def tf_extension_copts():
   return []  # No extension c opts

diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 06e16cdb0..d1ac0544e 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -228,7 +228,7 @@ def tf_workspace(path_prefix = "", tf_repo_name = ""):

   native.new_http_archive(
     name = "zlib_archive",
-    url = "http://zlib.net/zlib-1.2.8.tar.gz",
+    url = "http://zlib.net/fossils/zlib-1.2.8.tar.gz",
     sha256 = "36658cb768a54c1d4dec43c3116c27ed893e88b02ecfcb44f2166f9c0b7f2a0d",
     strip_prefix = "zlib-1.2.8",
     build_file = str(Label("//:zlib.BUILD")),

diff --git a/third_party/gpus/crosstool/CROSSTOOL.tpl b/third_party/gpus/crosstool/CROSSTOOL.tpl
index 3ce6b74a5..06e572691 100644
--- a/third_party/gpus/crosstool/CROSSTOOL.tpl
+++ b/third_party/gpus/crosstool/CROSSTOOL.tpl
@@ -55,7 +55,9 @@ toolchain {
   # and the device compiler to use "-std=c++11".
   cxx_flag: "-std=c++11"
   linker_flag: "-lstdc++"
-  linker_flag: "-B/usr/bin/"
+  linker_flag: "-lm"
+  linker_flag: "-lrt"
+  linker_flag: "-B/opt/rh/devtoolset-4/root/usr/bin"

 %{gcc_host_compiler_includes}
   tool_path { name: "gcov" path: "/usr/bin/gcov" }
@@ -121,6 +123,8 @@ toolchain {

   # Include directory for cuda headers.
   cxx_builtin_include_directory: "%{cuda_include_path}"
+  cxx_builtin_include_directory: "/opt/rh/devtoolset-4/root/usr/lib"
+  cxx_builtin_include_directory: "/opt/rh/devtoolset-4/root/usr/include"

   compilation_mode_flags {
     mode: DBG
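
Because the diff above changes the zlib download URL while keeping the same sha256, it can be reassuring to check a manually downloaded archive against that checksum. This is a minimal sketch using only the standard library; `sha256_of` and `archive_matches` are hypothetical helper names, and the local file path is whatever the archive was saved as.

```python
import hashlib

# Checksum copied from the zlib_archive rule in the diff above.
ZLIB_SHA256 = "36658cb768a54c1d4dec43c3116c27ed893e88b02ecfcb44f2166f9c0b7f2a0d"


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches(path, expected=ZLIB_SHA256):
    """Return True if the file at `path` has the expected SHA-256 digest."""
    return sha256_of(path) == expected
```

For example, `archive_matches("zlib-1.2.8.tar.gz")` should be true for an intact copy of the archive, since the expected digest is the one the workspace rule itself declares.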

I hope this helps, and thank you @gunan!

gunan commented 7 years ago

I will try to see if I can add this to either an FAQ in our docs, or incorporate the modifications through our bazel switches. I will keep the issue open until then.

Sinan81 commented 7 years ago

Thanks for reporting this bug!

Ettrai commented 7 years ago

@Sinan81, I hope this saved you some time!

gunan commented 7 years ago

Looks like this problem is resolved. I suspect some of the problems we ran into here were due to you not having full access to the system. I was able to run a test build in a CentOS Docker container without any modifications. But I will try to see if I can reproduce your problems.

Thanks for patiently working through all the issues and documenting your steps here!

tensorflow/tensorflow, issue #7118