tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Support for Redhat, Centos and many superclusters #110

Closed trungnt13 closed 8 years ago

trungnt13 commented 8 years ago

Many cluster systems (using modules) run Red Hat or CentOS < 7, which ships glibc 2.12.

Since bazel requires glibc 2.14 and the prebuilt version for Linux requires glibc 2.17, it is hopeless to make tensorflow run on these clusters.

Referred to this issue reported on bazel: https://github.com/bazelbuild/bazel/issues/583
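For readers who want to see where a given machine stands against the two version floors quoted above, here is a minimal sketch in Python. The function names and the REQUIRED table are illustrative and not part of bazel or tensorflow; the only facts taken from this thread are the 2.14 and 2.17 minimums.

```python
import os

# Minimum glibc versions quoted in this thread (bazel: 2.14, prebuilt TensorFlow: 2.17).
REQUIRED = {"bazel": (2, 14), "prebuilt tensorflow": (2, 17)}


def parse_glibc_version(text):
    """Parse a string such as 'glibc 2.12' into a tuple like (2, 12)."""
    version = text.split()[-1]
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def unsupported(glibc, required=REQUIRED):
    """Return the names whose minimum glibc is newer than the one given."""
    return sorted(name for name, minimum in required.items() if glibc < minimum)


def local_glibc():
    """Ask the C library for its version; returns None on non-glibc systems."""
    try:
        return parse_glibc_version(os.confstr("CS_GNU_LIBC_VERSION"))
    except (ValueError, OSError, AttributeError):
        return None


if __name__ == "__main__":
    found = local_glibc()
    print("glibc:", found)
    if found is not None:
        print("too old for:", unsupported(found))
```

On a CentOS 6 node reporting glibc 2.12, both entries would be listed as too old, which is the situation described in this issue.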

vrv commented 8 years ago

Since we depend on bazel, this sounds like a bazel issue.

Feel free to re-open if bazel ends up supporting 2.12 or lower, and we can see what we can do.

alantus commented 8 years ago

Am I right that you depend on bazel only at build-time? If this is true then it can be viewed as something you could do something about too... You could also release static-linked packages that would be very useful to people stuck on clusters with old libraries...

urimerhav commented 8 years ago

So did anyone find some way past this problem? I'm using redhat 6.4, as is my entire corporation. We're stuck on redhat 6.4. I'm not sure how to end up running tensorflow on such a machine...

ttrouill commented 8 years ago

I managed to have it running on a CentOS 6.7 : http://stackoverflow.com/a/34897674/1990516 :) Tell me if it works for you.

Edit: I proposed an alternative solution also: http://stackoverflow.com/a/34900471/1990516

urimerhav commented 8 years ago

Thanks man! I'll look into it as soon as I can.

altaetran commented 8 years ago

Could you let me know if this worked? I can't seem to get any of these other solutions working.

urimerhav commented 8 years ago

@ttrouill only says he got it working on 6.7, so I didn't actually check whether this works on 6.4...

rdipietro commented 8 years ago

Both solutions seem to work, but they're not optimal. TensorFlow and Python seem to run okay, but if I try and run IPython, then with the first solution I get an Invalid ELF error, and with the second solution there is a memory leak and IPython continues to absorb all memory with time. I believe that this can also happen with other Python imports that rely on libraries that were compiled using the older libc.

I'd love to see a straightforward how-to-compile-bazel-with-old-glibc guide, but I haven't come across one yet.

rdipietro commented 8 years ago

Also https://github.com/bazelbuild/bazel/issues/760 is relevant, but it's far from straightforward and my attempt to build bazel using this guide failed. Hopefully within the next few weeks I can give it some more time and continue that thread with the errors I end up getting.

rdipietro commented 8 years ago

Compiling on CentOS still isn't all that straightforward, but I figured I'd give an overview here for now. This works for me with CentOS 6.7 and gcc 4.8.2, with GPU support (Cuda 7.0, cuDNN 4.0.7). A bazel modification for building with a custom gcc is in the works (https://github.com/bazelbuild/bazel/issues/760) and should help streamline this later on.

The instructions here are specific to my base gcc path of /cm/shared/apps/gcc/4.8.2, but it should work for other configurations just by modifying the base path.

Paths for reference:

  • gcc path: /cm/shared/apps/gcc/4.8.2/bin/gcc
  • cpp path: /cm/shared/apps/gcc/4.8.2/bin/cpp
  • lib64 path: /cm/shared/apps/gcc/4.8.2/lib64
  • include1 dir: /cm/shared/apps/gcc/4.8.2/lib/gcc/x86_64-unknown-linux-gnu/4.8.2/include
  • include2 dir: /cm/shared/apps/gcc/4.8.2/lib/gcc/x86_64-unknown-linux-gnu/4.8.2/include-fixed
  • include3 dir: /cm/shared/apps/gcc/4.8.2/include/c++/4.8.2

Bazel

  1. git clone https://github.com/bazelbuild/bazel.git && cd bazel
  2. Edit tools/cpp/CROSSTOOL
    • Replace all occurrences of /usr/bin/gcc with gcc path
    • Replace all occurrences of /usr/bin/cpp with cpp path
    • After the toolpath containing gcc path, add the lines
      • linker_flag: "-Wl,-Rlib64 path"
      • cxx_builtin_include_directory: "include1 dir"
      • cxx_builtin_include_directory: "include2 dir"
      • cxx_builtin_include_directory: "include3 dir"
  3. Edit scripts/bootstrap/buildenv.sh
    • Comment out atexit "rm -fr ${DIR}"
  4. export EXTRA_BAZEL_ARGS='-s --verbose_failures --ignore_unsupported_sandboxing --genrule_strategy=standalone --spawn_strategy=standalone --jobs 8'
  5. ./compile.sh
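The CROSSTOOL edits in step 2 can be scripted instead of done by hand. The following is only a sketch: patch_crosstool and extra_lines are hypothetical helper names, and the code assumes that the gcc tool_path entry sits on a single line of the file. The paths are the ones listed in the comment above.

```python
GCC_ROOT = "/cm/shared/apps/gcc/4.8.2"  # base gcc path from the comment above
GCC_VERSION = "4.8.2"
TARGET = "x86_64-unknown-linux-gnu"


def extra_lines(root=GCC_ROOT, version=GCC_VERSION, target=TARGET):
    """Lines to add after the tool_path entry that names gcc."""
    gcc_lib = "{0}/lib/gcc/{1}/{2}".format(root, target, version)
    return [
        'linker_flag: "-Wl,-R{0}/lib64"'.format(root),
        'cxx_builtin_include_directory: "{0}/include"'.format(gcc_lib),
        'cxx_builtin_include_directory: "{0}/include-fixed"'.format(gcc_lib),
        'cxx_builtin_include_directory: "{0}/include/c++/{1}"'.format(root, version),
    ]


def patch_crosstool(text, root=GCC_ROOT):
    """Return CROSSTOOL text with gcc/cpp repointed and the extra lines added."""
    gcc = root + "/bin/gcc"
    text = text.replace("/usr/bin/gcc", gcc).replace("/usr/bin/cpp", root + "/bin/cpp")
    out = []
    for line in text.splitlines():
        out.append(line)
        # Assumption: the gcc tool_path entry is written on one line.
        if "tool_path" in line and gcc in line:
            indent = line[: len(line) - len(line.lstrip())]
            out.extend(indent + extra for extra in extra_lines(root))
    return "\n".join(out) + "\n"
```

Typical use would be to read tools/cpp/CROSSTOOL, pass its contents through patch_crosstool, and write the result back. Note that this sketch adds the extra lines after every matching tool_path entry, not only the first one.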

TensorFlow

  1. git clone --recurse-submodules https://github.com/tensorflow/tensorflow && cd tensorflow
  2. Edit third_party/gpus/crosstool/CROSSTOOL, making the same changes we made for Bazel. (/usr/bin/gcc etc. likely won't need to be replaced, though.)
  3. Edit third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc
    • Replace all /usr/bin/gcc with gcc path.
    • Undo the temporary "fix" to find as by commenting out the line cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd. (For me, this is necessary to find as.)
  4. ./configure
  5. export EXTRA_BAZEL_ARGS='-s --verbose_failures --ignore_unsupported_sandboxing --genrule_strategy=standalone --spawn_strategy=standalone --jobs 8'
  6. bazel build -c opt --config=cuda --linkopt '-lrt' --copt="-DGPR_BACKWARDS_COMPATIBILITY_MODE" --conlyopt="-std=c99" //tensorflow/tools/pip_package:build_pip_package
    • Why the strange flags? Because otherwise, after building with the older libc, we'll get an error about secure_getenv.
  7. bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
  8. pip install ~/tensorflow_pkg/*
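Step 3 of the TensorFlow list (repointing gcc and commenting out the PATH line in the wrapper script) could be automated along these lines. This is a sketch under assumptions: patch_wrapper is a hypothetical helper, and it only recognises the PATH line when it appears exactly as quoted in step 3.

```python
PATH_LINE = "cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd"  # the line quoted in step 3


def patch_wrapper(source, gcc_path):
    """Return wrapper source with /usr/bin/gcc replaced and the PATH line commented out."""
    out = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == PATH_LINE:
            # Keep the original indentation so the surrounding block stays valid.
            indent = line[: len(line) - len(stripped)]
            line = indent + "# " + stripped
        out.append(line.replace("/usr/bin/gcc", gcc_path))
    return "\n".join(out) + "\n"
```

The function works on text only, so the caller decides which file to read and write; a backup copy of crosstool_wrapper_driver_is_not_gcc is worth keeping before overwriting it.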

rdipietro commented 8 years ago

Update: Previous process was for a commit after release 7.

Here are necessary changes for commit 1d4fd06, which is after release 8:

  1. You need Bazel 0.2.x. As of this writing, with appropriate environment variables, Bazel at HEAD compiles simply with ./compile.sh. Thank you @damienmg !
  2. You still need to make the above changes to the TensorFlow files, including the changes to CROSSTOOL etc. (For some reason the bazel auto config doesn't work here.)
  3. Edit third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc and replace #!/usr/bin/env python2.7 with #!/usr/bin/env /full/path/to/python2.7. This is a hack to avoid bazel's confined environment from failing to pick up our custom Python location.
  4. Edit bazel-out/host/bin/tensorflow/swig and add export LD_LIBRARY_PATH=custom:paths:$LD_LIBRARY_PATH before swig is run. Otherwise swig won't find libraries that exist in our LD_LIBRARY_PATH. This is another hack to get around the confined environment.
  5. Use the same bazel build command from above: bazel build -c opt --config=cuda --linkopt '-lrt' --copt="-DGPR_BACKWARDS_COMPATIBILITY_MODE" --conlyopt="-std=c99" //tensorflow/tools/pip_package:build_pip_package
  6. cd bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles and then cp -r __main__/* . (that is, copy into the current directory). This is a hack associated with https://github.com/tensorflow/tensorflow/issues/2040.
  7. Finally we can bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg, and
  8. pip install ~/tensorflow_pkg/*
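The shebang change in step 3 of this update is easy to get wrong by hand, so here is a small sketch of it. pin_shebang is a hypothetical helper name; it only rewrites the first line when that line is exactly the #!/usr/bin/env python2.7 shebang mentioned above, and leaves anything else alone.

```python
OLD_SHEBANG = "#!/usr/bin/env python2.7"


def pin_shebang(source, interpreter):
    """Point the env shebang at a full interpreter path, as described in step 3."""
    lines = source.splitlines(True)  # keep line endings
    if lines and lines[0].strip() == OLD_SHEBANG:
        lines[0] = "#!/usr/bin/env " + interpreter + "\n"
    return "".join(lines)
```

The interpreter argument stands for the /full/path/to/python2.7 placeholder from step 3; the real path depends on the cluster.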

trungnt13 commented 8 years ago

Our administrator managed to run the pip-installed tensorflow package on a RHEL 6.7 server (without building bazel and tensorflow from source). The core idea is to get a separate, newer version of glibc:

Fast test:

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
a = tf.constant(10)
b = tf.constant(32)
print(sess.run(a + b))

Note: this approach is only for running Python scripts. Remember that every time you add $libcroot to your path, all shell commands break (i.e. you cannot use ls, cd, ...). You might start bash -l, screen, or byobu before you try this so you don't mess up your own session.
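One way to keep the newer glibc away from the rest of the shell is to run only the Python process through the newer dynamic loader. The sketch below just builds such a command line and is not the administrator's exact method. The directory layout under the glibc root, the loader file name, and loader_command itself are assumptions; --library-path is a standard option of the glibc loader.

```python
import posixpath


def loader_command(libc_root, python, args, extra_lib_dirs=(),
                   loader="ld-linux-x86-64.so.2"):
    """Build an argv that starts one Python process under a separate glibc.

    Assumption: the newer glibc was unpacked so that its libraries and its
    loader both live in <libc_root>/lib.
    """
    lib_dir = posixpath.join(libc_root, "lib")
    # Other directories (for example the system lib64) can be appended so that
    # libraries unrelated to glibc are still found.
    library_path = ":".join([lib_dir] + list(extra_lib_dirs))
    return [posixpath.join(lib_dir, loader), "--library-path", library_path,
            python] + list(args)
```

The returned list can be handed to subprocess.call. Because nothing is exported into the calling shell, ls, cd and the other commands keep using the system glibc. The IPython and Jupyter problems reported in this thread may still apply.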

rdipietro commented 8 years ago

Yeah that was described here a while back, but as you mention, it's not ideal. For example if you run Jupyter it'll lead to a memory leak / crash (at least on the system I tried it with).

kskp commented 8 years ago

@rdipietro

Edit tools/cpp/CROSSTOOL

After the toolpath containing gcc path, add the lines

  • linker_flag: "-Wl,-Rlib64 path"
  • cxx_builtin_include_directory: "include1 dir"
  • cxx_builtin_include_directory: "include2 dir"
  • cxx_builtin_include_directory: "include3 dir"

Should these lines be added after every occurrence of the toolpath containing gcc path, i.e. twice, wherever I changed /usr/bin/gcc?

rdipietro commented 8 years ago

I don't know what you mean by twice. I'm pretty sure I only inserted those lines once, although if you were to insert them in multiple places it probably wouldn't do any harm.

damienmg commented 8 years ago

@kskp @rdipietro : is that still needed with latest version of Bazel? If yes then we have an issue in the C++ detection code.

rdipietro commented 8 years ago

Bazel compiles out of the box as long as I set CC correctly. I haven't tried with TensorFlow 0.9, but as of 0.8, I still had to make manual changes on CentOS.

damienmg commented 8 years ago

You mean change to the cuda crosstool file?

rdipietro commented 8 years ago

Yes. My May 17 comment above includes everything I needed to do. Specifically, needed to edit CROSSTOOL and needed to introduce two hacks to get bazel to find things outside of its isolated environment.

kskp commented 8 years ago

@rdipietro Thanks for your reply. Sorry for my ignorance, but could you please tell me what toolpath is? I am assuming it is the block of code where the gcc path had to be changed. I did that twice in the entire file (since it said to replace all occurrences of /usr/bin/gcc). So do I have to add those lines after the block of code where I changed the /usr/bin/gcc path?

kskp commented 8 years ago

@rdipietro @damienmg I am not using the latest version of Bazel. I need the 0.2.2b version. I ultimately have to run Syntaxnet on Cent OS 6.7.

damienmg commented 8 years ago

0.2.2b should work too.

kskp commented 8 years ago

Oh, I tried a couple of weeks ago but it did not work. Will do it again today. Thanks for your reply.

damienmg commented 8 years ago

note that you still have to do the CUDA CROSSTOOL modification for doing it with --config cuda

kskp commented 8 years ago

Oops, I am not configuring it with CUDA support. Is it a must?

damienmg commented 8 years ago

You need to update tensorflow's CROSSTOOL for CUDA support. @davidzchen is making the change to TF to have the same support but it has not yet landed.

davidzchen commented 8 years ago

FYI Here is the tracking bug for CUDA autoconfiguration: #2873.

It is partially working, but I still need to fix the remaining path issues, such as getting the Python SWIG wrapper to find the tensorflow library correctly.

kskp commented 8 years ago

@damienmg @rdipietro Bazel still does not compile.

Just for your information, my system info:

[sree@ds1 bazel]$ gcc -v
gcc version 4.8.2 20140120 (Red Hat 4.8.2-15) (GCC)

[sree@ds1 bazel]$ ldd --version
ldd (GNU libc) 2.12

[sree@ds1 bazel]$ which gcc
/usr/bin/gcc

[sree@ds1 bazel]$ g++ -v
gcc version 4.8.2 20140120 (Red Hat 4.8.2-15) (GCC)

[sree@ds1 bazel]$ which g++
/usr/bin/g++

To build bazel, I do the following:

  1. git clone https://github.com/bazelbuild/bazel.git
  2. cd bazel
  3. git tag -l
  4. git checkout tags/0.2.2b
  5. ./compile.sh

./compile.sh gives:

[sree@ds1 bazel]$ ./compile.sh
INFO: You can skip this first step by providing a path to the bazel binary as second argument:
INFO: ./compile.sh compile /path/to/bazel
🍃 Building Bazel from scratch......
🍃 Building Bazel with Bazel.
INFO: Found 1 target...
ERROR: /home/sree/bazel/src/main/cpp/util/BUILD:24:1: C++ compilation of rule '//src/main/cpp/util:md5' failed: gcc failed: error executing command
  (cd /tmp/bazel.NO5ObMNe/out/bazel && \
  exec env - \
  PATH=/home/sree/anaconda2/bin:/home/sree/bazel:/opt/jdk1.8.0_91/bin:/opt/jdk1.8.0_91/jre/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/sree/bin \
  /usr/bin/gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -Wall -Wl,-z,-relro,-z,now -B/usr/bin -B/usr/bin -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer '-std=c++0x' -iquote . -iquote bazel-out/local-fastbuild/genfiles -iquote external/bazel_tools -iquote bazel-out/local-fastbuild/genfiles/external/bazel_tools -isystem external/bazel_tools/tools/cpp/gcc3 -no-canonical-prefixes -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' '-frandom-seed=bazel-out/local-fastbuild/bin/src/main/cpp/util/_objs/md5/src/main/cpp/util/md5.pic.o' -MD -MF bazel-out/local-fastbuild/bin/src/main/cpp/util/_objs/md5/src/main/cpp/util/md5.pic.d -fPIC -c src/main/cpp/util/md5.cc -o bazel-out/local-fastbuild/bin/src/main/cpp/util/_objs/md5/src/main/cpp/util/md5.pic.o): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
gcc: error trying to exec 'cc1plus': execvp: No such file or directory
Target //src:bazel failed to build
INFO: Elapsed time: 3.147s, Critical Path: 0.07s

Building output/bazel

Am I even doing it right? I did not make any changes to tools/cpp/CROSSTOOL file.

damienmg commented 8 years ago

What does echo | gcc -E -xc++ - -v return?

kskp commented 8 years ago

@damienmg

Using built-in specs.
COLLECT_GCC=gcc
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/opt/rh/devtoolset-2/root/usr --mandir=/opt/rh/devtoolset-2/root/usr/share/man --infodir=/opt/rh/devtoolset-2/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --enable-languages=c,c++,fortran,lto --enable-plugin --with-linker-hash-style=gnu --enable-initfini-array --disable-libgcj --with-isl=/dev/shm/home/centos/rpm/BUILD/gcc-4.8.2-20140120/obj-x86_64-redhat-linux/isl-install --with-cloog=/dev/shm/home/centos/rpm/BUILD/gcc-4.8.2-20140120/obj-x86_64-redhat-linux/cloog-install --with-mpc=/dev/shm/home/centos/rpm/BUILD/gcc-4.8.2-20140120/obj-x86_64-redhat-linux/mpc-install --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.2 20140120 (Red Hat 4.8.2-15) (GCC)
COLLECT_GCC_OPTIONS='-E' '-v' '-mtune=generic' '-march=x86-64'
 cc1plus -E -quiet -v -iprefix /usr/bin/../lib/gcc/x86_64-redhat-linux/4.8.2/ -D_GNU_SOURCE - -mtune=generic -march=x86-64
gcc: error trying to exec 'cc1plus': execvp: No such file or directory

kskp commented 8 years ago

Also, I installed gcc 4.8.2 using the instructions given at: http://superuser.com/questions/381160/how-to-install-gcc-4-7-x-4-8-x-on-centos. And since nothing happened, I did the following:

sudo mv /usr/bin/gcc /usr/bin/gcc.bak
sudo cp /opt/rh/devtoolset-2/root/usr/bin/gcc /usr/bin/gcc
sudo mv /usr/bin/g++ /usr/bin/g++.bak
sudo cp /opt/rh/devtoolset-2/root/usr/bin/g++ /usr/bin/g++

damienmg commented 8 years ago

export CC=/opt/rh/devtoolset-2/root/usr/bin/gcc
./compile.sh

should work (at least it works in our integration test).

I believe the cp made gcc a bit confused.
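The "error trying to exec 'cc1plus'" message means the gcc driver could not locate its C++ front end. A quick way to compare the copied /usr/bin/gcc with the devtoolset gcc is to ask each driver where it would find cc1plus. The sketch below relies on gcc's -print-prog-name option, which prints a full path when the program is found and only the bare name otherwise; the helper names are made up for this example.

```python
import os
import subprocess


def resolved(prog_name_output):
    """True if gcc printed an absolute path to an existing file, not a bare name."""
    path = prog_name_output.strip()
    return os.path.isabs(path) and os.path.isfile(path)


def can_find_cc1plus(gcc):
    """Ask a gcc driver where it would look for its C++ front end."""
    try:
        out = subprocess.check_output([gcc, "-print-prog-name=cc1plus"])
    except (OSError, subprocess.CalledProcessError):
        return False  # the driver itself is missing or failed to run
    return resolved(out.decode())


if __name__ == "__main__":
    for gcc in ("/usr/bin/gcc", "/opt/rh/devtoolset-2/root/usr/bin/gcc"):
        print(gcc, "->", can_find_cc1plus(gcc))
```

If the first path reports False and the second True, that is consistent with the suggestion above to set CC to the devtoolset compiler, not to copy its binaries into /usr/bin.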

kskp commented 8 years ago

Thanks, Now I have different errors:

[sree@ds1 bazel]$ ./compile.sh
INFO: You can skip this first step by providing a path to the bazel binary as second argument:
INFO: ./compile.sh compile /path/to/bazel
🍃 Building Bazel from scratch......
🍃 Building Bazel with Bazel.
INFO: Found 1 target...
ERROR: /home/sree/bazel/src/main/tools/BUILD:3:1: C++ compilation of rule '//src/main/tools:network-tools' failed: gcc failed: error executing command
  (cd /tmp/bazel.7v8MzbLT/out/bazel && \
  exec env - \
  PATH=/home/sree/anaconda2/bin:/home/sree/bazel:/opt/jdk1.8.0_91/bin:/opt/jdk1.8.0_91/jre/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/sree/bin \
  /opt/rh/devtoolset-2/root/usr/bin/gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -Wall -Wl,-z,-relro,-z,now -B/opt/rh/devtoolset-2/root/usr/bin -B/usr/bin -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -iquote . -iquote bazel-out/local-fastbuild/genfiles -iquote external/bazel_tools -iquote bazel-out/local-fastbuild/genfiles/external/bazel_tools -isystem external/bazel_tools/tools/cpp/gcc3 '-std=c99' -no-canonical-prefixes -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' '-frandom-seed=bazel-out/local-fastbuild/bin/src/main/tools/_objs/network-tools/src/main/tools/network-tools.pic.o' -MD -MF bazel-out/local-fastbuild/bin/src/main/tools/_objs/network-tools/src/main/tools/network-tools.pic.d -fPIC -c src/main/tools/network-tools.c -o bazel-out/local-fastbuild/bin/src/main/tools/_objs/network-tools/src/main/tools/network-tools.pic.o): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
cc1: error: unrecognized command line option '-quiet'
cc1: error: bazel-out/local-fastbuild/bin/src/main/tools/_objs/network-tools/src/main/tools/network-tools.pic.d: No such file or directory
cc1: error: unrecognized command line option '-quiet'
cc1: error: unrecognized command line option '-auxbase-strip bazel-out/local-fastbuild/bin/src/main/tools/_objs/network-tools/src/main/tools/network-tools.pic.o'
Target //src:bazel failed to build
INFO: Elapsed time: 3.917s, Critical Path: 0.31s

damienmg commented 8 years ago

What does echo | /opt/rh/devtoolset-2/root/usr/bin/gcc -E -xc++ - -v say?

It seems like your compiler doesn't like your own installation. Can you try to restore /usr/bin/gcc and /usr/bin/g++ to the default value?

kskp commented 8 years ago

[sree@ds1 ~]$ echo | /opt/rh/devtoolset-2/root/usr/bin/gcc -E -xc++ - -v
Using built-in specs.
COLLECT_GCC=/opt/rh/devtoolset-2/root/usr/bin/gcc
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/opt/rh/devtoolset-2/root/usr --mandir=/opt/rh/devtoolset-2/root/usr/share/man --infodir=/opt/rh/devtoolset-2/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --enable-languages=c,c++,fortran,lto --enable-plugin --with-linker-hash-style=gnu --enable-initfini-array --disable-libgcj --with-isl=/dev/shm/home/centos/rpm/BUILD/gcc-4.8.2-20140120/obj-x86_64-redhat-linux/isl-install --with-cloog=/dev/shm/home/centos/rpm/BUILD/gcc-4.8.2-20140120/obj-x86_64-redhat-linux/cloog-install --with-mpc=/dev/shm/home/centos/rpm/BUILD/gcc-4.8.2-20140120/obj-x86_64-redhat-linux/mpc-install --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.2 20140120 (Red Hat 4.8.2-15) (GCC)
COLLECT_GCC_OPTIONS='-E' '-v' '-mtune=generic' '-march=x86-64'
 /opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-redhat-linux/4.8.2/cc1plus -E -quiet -v -D_GNU_SOURCE - -mtune=generic -march=x86-64
ignoring nonexistent directory "/opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/4.8.2/include-fixed"
ignoring nonexistent directory "/opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/4.8.2/../../../../x86_64-redhat-linux/include"
#include "..." search starts here:
#include <...> search starts here:
 /opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/4.8.2/../../../../include/c++/4.8.2
 /opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/4.8.2/../../../../include/c++/4.8.2/x86_64-redhat-linux
 /opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/4.8.2/../../../../include/c++/4.8.2/backward
 /opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/4.8.2/include
 /usr/local/include
 /opt/rh/devtoolset-2/root/usr/include
 /usr/include
End of search list.
# 1 ""
# 1 ""
# 1 ""
COMPILER_PATH=/opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-redhat-linux/4.8.2/:/opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-redhat-linux/4.8.2/:/opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-redhat-linux/:/opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/4.8.2/:/opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/
LIBRARY_PATH=/opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/4.8.2/:/opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/4.8.2/../../../../lib64/:/lib/../lib64/:/usr/lib/../lib64/:/opt/rh/devtoolset-2/root/usr/lib/gcc/x86_64-redhat-linux/4.8.2/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-E' '-v' '-mtune=generic' '-march=x86-64'

Seems what you said is right. I will restore both files to their default values.

kskp commented 8 years ago

My which gcc says: /usr/bin/gcc
But echo $CC says: /opt/rh/devtoolset-2/root/usr/bin/gcc

And hence even after restoring older gcc, I still get gcc version as 4.8.2.

Did I ruin everything? I was super nervous that I might break the core by making changes to gcc on centos 6.

Is there a way I can rollback all the changes or can you point me to where I can get a good gcc latest version?

damienmg commented 8 years ago

gcc -v still says 4.8.2?

What does ./compile.sh result in now?

kskp commented 8 years ago

gcc -v is still 4.8.2

./compile.sh still results in an error:

[sree@ds1 bazel]$ ./compile.sh
INFO: You can skip this first step by providing a path to the bazel binary as second argument:
INFO: ./compile.sh compile /path/to/bazel
🍃 Building Bazel from scratch......
🍃 Building Bazel with Bazel.
INFO: Found 1 target...
ERROR: /home/sree/bazel/src/main/cpp/BUILD:53:1: C++ compilation of rule '//src/main/cpp:blaze_abrupt_exit' failed: gcc failed: error executing command
  (cd /tmp/bazel.HegZ1Mxo/out/bazel && \
  exec env - \
  PATH=/home/sree/anaconda2/bin:/home/sree/bazel:/opt/jdk1.8.0_91/bin:/opt/jdk1.8.0_91/jre/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/sree/bin \
  /usr/bin/gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -Wall -Wl,-z,-relro,-z,now -B/usr/bin -B/usr/bin -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer '-std=c++0x' -iquote . -iquote bazel-out/local-fastbuild/genfiles -iquote external/bazel_tools -iquote bazel-out/local-fastbuild/genfiles/external/bazel_tools -isystem external/bazel_tools/tools/cpp/gcc3 -no-canonical-prefixes -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' '-frandom-seed=bazel-out/local-fastbuild/bin/src/main/cpp/_objs/blaze_abrupt_exit/src/main/cpp/blaze_abrupt_exit.pic.o' -MD -MF bazel-out/local-fastbuild/bin/src/main/cpp/_objs/blaze_abrupt_exit/src/main/cpp/blaze_abrupt_exit.pic.d -fPIC -c src/main/cpp/blaze_abrupt_exit.cc -o bazel-out/local-fastbuild/bin/src/main/cpp/_objs/blaze_abrupt_exit/src/main/cpp/blaze_abrupt_exit.pic.o): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
gcc: error trying to exec 'cc1plus': execvp: No such file or directory
Target //src:bazel failed to build
INFO: Elapsed time: 3.592s, Critical Path: 0.12s

Building output/bazel

trungnt13 commented 8 years ago

TensorFlow builds successfully for CPU; however, the GPU build fails.

I keep getting this error, even though I changed every path in CROSSTOOL and crosstool_wrapper... from /usr/bin to my gcc path:

ERROR: /homeappl/home/trungnt/.cache/bazel/_bazel_trungnt/07601e513c2336fd42387644d3f95e2b/external/protobuf/BUILD:331:1: Linking of rule '@protobuf//:protoc' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command 
  (cd /homeappl/home/trungnt/.cache/bazel/_bazel_trungnt/07601e513c2336fd42387644d3f95e2b/execroot/tensorflow && \
  exec env - \
  third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -o bazel-out/host/bin/external/protobuf/protoc bazel-out/host/bin/external/protobuf/_objs/protoc/external/protobuf/src/google/protobuf/compiler/main.o bazel-out/host/bin/external/protobuf/libprotoc_lib.a bazel-out/host/bin/external/protobuf/libprotobuf.a bazel-out/host/bin/external/protobuf/libprotobuf_lite.a -lpthread -lstdc++ -B/appl/opt/gcc/4.9.1/bin/ -pie -Wl,-z,relro,-z,now -no-canonical-prefixes -pass-exit-codes '-Wl,--build-id=md5' '-Wl,--hash-style=gnu' -Wl,-S -Wl,--gc-sections): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
collect2: fatal error: cannot find 'ld'
compilation terminated.
Target //tensorflow/cc:tutorials_example_trainer failed to build
INFO: Elapsed time: 71.231s, Critical Path: 56.80s
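In the log above the wrapper is started through exec env - with no PATH at all, and only -B/appl/opt/gcc/4.9.1/bin/ is passed to it. One possible reading (not confirmed in this thread) is that ld is simply not reachable from that directory. The sketch below checks whether linker tools can be found on a given PATH-style string; missing_tools is a hypothetical helper and needs Python 3 for shutil.which.

```python
import shutil


def missing_tools(search_path, tools=("ld", "as")):
    """Return the tools that cannot be found on the given PATH-style string."""
    return [tool for tool in tools if shutil.which(tool, path=search_path) is None]
```

Calling missing_tools("/appl/opt/gcc/4.9.1/bin") on the cluster would show whether ld and as live next to that gcc. If they do not, the directory that does contain them is a candidate for the PATH handling inside crosstool_wrapper_driver_is_not_gcc discussed earlier in the thread.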

mukul1992 commented 8 years ago

Hello, @rdipietro : I am trying to install tensorflow/0.9.0 on a cluster running CentOS 6.7. I have bazel installed already. Here is the error I am getting.

ERROR: /gpfs_home/mdave/.cache/bazel/_bazel_mdave/541ff47a1a214f62e91d090e1e816e43/external/highwayhash/BUILD:17:1: C++ compilation of rule '@highwayhash//:sip_hash' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object ... (remaining 36 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 127.
/gpfs/runtime/opt/python/2.7.3/bin/python2.7: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
Target //tensorflow/tools/pip_package:build_pip_package failed to build

I suppose the fix for this, as mentioned by you in the step-wise directions is:

  1. Edit bazel-out/host/bin/tensorflow/swig and add export LD_LIBRARY_PATH=custom:paths:$LD_LIBRARY_PATH before swig is run. Otherwise swig won't find libraries that exist in our LD_LIBRARY_PATH. This is another hack to get around the confined environment.

This should add the python library path while setting up the build but I do not seem to find a file such as bazel-out/host/bin/tensorflow/swig in the source tree, while the bazel-out/host/bin/tensorflow directory does exist. If I create a file named swig myself and add the command to export the paths, it still does not work. Any ideas? I have followed all other steps as mentioned.

Thank you for the help. Your responses here have already been very helpful. :)

rdipietro commented 8 years ago

Hi @mukul1992

Sorry, I'm still working with 0.8, so haven't battled with the 0.9 changes yet.

Here is a suggestion:

Use --verbose_failures with bazel, so that error messages aren't truncated. Then sift through the failure to find out which script ends up causing the issue. Then try putting export LD_LIBRARY_PATH=your:custom:paths:$LD_LIBRARY_PATH at the top of that file.
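Sifting through a long --verbose_failures log by eye is tedious, so here is a small sketch that pulls out the dynamic-loader failures. It matches only the "error while loading shared libraries" wording that appears in this thread; missing_libraries is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   /path/to/python2.7: error while loading shared libraries: libfoo.so: cannot open ...
MISSING_LIB = re.compile(
    r"(?P<program>\S+): error while loading shared libraries: "
    r"(?P<library>\S+): cannot open shared object file"
)


def missing_libraries(log):
    """Return (program, library) pairs for every loader failure found in a build log."""
    return [(m.group("program"), m.group("library")) for m in MISSING_LIB.finditer(log)]
```

Each pair names the program that failed to start and the library it could not load. The directory that actually holds that library is what would go into the LD_LIBRARY_PATH export suggested above.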

Hopefully that might help. I don't think I'll have the time to get around to compiling 0.9 for a while. If that doesn't work, I suggest shooting back to 0.8 for now (assuming you don't need something that's cutting edge?).

mukul1992 commented 8 years ago

Hi @rdipietro , thanks for replying.

So, I switched back to 0.8. I am now using Bazel 0.3.0 (any previous version which would work better?). Here is the output. I am just including the ERROR part which is in Bold. Again, I did complete other steps. I cannot figure out where to add the LD_LIBRARY_PATH thing so that it picks up the libpython library.

Output:

[mdave@login001 tensorflow]$ bazel build -c opt --config=cuda --linkopt '-lrt' --copt="-DGPR_BACKWARDS_COMPATIBILITY_MODE" --conlyopt="-std=c99" //tensorflow/tools/pip_package:build_pip_package -s --verbose_failures --ignore_unsupported_sandboxing --genrule_strategy=standalone --spawn_strategy=standalone
Warning: ignoring LD_PRELOAD in environment.

INFO: Found 1 target...

@re2//:re2 [action 'Compiling external/re2/re2/compile.cc [for host]']

.(cd /gpfs_home/mdave/.cache/bazel/_bazel_mdave/c9818020e0087a4155dff2f5c73aa150/execroot/tensorflow && \ exec env - \ PATH=/gpfs/runtime/opt/git/2.2.1/bin:/gpfs/runtime/opt/gcc/4.9.2/bin:/gpfs/runtime/opt/java/8u66/bin:/gpfs/runtime/opt/bazel/0.3.0/bin:/gpfs/runtime/opt/matlab/R2014a/bin:/gpfs/runtime/opt/perl/5.18.2/bin:/gpfs/runtime/opt/python/2.7.3/bin:/gpfs/runtime/opt/intel/2013.1.106/bin:/gpfs/runtime/opt/centos-updates/6.3/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/ibutils/bin:/gpfs/runtime/bin:/users/mdave/bin \ third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections -g0 '-std=c++11' '-frandom-seed=bazel-out/host/bin/external/re2/_objs/re2/external/re2/re2/compile.o' -iquote external/re2 -iquote bazel-out/host/genfiles/external/re2 -iquote external/bazel_tools -iquote bazel-out/host/genfiles/external/bazel_tools -isystem external/re2 -isystem bazel-out/host/genfiles/external/re2 -isystem external/bazel_tools/tools/cpp/gcc3 -no-canonical-prefixes -Wno-builtin-macro-redefined '-DDATE="redacted"' '-DTIMESTAMP="redacted"' '-DTIME="redacted"' -fno-canonical-system-headers -MD -MF bazel-out/host/bin/external/re2/_objs/re2/external/re2/re2/compile.d -c external/re2/re2/compile.cc -o bazel-out/host/bin/external/re2/_objs/re2/external/re2/re2/compile.o)

ERROR: /gpfs_home/mdave/.cache/bazel/_bazel_mdave/c9818020e0087a4155dff2f5c73aa150/external/re2/BUILD:9:1: C++ compilation of rule '@re2//:re2' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command (cd /gpfs_home/mdave/.cache/bazel/_bazel_mdave/c9818020e0087a4155dff2f5c73aa150/execroot/tensorflow && \ exec env - \ PATH=/gpfs/runtime/opt/git/2.2.1/bin:/gpfs/runtime/opt/gcc/4.9.2/bin:/gpfs/runtime/opt/java/8u66/bin:/gpfs/runtime/opt/bazel/0.3.0/bin:/gpfs/runtime/opt/matlab/R2014a/bin:/gpfs/runtime/opt/perl/5.18.2/bin:/gpfs/runtime/opt/python/2.7.3/bin:/gpfs/runtime/opt/intel/2013.1.106/bin:/gpfs/runtime/opt/centos-updates/6.3/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/ibutils/bin:/gpfs/runtime/bin:/users/mdave/bin \ third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections -g0 '-std=c++11' '-frandom-seed=bazel-out/host/bin/external/re2/_objs/re2/external/re2/re2/compile.o' -iquote external/re2 -iquote bazel-out/host/genfiles/external/re2 -iquote external/bazel_tools -iquote bazel-out/host/genfiles/external/bazel_tools -isystem external/re2 -isystem bazel-out/host/genfiles/external/re2 -isystem external/bazel_tools/tools/cpp/gcc3 -no-canonical-prefixes -Wno-builtin-macro-redefined '-DDATE="redacted"' '-DTIMESTAMP="redacted"' '-DTIME="redacted"' -fno-canonical-system-headers -MD -MF bazel-out/host/bin/external/re2/_objs/re2/external/re2/re2/compile.d -c external/re2/re2/compile.cc -o bazel-out/host/bin/external/re2/_objs/re2/external/re2/re2/compile.o): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 127. 
/gpfs/runtime/opt/python/2.7.3/bin/python2.7: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory

Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 13.883s, Critical Path: 5.22s
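For anyone hitting the same `libpython2.7.so.1.0` message: the command above is launched through `exec env -` with only `PATH` set, so a variable such as `LD_LIBRARY_PATH` from a module system would presumably not reach the Python interpreter that runs the crosstool wrapper. A minimal diagnostic sketch, assuming `ldd` is available on the node; the function names and the sample text are hypothetical and not taken from this log:

```python
import subprocess


def ldd_output(path):
    """Run `ldd` on a binary and return its text output (assumes ldd is on PATH)."""
    result = subprocess.run(["ldd", path], stdout=subprocess.PIPE,
                            universal_newlines=True)
    return result.stdout


def missing_shared_libs(ldd_text):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_text.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip().startswith("not found"):
            missing.append(name)
    return missing


# Hypothetical ldd output, shortened for illustration only.
sample = (
    "\tlibpython2.7.so.1.0 => not found\n"
    "\tlibpthread.so.0 => /lib64/libpthread.so.0 (0x0000000000000000)\n"
)
print(missing_shared_libs(sample))
```

Running `missing_shared_libs(ldd_output("/gpfs/runtime/opt/python/2.7.3/bin/python2.7"))` in a shell with a cleared environment (for example under `env -i`) should show whether the interpreter can still resolve its libraries without the module-provided variables.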

kskp commented 8 years ago

@rdipietro Hi, I have tried everything you suggested here, including changing the CROSSTOOL files, but it still does not work. I started fresh and believe I now have bazel working. Could you please look at my description at https://github.com/tensorflow/models/issues/276 and suggest something? Thanks a lot!

rdipietro commented 8 years ago

I really don't know what to suggest, other than perhaps asking the TensorFlow team to build binaries for CentOS 6.7. I think this would save a lot of people a lot of trouble with every new release, but I don't know if they're willing to do it.

kskp commented 8 years ago

@rdipietro Sorry, but didn't you mention you had TensorFlow running on CentOS 6.7 with gcc 4.8.2? Were you able to run SyntaxNet as well? I am stuck with a CentOS 6.6 cluster and need to get SyntaxNet running on it. It works fine on CentOS 7. :(

cirocavani commented 8 years ago

@kskp I created a Dockerfile that compiles TensorFlow 0.9 (CPU) for CentOS 6. I tested it on CentOS 6 and Red Hat EL 6.5. You can use a standalone machine to generate the TensorFlow package and then test it at your site. (The standalone machine needs Docker; I tested on Linux and on macOS with Docker for Mac installed.)

https://github.com/cirocavani/tensorflow-poc/tree/master/tensorflow_centos6

(main.sh is the procedure script)

I also made an installer for TensorFlow with Miniconda2 that runs on Red Hat 6.5 without any prerequisite software.

https://github.com/cirocavani/tensorflow-poc/tree/master/tensorflow_installer

(main.sh is the procedure script)

This procedure creates the installer file tensorflow.sh with Miniconda2, TensorFlow 0.9, dependencies, and a Python program (executing this file will install Miniconda, install TensorFlow, and run the training script).

My main use case is running TensorFlow on Hadoop (Red Hat EL 6.5). There is another POC for this:

https://github.com/cirocavani/tensorflow-poc/tree/master/yarn_training

With this setup, I am running the TF Learn's Wide and Deep Example in Hadoop.

zym1010 commented 7 years ago

I have succeeded in compiling a GPU, Python 3.5 version of TensorFlow 0.10.0 in a CentOS 6 Docker image, and it ran well on our university's CentOS 6 cluster. Check https://github.com/leelabcnbc/DevOps/tree/master/Docker/tensorflow/0.10.0/centos6/py35. Basically, it comes down to replacing some hardcoded lines in CROSSTOOL-related items and adding -lm to everything to prevent errors like #2291. I think Google could make compiling TensorFlow on CentOS less frustrating if they made the hardcoded paths point to the correct locations.
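The "add -lm to everything" step can be sketched as a small text patch. This is only an illustration, not the script from the repository linked above: it assumes the CROSSTOOL file is a text-format file containing lines of the form `linker_flag: "-lstdc++"`, and the function name `add_linker_flags` is made up for this example.

```python
import re


def add_linker_flags(text, extra_flags=("-lm",), anchor_flag="-lstdc++"):
    """Insert extra linker_flag lines after each line carrying anchor_flag.

    A flag is skipped if an identical line (same indentation) already
    exists anywhere in the file, so re-running the patch is harmless.
    """
    pattern = re.compile(
        r'^(\s*)linker_flag:\s*"%s"\s*$' % re.escape(anchor_flag))
    lines = text.splitlines()
    out = []
    for line in lines:
        out.append(line)
        match = pattern.match(line)
        if not match:
            continue
        indent = match.group(1)
        for flag in extra_flags:
            new_line = '%slinker_flag: "%s"' % (indent, flag)
            if new_line not in lines:
                out.append(new_line)
    return "\n".join(out) + "\n"


# Hypothetical CROSSTOOL fragment, for illustration only.
sample = (
    'toolchain {\n'
    '  linker_flag: "-lstdc++"\n'
    '  linker_flag: "-B/usr/bin/"\n'
    '}\n'
)
print(add_linker_flags(sample, ("-lrt", "-lm")))
```

Patching the text rather than editing by hand makes it easier to reapply the change after a fresh checkout, though the exact file and the anchor line may differ between TensorFlow versions.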

i3v commented 7 years ago

I've just managed to build TensorFlow 0.12rc0 on CentOS 6.5, which only had the gcc 4.4.7 compiler by default, without root privileges. (At least, it successfully passes most simple tests, like this one.)

In short, I had to:

  1. Build a newer gcc, hardcoding the paths to as, ld, and nm (a workaround for "gcc: error trying to exec 'as': execvp: No such file or directory").

  2. Since I used a gcc installed in my own $HOME, I had to explicitly specify the correct linker library directories here (a workaround for "version 'GLIBCXX_3.4.20' not found (required by bazel-out/host/bin/external/protobuf/protoc)").

  3. Add the -lrt and -lm linker flags in the same place (as suggested by @zym1010).

The same story, with a few more details.
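To see whether a given libstdc++ would satisfy the `GLIBCXX_3.4.20` requirement from step 2, one option is to list the version tags it exports (for example with `strings /path/to/libstdc++.so.6 | grep GLIBCXX`) and compare them. A rough sketch, assuming that text has already been captured; the helper names and the sample input are hypothetical:

```python
import re


def glibcxx_versions(strings_text):
    """Return the sorted GLIBCXX version tuples found in the given text."""
    found = set(re.findall(r"GLIBCXX_(\d+(?:\.\d+)*)", strings_text))
    return sorted(tuple(int(part) for part in v.split(".")) for v in found)


def provides(strings_text, required="3.4.20"):
    """True if the exact required GLIBCXX version tag is present."""
    want = tuple(int(part) for part in required.split("."))
    return want in glibcxx_versions(strings_text)


# Hypothetical, shortened symbol listing from an older libstdc++.
sample = (
    "GLIBCXX_3.4\n"
    "GLIBCXX_3.4.9\n"
    "GLIBCXX_3.4.13\n"
    "GLIBCXX_DEBUG_MESSAGE_LENGTH\n"
)
print(glibcxx_versions(sample)[-1], provides(sample))
```

If the system library lacks the tag but the self-built gcc's library has it, that would point at the library search directories as the thing to fix, which matches the workaround described in step 2.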

yliu120 commented 7 years ago

I built the latest TensorFlow (GitHub master branch) with GPU support at a supercomputing center (CentOS 6.7 with gcc 4.9.2, generally with a customized cc toolchain). I have written up some of the environment variable settings that are necessary for a successful build. Documenting it here for future reference:

http://biophysics.med.jhmi.edu/~yliu120/tensorflow.html

VittalP commented 7 years ago

Thanks @rdipietro! I have been able to successfully install r0.12 with Bazel 0.4.3 on a cluster. Some of your suggestions needed to be modified to account for changes in the new versions of TF and Bazel, but they provided a solid starting point. When I get the time, I will write up the changes that I had to make.