jax-ml / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0

JAX build on CentOS 7 (RHEL variant) #2083

Closed MilesCranmer closed 4 years ago

MilesCranmer commented 4 years ago

Hi,

I have been having trouble getting JAX to work on CentOS 7 and was wondering if I could have some help. If I manage to get it working I'll document how I did it here.

First, some build system stats:

Currently Loaded Modulefiles:
 1) cuda/10.1.105_418.39     3) gcc/8.3.0                     5) slurm/18.08.8
 2) cudnn/v7.6.2-cuda-10.1   4) lib/openblas/0.2.19-haswell   6) openmpi/1.10.7-hfi

conda: 4.8.1
python: 3.7.6

Following @shoyer's advice in #1948, I switched to a self-built bazel. This eliminated some issues but now I have others.

The advice in #1659 about setting the correct CUDA path did not apply to my setup.

For the record, I have TensorFlow working with GPUs from a wheel. I also have CPU-only JAX working from the conda-forge package.

Here is my current command and problem:

Command:

 python build/build.py --enable_march_native --enable_cuda --cuda_path /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39 --cudnn_path /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1 --bazel_path /mnt/home/mcranmer/Downloads/bazel/output/bazel 2>&1 > bazel_build_log.txt

Output:

WARNING: Output base '/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd' is on NFS. This may lead to surprising failures and undetermined behavior.
Starting local Bazel server and connecting to it...
WARNING: Output base '/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd' is on NFS. This may lead to surprising failures and undetermined behavior.
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'run' from /mnt/home/mcranmer/Downloads/jax/.bazelrc:
  Inherited 'build' options: --repo_env PYTHON_BIN_PATH=/mnt/home/mcranmer/miniconda3/envs/main2/bin/python --python_path=/mnt/home/mcranmer/miniconda3/envs/main2/bin/python --repo_env TF_NEED_CUDA=1 --distinct_host_configuration=false --copt=-Wno-sign-compare -c opt --apple_platform_type=macos --macos_minimum_os=10.9 --announce_rc --define=no_aws_support=true --define=no_gcp_support=true --define=no_hdfs_support=true --define=no_kafka_support=true --define=no_ignite_support=true --define=grpc_no_ares=true --spawn_strategy=standalone --strategy=Genrule=standalone --cxxopt=-std=c++14 --host_cxxopt=-std=c++14 --action_env CUDA_TOOLKIT_PATH=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39 --action_env CUDNN_INSTALL_PATH=/cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1
INFO: Found applicable config definition build:opt in file /mnt/home/mcranmer/Downloads/jax/.bazelrc: --copt=-march=native --host_copt=-march=native
INFO: Found applicable config definition build:mkl_open_source_only in file /mnt/home/mcranmer/Downloads/jax/.bazelrc: --define=tensorflow_mkldnn_contraction_kernel=1
INFO: Found applicable config definition build:cuda in file /mnt/home/mcranmer/Downloads/jax/.bazelrc: --crosstool_top=@local_config_cuda//crosstool:toolchain --define=using_cuda=true --define=using_cuda_nvcc=true
Loading: 
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
    currently loading: build
INFO: Call stack for the definition of repository 'local_config_cuda' which is a cuda_configure (rule definition at /mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl:1306:18):
 - /mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/tensorflow/workspace.bzl:87:5
 - /mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/tensorflow/workspace.bzl:77:5
 - /mnt/home/mcranmer/Downloads/jax/WORKSPACE:46:1
ERROR: An error occurred during the fetch of repository 'local_config_cuda':
   Traceback (most recent call last):
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1304
        _create_local_cuda_repository(<1 more arguments>)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1006, in _create_local_cuda_repository
        _get_cuda_config(repository_ctx)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 729, in _get_cuda_config
        find_cuda_config(repository_ctx, <1 more arguments>)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 709, in find_cuda_config
        auto_configure_fail(<1 more arguments>)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 340, in auto_configure_fail
        fail(<1 more arguments>)

Cuda Configuration Error: Failed to run find_cuda_config.py: Could not find any cublas_api.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
of:
        '/cm/local/apps/cmd/lib'
        '/cm/local/apps/mysql++/current/lib'
        '/lib'
        '/lib64'
        '/opt/dell/srvadmin/lib64'
        '/opt/dell/srvadmin/lib64/openmanage'
        '/opt/dell/srvadmin/lib64/openmanage/smpop'
        '/opt/dell/toolkit/bin'
        '/usr'
        '/usr/lib64//bind9-export'
        '/usr/lib64/R/lib'
        '/usr/lib64/atlas'
        '/usr/lib64/dyninst'
        '/usr/lib64/mysql'
        '/usr/lib64/octave/3.8.2'
        '/usr/lib64/tcl8.5'
        '/usr/lib64/vtk'

ERROR: Skipping ':install_xla_in_source_tree': no such package '@local_config_cuda//cuda': Traceback (most recent call last):
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1304
        _create_local_cuda_repository(<1 more arguments>)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1006, in _create_local_cuda_repository
        _get_cuda_config(repository_ctx)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 729, in _get_cuda_config
        find_cuda_config(repository_ctx, <1 more arguments>)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 709, in find_cuda_config
        auto_configure_fail(<1 more arguments>)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 340, in auto_configure_fail
        fail(<1 more arguments>)

Cuda Configuration Error: Failed to run find_cuda_config.py: Could not find any cublas_api.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
of:
        '/cm/local/apps/cmd/lib'
        '/cm/local/apps/mysql++/current/lib'
        '/lib'
        '/lib64'
        '/opt/dell/srvadmin/lib64'
        '/opt/dell/srvadmin/lib64/openmanage'
        '/opt/dell/srvadmin/lib64/openmanage/smpop'
        '/opt/dell/toolkit/bin'
        '/usr'
        '/usr/lib64//bind9-export'
        '/usr/lib64/R/lib'
        '/usr/lib64/atlas'
        '/usr/lib64/dyninst'
        '/usr/lib64/mysql'
        '/usr/lib64/octave/3.8.2'
        '/usr/lib64/tcl8.5'
        '/usr/lib64/vtk'

WARNING: Target pattern parsing failed.
ERROR: no such package '@local_config_cuda//cuda': Traceback (most recent call last):
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1304
        _create_local_cuda_repository(<1 more arguments>)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1006, in _create_local_cuda_repository
        _get_cuda_config(repository_ctx)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 729, in _get_cuda_config
        find_cuda_config(repository_ctx, <1 more arguments>)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 709, in find_cuda_config
        auto_configure_fail(<1 more arguments>)
    File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 340, in auto_configure_fail
        fail(<1 more arguments>)

Cuda Configuration Error: Failed to run find_cuda_config.py: Could not find any cublas_api.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
of:
        '/cm/local/apps/cmd/lib'
        '/cm/local/apps/mysql++/current/lib'
        '/lib'
        '/lib64'
        '/opt/dell/srvadmin/lib64'
        '/opt/dell/srvadmin/lib64/openmanage'
        '/opt/dell/srvadmin/lib64/openmanage/smpop'
        '/opt/dell/toolkit/bin'
        '/usr'
        '/usr/lib64//bind9-export'
        '/usr/lib64/R/lib'
        '/usr/lib64/atlas'
        '/usr/lib64/dyninst'
        '/usr/lib64/mysql'
        '/usr/lib64/octave/3.8.2'
        '/usr/lib64/tcl8.5'
        '/usr/lib64/vtk'

INFO: Elapsed time: 618.998s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
ERROR: Build failed. Not running target
FAILED: Build did NOT complete successfully (0 packages loaded)

     _   _  __  __
    | | / \ \ \/ /
 _  | |/ _ \ \  /
| |_| / ___ \/  \
 \___/_/   \/_/\_\

Bazel binary path: /mnt/home/mcranmer/Downloads/bazel/output/bazel
Python binary path: /mnt/home/mcranmer/miniconda3/envs/main2/bin/python
MKL-DNN enabled: yes
-march=native: yes
CUDA enabled: yes
CUDA toolkit path: /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39
CUDNN library path: /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1

Building XLA and installing it in the jaxlib source tree...
/mnt/home/mcranmer/Downloads/bazel/output/bazel run --verbose_failures=true --config=opt --config=mkl_open_source_only --config=cuda --define=xla_python_enable_gpu=true :install_xla_in_source_tree /mnt/home/mcranmer/Downloads/jax/build
Traceback (most recent call last):
  File "build/build.py", line 351, in <module>
    main()
  File "build/build.py", line 346, in main
    shell(command)
  File "build/build.py", line 50, in shell
    output = subprocess.check_output(cmd)
  File "/mnt/home/mcranmer/miniconda3/envs/main2/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/mnt/home/mcranmer/miniconda3/envs/main2/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/mnt/home/mcranmer/Downloads/bazel/output/bazel', 'run', '--verbose_failures=true', '--config=opt', '--config=mkl_open_source_only', '--config=cuda', '--define=xla_python_enable_gpu=true', ':install_xla_in_source_tree', '/mnt/home/mcranmer/Downloads/jax/build']' returned non-zero exit status 1.

I've also seen some other build issues - they seem to come and go. I'll document them if I save the log next time.

Please let me know if you have any tips. Thanks! Miles

shoyer commented 4 years ago

Do you know if you have cublas installed on your system? That's what your error message is about.

MilesCranmer commented 4 years ago

It should be all there:

(main2) ➜   ls | grep cublas
cublas_api.h
cublas.h
cublasLt.h
cublas_v2.h
cublasXt.h
(main2) ➜   pwd
/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39/targets/x86_64-linux/include

Looking through https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/cuda_configure.bzl, I started to wonder if some of these environment variables would be needed. I set this one:

export TF_CUDA_PATHS=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39

and now I can get a bit further in the build process. Will update this post when it's finished (hopefully successfully).
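A rough way to see why TF_CUDA_PATHS matters: the error lists a fixed set of subdirectories that are tried under each base path, and none of the base paths in my failing run was the CUDA root. The sketch below is not find_cuda_config.py itself. It is a simplified stand-in that assumes the search is just "base path + subdirectory + header name"; find_header and SUBDIRS are names made up for this illustration, and the demo runs on a throwaway directory tree rather than a real CUDA install.

```python
import glob
import os
import tempfile

# Subdirectories copied from the error message above.
SUBDIRS = ["", "include", "include/cuda", "include/*-linux-gnu",
           "extras/CUPTI/include", "include/cuda/CUPTI"]

def find_header(bases, header="cublas_api.h", subdirs=SUBDIRS):
    """Return every existing path of the form base/subdir/header."""
    hits = []
    for base in bases:
        for sub in subdirs:
            # glob expands the '*-linux-gnu' wildcard and drops non-matches.
            hits.extend(sorted(glob.glob(os.path.join(base, sub, header))))
    return hits

# Demo on a temporary tree so it runs on any machine.
with tempfile.TemporaryDirectory() as tmp:
    cuda_root = os.path.join(tmp, "cuda")  # stands in for the CUDA root
    other = os.path.join(tmp, "usr")       # stands in for a default base path
    os.makedirs(os.path.join(cuda_root, "include"))
    os.makedirs(other)
    open(os.path.join(cuda_root, "include", "cublas_api.h"), "w").close()

    without_root = find_header([other])
    with_root = find_header([other, cuda_root])

print(without_root)
print(with_root)
```

Calling find_header with the real base paths from the log and then with the CUDA root added should show the same difference, assuming the root has an include folder that holds the cuBLAS headers.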

MilesCranmer commented 4 years ago

Okay, I finally built JAX on CentOS 7 and confirmed GPU support!! 🎉🎉🎉🎉

(I used the Bazel built by the JAX installer, rather than the one I built from source in my earlier attempt.)

Here's my solution.

  1. Make sure you have CUDA, cuDNN, gcc, and OpenMPI installed.
  2. Make sure the relevant folders are on your LIBRARY_PATH/LD_LIBRARY_PATH/CPATH/PATH environment variables.
  3. Find your CUDA toolkit: if you run echo $LD_LIBRARY_PATH, the toolkit is the folder just before lib64 in the entry for the CUDA files. My lib64 is here: /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39/lib64.
  4. Add the folder before lib64 to TF_CUDA_PATHS with
    export TF_CUDA_PATHS=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39

    This is the folder with the following (yours might be slightly different, but should have include and lib64):

    (main2) ➜  ~ ls /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39
    bin       extras   lib64      NsightCompute-2019.1  nvml     share    tools
    doc       include  libnsight  nsightee_plugins      nvvm     src      version.txt
    EULA.txt  jre      libnvvp    NsightSystems-2018.3  samples  targets
  5. Find your cuDNN installation folder. Mine is here: /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1. This folder contains: include and lib64.
  6. Go to the jax folder which you can get with git clone --depth=1 https://github.com/google/jax
  7. In the jax folder, run the following command (change these folders to yours):
    python build/build.py --enable_march_native --enable_cuda --cuda_path /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39 --cudnn_path /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1

    For your system, change /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39 to the folder you set in TF_CUDA_PATHS, and /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1 to the folder you found for cuDNN.

  8. This will take a long time to compile. After it has finished successfully, follow the commands in the wiki to install it: first pip install -e build, then pip install -e . (the trailing dot is part of the command).
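
Steps 3 to 7 can be sanity-checked before starting the long build. The sketch below is only an illustration under stated assumptions: check_layout and build_command are hypothetical helper names, the "include and lib64" layout check comes from steps 4 and 5, and the demo uses a temporary directory tree instead of real installs. The flags are the ones from the command in step 7.

```python
import os
import tempfile

def check_layout(root, required=("include", "lib64")):
    """Return the required subfolders that are missing under root."""
    return [d for d in required if not os.path.isdir(os.path.join(root, d))]

def build_command(cuda_path, cudnn_path):
    """Assemble the build.py invocation from step 7 as an argument list."""
    return ["python", "build/build.py", "--enable_march_native", "--enable_cuda",
            "--cuda_path", cuda_path, "--cudnn_path", cudnn_path]

# Demo on a throwaway tree: "cuda" has both folders, "cudnn" lacks lib64.
with tempfile.TemporaryDirectory() as tmp:
    cuda = os.path.join(tmp, "cuda")
    cudnn = os.path.join(tmp, "cudnn")
    for d in ("include", "lib64"):
        os.makedirs(os.path.join(cuda, d))
    os.makedirs(os.path.join(cudnn, "include"))
    missing_cuda = check_layout(cuda)
    missing_cudnn = check_layout(cudnn)

# Paths from my system; replace them with yours.
cuda_path = "/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39"
cudnn_path = "/cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1"
env = dict(os.environ, TF_CUDA_PATHS=cuda_path)  # step 4
cmd = build_command(cuda_path, cudnn_path)       # step 7

print(missing_cuda, missing_cudnn)
print(" ".join(cmd))
```

To actually start the build from Python, cmd and env could be passed to subprocess.run(cmd, env=env, check=True) from inside the jax folder; running the printed command in a shell after the export from step 4 is equivalent.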

skye commented 4 years ago

Thank you for sharing these great instructions!

HHalva commented 4 years ago

This didn't work for me on CentOS 7; it looks like a Bazel issue. Did you build Bazel from source or just install it using yum? And which version of Bazel did you use?

MilesCranmer commented 4 years ago

I think the JAX installer will build its own Bazel. I ended up just using that one.

HHalva commented 4 years ago

Ah I see, thanks for clarifying - got confused by your comment in the original post that you tried an externally built one.

MilesCranmer commented 4 years ago

Oops, sorry, let me make it clear in the instructions for future users!

ElhamSol commented 4 years ago

Hi, thank you for sharing. I couldn't get it to work. Could you explain what you mean by "Make sure the relevant folders are on your LIBRARY_PATH/LD_LIBRARY_PATH/CPATH/PATH environment variables."? Also, I don't have a cudnn directory. The cudnn.h file is in /usr/include and I'm using that.

HHalva commented 4 years ago

"Make sure the relevant folders are on your LIBRARY_PATH/LD_LIBRARY_PATH/CPATH/PATH environment variables."?

He means that the folders for all the requirements from the step above ("CUDA, cuDNN, gcc, openMPI installed.") should be on those environment variables. In your case, if you run e.g. echo $LD_LIBRARY_PATH, the output should list all the folders that contain those prerequisites, including the one for cuDNN. E.g. on my server:

Currently Loaded Modules:
  1) CUDA/10.0.130                  3) GCCcore/8.3.0               5) binutils/2.32-GCCcore-8.3.0   7) numactl/2.0.12-GCCcore-8.3.0   9) libxml2/2.9.9-GCCcore-8.3.0      11) hwloc/2.0.3-GCCcore-8.3.0
  2) cuDNN/7.6.4.38-CUDA-10.0.130   4) zlib/1.2.11-GCCcore-8.3.0   6) GCC/8.3.0-2.32                8) XZ/5.2.4-GCCcore-8.3.0        10) libpciaccess/0.14-GCCcore-8.3.0  12) OpenMPI/4.0.1-GCC-8.3.0-2.32

And:

echo $LD_LIBRARY_PATH
/blablab/el7/OpenMPI/4.0.1-GCC-8.3.0-2.32/lib:/blbalba/el7/hwloc/2.0.3-GCCcore-8.3.0/lib:/appl/opt/libpciaccess/0.14-GCCcore-8.3.0/lib:/blbalba/libxml2/2.9.9-GCCcore-8.3.0/lib:/blbalba/XZ/5.2.4-GCCcore-8.3.0/lib:/blbalba/numactl/2.0.12-GCCcore-8.3.0/lib:/blbalba/binutils/2.32-GCCcore-8.3.0/lib:/blbalba/zlib/1.2.11-GCCcore-8.3.0/lib:/blbalba/GCCcore/8.3.0/lib64:/blbalba/GCCcore/8.3.0/lib:/blbalba/cuDNN/7.6.4.38-CUDA-10.0.130/lib64:/blbalba/CUDA/10.0.130/nvvm/lib64:/blbalba/CUDA/10.0.130/extras/CUPTI/lib64:/blbalba/CUDA/10.0.130/lib64:/blbalba/centos/usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/lib:/blbalba/centos/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/lib:/blbalba/centos/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib:

So you can see the cuDNN path here is: /blbalba/cuDNN/7.6.4.38-CUDA-10.0.130
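
If the variable is long, the same lookup can be scripted. This is a heuristic sketch, not anything JAX or Bazel does: candidate_roots is a hypothetical helper that keeps entries ending in lib64, takes the folder before lib64, and sorts by name only (anything containing "cudnn" or "cuda", case-insensitive). The sample string is an abridged copy of the output above.

```python
import os

def candidate_roots(ld_library_path):
    """Group the parents of lib64 entries into cuDNN and CUDA candidates."""
    found = {"cudnn": [], "cuda": []}
    for entry in ld_library_path.split(":"):
        entry = entry.rstrip("/")
        if os.path.basename(entry) != "lib64":
            continue
        parent = os.path.dirname(entry)  # "the folder before lib64"
        lowered = parent.lower()
        # Check "cudnn" first: cuDNN folder names often contain "cuda" too.
        if "cudnn" in lowered:
            found["cudnn"].append(parent)
        elif "cuda" in lowered:
            found["cuda"].append(parent)
    # Shortest path first, since nested folders like nvvm also have a lib64.
    for paths in found.values():
        paths.sort(key=len)
    return found

sample = (
    "/blbalba/GCCcore/8.3.0/lib64:"
    "/blbalba/cuDNN/7.6.4.38-CUDA-10.0.130/lib64:"
    "/blbalba/CUDA/10.0.130/nvvm/lib64:"
    "/blbalba/CUDA/10.0.130/extras/CUPTI/lib64:"
    "/blbalba/CUDA/10.0.130/lib64:"
)
roots = candidate_roots(sample)
print(roots["cudnn"])
print(roots["cuda"])
```

On a real machine, candidate_roots(os.environ.get("LD_LIBRARY_PATH", "")) gives the candidates to try for --cudnn_path and --cuda_path. An install that lives in system folders such as /usr/include will not show up this way.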

cagrikymk commented 3 years ago

Thank you for the detailed explanation!

While trying to compile the library with Bazel on an HPC cluster, I got the following error after 10+ minutes of compilation. For compatibility reasons, I am trying to compile an older JAX version (https://github.com/google/jax/commit/de645c5b8bba6c8d3e0c82d7f8f62cdde137bbcb):

/home-084/username/jax_0_76/jax/jaxlib/BUILD:92:1: Linking of rule '//jaxlib:pytree.so' failed (Exit 1)
....
....
/opt/software/GCCcore/8.3.0/lib/gcc/x86_64-pc-linux-gnu/8.3.0/libgcc.a: error adding symbols: File format not recognized

These are the modules I loaded: GCCcore/8.3.0 GCC/8.3.0
CUDA/10.1.243 cuDNN/7.6.4.38
OpenMPI/3.1.4 imkl/2019.5.281

I am aware this is not exactly JAX-related, but I wanted to try my luck here.