Closed: MilesCranmer closed this issue 4 years ago.
Do you know if you have cublas installed on your system? That's what your error message is about.
It should be all there:
(main2) ➜ ls | grep cublas
cublas_api.h
cublas.h
cublasLt.h
cublas_v2.h
cublasXt.h
(main2) ➜ pwd
/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39/targets/x86_64-linux/include
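The headers above cover the include side, but the link error is usually about the shared libraries, which normally sit under the CUDA root's lib64. Here is a sketch of that check; since I can't assume your CUDA root, the snippet mimics the layout with a scratch directory — on a real system, point `find` at your actual CUDA path instead:

```shell
# Mimic a CUDA root layout in a scratch directory (stand-in only;
# substitute your real CUDA root for "$root" on a live system).
root=$(mktemp -d)
mkdir -p "$root/lib64"
touch "$root/lib64/libcublas.so.10"   # stand-in for the real library
# The actual check: do any cublas shared libraries exist under the root?
find "$root" -name 'libcublas*'
```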
Looking through https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/cuda_configure.bzl, I started to wonder whether some of these environment variables were needed. I set this one:
export TF_CUDA_PATHS=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39
and now I can get a bit further in the build process. Will update this post when it's finished (hopefully successfully).
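If you are unsure which folder to pass, one way to find it (a sketch, assuming nvcc is on your PATH and lives at `<cuda-root>/bin/nvcc`) is to derive the root from the nvcc location:

```shell
# Derive the CUDA root by stripping the /bin/nvcc suffix from the nvcc
# path. Hard-coding my nvcc path here for illustration; on a live system
# you would use: nvcc_path=$(command -v nvcc)
nvcc_path=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39/bin/nvcc
cuda_root=${nvcc_path%/bin/nvcc}
echo "$cuda_root"
export TF_CUDA_PATHS="$cuda_root"
```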
Okay, I finally built JAX on CentOS 7 and confirmed GPU support!! 🎉🎉🎉🎉
(I used the Bazel built by the JAX installer, rather than one I built from source as I had previously tried.)
Here's my solution:

1. Have CUDA, cuDNN, gcc, openMPI installed. Make sure the relevant folders are on your LIBRARY_PATH/LD_LIBRARY_PATH/CPATH/PATH environment variables. Check echo $LD_LIBRARY_PATH; it should contain the folder before lib64 for the CUDA files. Mine is here: /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39/lib64.
2. export TF_CUDA_PATHS=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39. This is the folder with the following (yours might be slightly different, but should have include and lib64):
(main2) ➜ ~ ls /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39
bin extras lib64 NsightCompute-2019.1 nvml share tools
doc include libnsight nsightee_plugins nvvm src version.txt
EULA.txt jre libnvvp NsightSystems-2018.3 samples targets
3. Find your cuDNN folder. Mine is /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1. This folder contains: include and lib64.
4. git clone --depth=1 https://github.com/google/jax
5. python build/build.py --enable_march_native --enable_cuda --cuda_path /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39 --cudnn_path /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1
For your system, change /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39 to the folder you passed to TF_CUDA_PATHS, and /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1 to the folder you found for cuDNN.
6. pip install -e build, and then pip install -e .
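Gathering the steps above into one script (a sketch, not something you can run unmodified — the CUDA and cuDNN paths are from my system, so substitute your own roots):

```shell
# Build recipe sketch; requires CUDA/cuDNN on the machine, so it only
# runs on a suitably configured system.
export TF_CUDA_PATHS=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39
git clone --depth=1 https://github.com/google/jax
cd jax
python build/build.py --enable_march_native --enable_cuda \
    --cuda_path "$TF_CUDA_PATHS" \
    --cudnn_path /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1
pip install -e build   # installs jaxlib
pip install -e .       # installs jax itself
```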
Thank you for sharing these great instructions!
This didn't work for me on CentOS 7; it looks like a Bazel issue. Did you build Bazel from source, or just install it with yum? And which version of Bazel did you use?
I think the JAX installer will build its own Bazel. I ended up just using that one.
Ah I see, thanks for clarifying - got confused by your comment in the original post that you tried an externally built one.
Oops, sorry, let me make it clear in the instructions for future users!
Hi, thank you for sharing. I couldn't get it to work. Could you explain what you mean by "Make sure the relevant folders are on your LIBRARY_PATH/LD_LIBRARY_PATH/CPATH/PATH environment variables."? Also, I don't have a cudnn directory. The cudnn.h file is in /usr/include and I'm using that.
"Make sure the relevant folders are on your LIBRARY_PATH/LD_LIBRARY_PATH/CPATH/PATH environment variables."?
He means that the folders for all the requirements from the step above ("CUDA, cuDNN, gcc, openMPI installed.") should be on those environment variables. In your case, if you run echo $LD_LIBRARY_PATH, you should see all the folders [which contain those prerequisites] on that path; for instance, cuDNN should be there. E.g. on my server:
Currently Loaded Modules:
1) CUDA/10.0.130 3) GCCcore/8.3.0 5) binutils/2.32-GCCcore-8.3.0 7) numactl/2.0.12-GCCcore-8.3.0 9) libxml2/2.9.9-GCCcore-8.3.0 11) hwloc/2.0.3-GCCcore-8.3.0
2) cuDNN/7.6.4.38-CUDA-10.0.130 4) zlib/1.2.11-GCCcore-8.3.0 6) GCC/8.3.0-2.32 8) XZ/5.2.4-GCCcore-8.3.0 10) libpciaccess/0.14-GCCcore-8.3.0 12) OpenMPI/4.0.1-GCC-8.3.0-2.32
And:
echo $LD_LIBRARY_PATH
/blablab/el7/OpenMPI/4.0.1-GCC-8.3.0-2.32/lib:/blbalba/el7/hwloc/2.0.3-GCCcore-8.3.0/lib:/appl/opt/libpciaccess/0.14-GCCcore-8.3.0/lib:/blbalba/libxml2/2.9.9-GCCcore-8.3.0/lib:/blbalba/XZ/5.2.4-GCCcore-8.3.0/lib:/blbalba/numactl/2.0.12-GCCcore-8.3.0/lib:/blbalba/binutils/2.32-GCCcore-8.3.0/lib:/blbalba/zlib/1.2.11-GCCcore-8.3.0/lib:/blbalba/GCCcore/8.3.0/lib64:/blbalba/GCCcore/8.3.0/lib:/blbalba/cuDNN/7.6.4.38-CUDA-10.0.130/lib64:/blbalba/CUDA/10.0.130/nvvm/lib64:/blbalba/CUDA/10.0.130/extras/CUPTI/lib64:/blbalba/CUDA/10.0.130/lib64:/blbalba/centos/usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/lib:/blbalba/centos/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/lib:/blbalba/centos/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib:
So you can see the cuDNN path here is: /blbalba/cuDNN/7.6.4.38-CUDA-10.0.130
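One way to script that check is a small helper (my own sketch, not part of any tool) that looks for a directory entry in a colon-separated variable such as LD_LIBRARY_PATH:

```shell
# Print "yes" if directory $1 appears as an entry in the
# colon-separated path list $2, else "no".
path_contains() {
  case ":$2:" in
    *":$1:"*) echo yes ;;
    *)        echo no ;;
  esac
}

# Example with a made-up path list; on a real system you would pass
# "$LD_LIBRARY_PATH" as the second argument.
libs="/blbalba/CUDA/10.0.130/lib64:/blbalba/cuDNN/7.6.4.38-CUDA-10.0.130/lib64"
path_contains /blbalba/cuDNN/7.6.4.38-CUDA-10.0.130/lib64 "$libs"   # yes
path_contains /somewhere/else/lib "$libs"                           # no
```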
Thank you for the detailed explanation!
While trying to compile the library with Bazel on an HPC cluster, I got the following error after 10+ minutes of compilation. For compatibility reasons, I am trying to compile an older JAX version (https://github.com/google/jax/commit/de645c5b8bba6c8d3e0c82d7f8f62cdde137bbcb):
/home-084/username/jax_0_76/jax/jaxlib/BUILD:92:1: Linking of rule '//jaxlib:pytree.so' failed (Exit 1) .... .... /opt/software/GCCcore/8.3.0/lib/gcc/x86_64-pc-linux-gnu/8.3.0/libgcc.a: error adding symbols: File format not recognized
These are the modules I loaded:
GCCcore/8.3.0 GCC/8.3.0
CUDA/10.1.243 cuDNN/7.6.4.38
OpenMPI/3.1.4 imkl/2019.5.281
I am aware this is not exactly JAX-related, but I wanted to try my luck here.
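One thing worth checking for that "File format not recognized" error: whether the libgcc.a on your linker path is actually a valid static archive for your target (rather than, say, an archive built for a different architecture). I can't assume your paths, so this sketch builds a tiny valid archive and shows the magic bytes a healthy one starts with — on your system, inspect the real libgcc.a from the error message instead (e.g. with `file`, which should report "current ar archive"):

```shell
# A healthy static archive begins with the 8-byte ar magic "!<arch>\n".
# (On your system you could also run: file /path/to/libgcc.a)
printf 'placeholder member\n' > member.txt
ar rcs libdemo.a member.txt      # build a tiny valid archive
head -c 7 libdemo.a ; echo       # prints the magic: !<arch>
```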
Hi,
I have been having trouble getting JAX to work on CentOS 7 and was wondering if I could have some help. If I manage to get it working I'll document how I did it here.
First, some build system stats:
Following @shoyer's advice in #1948, I switched to a self-built bazel. This eliminated some issues but now I have others.
The advice in #1659 of setting the correct cuda path was not applicable to me.
For the record, I have TensorFlow working with GPUs from a wheel. I also have JAX working CPU-only from the conda-forge version.
Here is my current command and problem:
Command:
Output:
I've also seen some other build issues - they seem to come and go. I'll document them if I save the log next time.
Please let me know if you have any tips. Thanks! Miles