jax-ml / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0

Error compile: --bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" issues with clang & gcc #23575

Open benkirk opened 2 months ago

benkirk commented 2 months ago

Description

I'm attempting to build jaxlib with a local CUDA, cuDNN, and NCCL installation. I'm running into (different) issues with either gcc or clang. Any ideas?

Build command:

python build/build.py \
       --build_gpu_plugin --gpu_plugin_cuda_version=12 \
       --verbose \
       --enable_mkl_dnn \
       --enable_nccl \
       --enable_cuda \
       --cuda_compute_capabilities 8.0 \
       --target_cpu_features release \
       --bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" \
       --bazel_options=--repo_env=LOCAL_CUDNN_PATH="${NCAR_ROOT_CUDNN}" \
       --bazel_options=--repo_env=LOCAL_NCCL_PATH="${PREFIX}"

clang error:

external/tsl/tsl/profiler/lib/nvtx_utils.cc:32:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found

gcc error:

# Configuration: d3d6c18c79c5128461901902331e6ad5ab5bc83fb9ca1bc29bc506f7fe919c16
# Execution platform: @local_execution_config_platform//:platform
gcc: error: unrecognized command-line option '--cuda-path=external/cuda_nvcc'

System info (python version, jaxlib version, accelerator, etc.)

jax:    0.4.31
jaxlib: 0.4.31
numpy:  2.1.1
python: 3.11.10 | packaged by conda-forge | (main, Sep 10 2024, 11:01:28) [GCC 13.3.0]
jax.devices (2 total, 2 local): [CudaDevice(id=0) CudaDevice(id=1)]
process_count: 1
platform: uname_result(system='Linux', node='derecho7', release='5.14.21-150400.24.18-default', version='#1 SMP PREEMPT_DYNAMIC Thu Aug 4 14:17:48 UTC 2022 (e9f7bfc)', machine='x86_64')

$ nvidia-smi
Wed Sep 11 12:37:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:03:00.0 Off |                    0 |
| N/A   51C    P0              68W / 300W |    429MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off | 00000000:C3:00.0 Off |                    0 |
| N/A   53C    P0              75W / 300W |    429MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     54137      C   python                                      416MiB |
|    1   N/A  N/A     54137      C   python                                      416MiB |
+---------------------------------------------------------------------------------------+
johnnynunez commented 2 months ago

Unfortunately, you have to specify the CUDA and cuDNN versions explicitly; clang does not detect them automatically. If you are using this kind of setup, it may be better to use JAX Toolbox: https://github.com/NVIDIA/JAX-Toolbox
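
As a rough sketch, passing the versions explicitly might look like the following (HERMETIC_CUDA_VERSION and HERMETIC_CUDNN_VERSION are the repository environment variables used by XLA's hermetic CUDA rules; the version strings below are placeholders to adapt):

python build/build.py \
       --enable_cuda \
       --bazel_options=--repo_env=HERMETIC_CUDA_VERSION=12.2.1 \
       --bazel_options=--repo_env=HERMETIC_CUDNN_VERSION=9.2.0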

ybaturina commented 2 months ago

Hi @benkirk, I'm going to update the JAX docs with a link to the XLA instructions.

From your command, I see that you provided these environment variables:

--bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" \
--bazel_options=--repo_env=LOCAL_CUDNN_PATH="${NCAR_ROOT_CUDNN}" \
--bazel_options=--repo_env=LOCAL_NCCL_PATH="${PREFIX}"

Could you provide the values of ${CUDA_HOME}, ${NCAR_ROOT_CUDNN}, and ${PREFIX} here, please?

johnnynunez commented 2 months ago

Hi @benkirk, I'm going to update the JAX docs with a link to the XLA instructions.

From your command, I see that you provided these environment variables:

--bazel_options=--repo_env=LOCAL_CUDA_PATH="${CUDA_HOME}" \
--bazel_options=--repo_env=LOCAL_CUDNN_PATH="${NCAR_ROOT_CUDNN}" \
--bazel_options=--repo_env=LOCAL_NCCL_PATH="${PREFIX}"

Could you provide the values of ${CUDA_HOME}, ${NCAR_ROOT_CUDNN}, and ${PREFIX} here, please?

The problem is here: https://github.com/openxla/xla/issues/16877

This is how I avoided a lot of problems: https://github.com/dusty-nv/jetson-containers/pull/626

johnnynunez commented 2 months ago

Also, this is necessary: https://github.com/NVIDIA/JAX-Toolbox/blob/main/.github/container/install-cudnn.sh and this: https://github.com/NVIDIA/JAX-Toolbox/blob/main/.github/container/build-jax.sh

ln -s /usr/local/cuda/lib64 /usr/local/cuda/lib
johnnynunez commented 2 months ago

I've updated the script so that it doesn't download the files:

#!/bin/bash

set -e

CUDNN_MAJOR_VERSION=9
CUDA_MAJOR_VERSION=12.2
prefix=/opt/nvidia/cudnn
arch=$(uname -m)-linux-gnu
cuda_base_path="/usr/local/cuda-${CUDA_MAJOR_VERSION}"

# Check whether the specified CUDA path exists
if [[ -d "${cuda_base_path}" ]]; then
  cuda_lib_path="${cuda_base_path}/lib64"
  output_path="/usr/local/cuda-${CUDA_MAJOR_VERSION}/lib"
else
  cuda_lib_path="/usr/local/cuda/lib64"
  output_path="/usr/local/cuda/lib"
fi

# Create the CUDA symbolic link (lib -> lib64)
sudo ln -s "${cuda_lib_path}" "${output_path}"

# Process the cuDNN files installed by the distro packages
for cudnn_file in $(dpkg -L libcudnn${CUDNN_MAJOR_VERSION} libcudnn${CUDNN_MAJOR_VERSION}-dev | sort -u); do
  if [[ -f "${cudnn_file}" || -h "${cudnn_file}" ]]; then
    nosysprefix="${cudnn_file#"/usr/"}"                              # strip the leading /usr/
    noarchinclude="${nosysprefix/#"include/${arch}"/include}"        # include/<arch> -> include
    noverheader="${noarchinclude/%"_v${CUDNN_MAJOR_VERSION}.h"/.h}"  # drop the _v<N> header suffix
    noarchlib="${noverheader/#"lib/${arch}"/lib}"                    # lib/<arch> -> lib

    # Link under cuda_base_path if it exists, otherwise under /usr/local/cuda/lib64
    if [[ -d "${cuda_base_path}" ]]; then
      link_name="${cuda_base_path}/${noarchlib}"
    else
      link_name="/usr/local/cuda/lib64/${noarchlib}"
    fi

    link_dir=$(dirname "${link_name}")
    mkdir -p "${link_dir}"
    ln -s "${cudnn_file}" "${link_name}"
  fi
done
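
As a quick sanity check after running the script, something like this should show the expected symlinks (paths are illustrative and depend on which branch the script took):

ls -l /usr/local/cuda-12.2/lib
ls -l /usr/local/cuda-12.2/lib/libcudnn*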
benkirk commented 2 months ago

Thank you both. In my case:

 --bazel_options=--repo_env=LOCAL_CUDA_PATH="/glade/u/apps/common/23.08/spack/opt/spack/cuda/12.2.1" \
 --bazel_options=--repo_env=LOCAL_CUDNN_PATH="/glade/u/apps/common/23.08/spack/opt/spack/cudnn/9.2.0.82-12" \
 --bazel_options=--repo_env=LOCAL_NCCL_PATH="<my_conda_build_prefix>"

I'll try providing the version strings on the command line as well and follow the XLA instructions.

Building from source without a container definitely wasn't my first choice, but we do need a site-provided NCCL on this machine: it has a proprietary vendor network, Slingshot 11, that needs some care and feeding.

johnnynunez commented 2 months ago

Yeah, but that won't work: as I mentioned before, CUDA needs lib rather than lib64, and cuDNN has to be renamed while keeping a certain directory structure. It's very tricky. In the 0.4.31 release it was easier with cuda_path etc., but now JAX uses XLA's hermetic CUDA, which handles everything automatically...
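
Putting those two constraints together, the layout the LOCAL_*_PATH options appear to expect would look roughly like this (a sketch inferred from the symlink workarounds above, not an authoritative spec):

<LOCAL_CUDA_PATH>/
    bin/        # nvcc, ptxas, ...
    include/    # cuda.h, ...
    lib/        # libcudart.so, ...  (lib, not lib64)
<LOCAL_CUDNN_PATH>/
    include/    # cudnn.h, ...
    lib/        # libcudnn.so, ...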

hawkinsp commented 2 months ago

@benkirk You don't need to build JAX from source to use a custom NCCL. We'll use whichever libnccl.so we find in your LD_LIBRARY_PATH.
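
For example, with the placeholder path below pointing at the site-provided NCCL installation, a pip-installed jaxlib should pick it up at runtime:

export LD_LIBRARY_PATH="/path/to/site-nccl/lib:${LD_LIBRARY_PATH}"
python -c "import jax; print(jax.devices())"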

benkirk commented 2 months ago

Thanks @hawkinsp, I've got my NCCL injected properly with jax[cuda12]==0.4.31 from pip; I had a few issues trying jax[cuda12_local]==0.4.31. I'll revisit that as an alternative, parallel path.

ybaturina commented 2 months ago

Yeah, but that won't work: as I mentioned before, CUDA needs lib rather than lib64, and cuDNN has to be renamed while keeping a certain directory structure. It's very tricky. In the 0.4.31 release it was easier with cuda_path etc., but now JAX uses XLA's hermetic CUDA, which handles everything automatically...

Hi @johnnynunez, I understand your concerns; I tried to address them in the comment here.