jax-ml / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0

ROCm build fails on Fedora 40 #22650

Open richardfoltyn opened 2 months ago

richardfoltyn commented 2 months ago

Description

I'm trying to build jaxlib-0.4.30 from source on Fedora 40 using ROCm 6.0.2 that comes in their standard repositories.

Fedora installs all ROCm libraries and headers directly into /usr, and these seem to be found correctly. However, the build fails because the ROCm device libraries are not found. I suspect these are the bitcode files installed in /usr/lib/clang/17/amdgcn/bitcode:

rpm -ql rocm-device-libs
/usr/lib/clang/17/amdgcn
/usr/lib/clang/17/amdgcn/bitcode
/usr/lib/clang/17/amdgcn/bitcode/asanrtl.bc
/usr/lib/clang/17/amdgcn/bitcode/hip.bc
/usr/lib/clang/17/amdgcn/bitcode/ockl.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_abi_version_400.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_abi_version_500.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_correctly_rounded_sqrt_off.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_daz_opt_off.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_daz_opt_on.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_finite_only_off.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_finite_only_on.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1010.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1011.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1012.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1013.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1030.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1031.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1032.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1033.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1034.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1035.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1036.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1100.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1101.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1102.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1103.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1150.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_1151.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_600.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_601.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_602.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_700.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_701.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_702.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_703.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_704.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_705.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_801.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_802.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_803.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_805.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_810.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_900.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_902.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_904.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_906.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_908.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_909.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_90a.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_90c.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_940.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_941.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_isa_version_942.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_unsafe_math_off.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_unsafe_math_on.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_wavefrontsize64_off.bc
/usr/lib/clang/17/amdgcn/bitcode/oclc_wavefrontsize64_on.bc
/usr/lib/clang/17/amdgcn/bitcode/ocml.bc
/usr/lib/clang/17/amdgcn/bitcode/opencl.bc
/usr/lib64/cmake/AMDDeviceLibs
/usr/lib64/cmake/AMDDeviceLibs/AMDDeviceLibsConfig.cmake
/usr/share/doc/rocm-device-libs
/usr/share/doc/rocm-device-libs/OCKL.md
/usr/share/doc/rocm-device-libs/OCML.md
/usr/share/doc/rocm-device-libs/README.md
/usr/share/licenses/rocm-device-libs
/usr/share/licenses/rocm-device-libs/LICENSE.TXT
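As a quick sanity check on a listing like the one above, the bitcode directory can be picked out mechanically. The following is only a minimal sketch: the helper name `bitcode_dirs` is hypothetical, and it assumes the listing is plain `rpm -ql` output with one path per line.

```python
from pathlib import PurePosixPath


def bitcode_dirs(listing: str) -> list[str]:
    """Return the sorted directories that hold .bc files in an `rpm -ql` listing."""
    dirs = {
        str(PurePosixPath(line.strip()).parent)
        for line in listing.splitlines()
        if line.strip().endswith(".bc")
    }
    return sorted(dirs)


# Shortened sample taken from the listing above.
sample = """\
/usr/lib/clang/17/amdgcn
/usr/lib/clang/17/amdgcn/bitcode
/usr/lib/clang/17/amdgcn/bitcode/ockl.bc
/usr/lib/clang/17/amdgcn/bitcode/ocml.bc
/usr/lib64/cmake/AMDDeviceLibs/AMDDeviceLibsConfig.cmake
"""

print(bitcode_dirs(sample))  # ['/usr/lib/clang/17/amdgcn/bitcode']
```

On a real system the listing could be captured with `subprocess.run(["rpm", "-ql", "rocm-device-libs"], capture_output=True, text=True).stdout` and passed to the same helper.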

Running

python3 ./build/build.py --enable_rocm --rocm_amdgpu_targets=gfx1100,gfx1101 --rocm_path=/usr --python_version=3.11

I get the following error:

[738 / 8,363] Executing genrule @local_config_rocm//rocm:rocm-lib; 3s local ... (24 actions, 23 running)
ERROR: /home/richard/.cache/bazel/_bazel_richard/2be5aaa14d23ec4c14ff72f51bd1b8a9/external/xla/xla/stream_executor/rocm/BUILD:206:13: Compiling xla/stream_executor/rocm/hip_conditional_kernels.cu.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target @xla//xla/stream_executor/rocm:hip_conditional_kernels) 
  (cd /home/richard/.cache/bazel/_bazel_richard/2be5aaa14d23ec4c14ff72f51bd1b8a9/execroot/__main__ && \
  exec env - \
    PATH=/home/richard/.local/bin:/home/richard/.local/miniconda3/condabin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/var/lib/snapd/snap/bin \
    PWD=/proc/self/cwd \
    ROCM_PATH=/usr \
    TF_ROCM_AMDGPU_TARGETS=gfx1100,gfx1101 \
  external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections -fdata-sections '-std=c++14' -MD -MF bazel-out/k8-opt/bin/external/xla/xla/stream_executor/rocm/_objs/hip_conditional_kernels/hip_conditional_kernels.cu.pic.d '-frandom-seed=bazel-out/k8-opt/bin/external/xla/xla/stream_executor/rocm/_objs/hip_conditional_kernels/hip_conditional_kernels.cu.pic.o' -fPIC '-DBAZEL_CURRENT_REPOSITORY="xla"' -iquote external/xla -iquote bazel-out/k8-opt/bin/external/xla -iquote external/local_config_rocm -iquote bazel-out/k8-opt/bin/external/local_config_rocm -isystem external/local_config_rocm/rocm -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm -isystem external/local_config_rocm/rocm/rocm/include -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include -isystem external/local_config_rocm/rocm/rocm/include/rocrand -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include/rocrand -isystem external/local_config_rocm/rocm/rocm/include/roctracer -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include/roctracer '-fvisibility=hidden' -Wno-sign-compare -Wno-unknown-warning-option -Wno-stringop-truncation -Wno-array-parameter '-DMLIR_PYTHON_PACKAGE_PREFIX=jaxlib.mlir.' -mavx '-std=c++17' -x rocm '--amdgpu-target=gfx1100' '--amdgpu-target=gfx1101' -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' '-DTENSORFLOW_USE_ROCM=1' -D__HIP_PLATFORM_AMD__ -DEIGEN_USE_HIP -DUSE_ROCM -no-canonical-prefixes -fno-canonical-system-headers -c external/xla/xla/stream_executor/rocm/hip_conditional_kernels.cu.cc -o bazel-out/k8-opt/bin/external/xla/xla/stream_executor/rocm/_objs/hip_conditional_kernels/hip_conditional_kernels.cu.pic.o)
# Configuration: b3c10413bb73176f65c435545cfa027fac61a68a2dc64684a92b98fca12a27fa
# Execution platform: @local_execution_config_platform//:platform
/home/richard/.cache/bazel/_bazel_richard/2be5aaa14d23ec4c14ff72f51bd1b8a9/execroot/__main__/external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc:162: SyntaxWarning: invalid escape sequence '\.'
  re.search('\.cpp$|\.cc$|\.c$|\.cxx$|\.C$', f)]
/home/richard/.cache/bazel/_bazel_richard/2be5aaa14d23ec4c14ff72f51bd1b8a9/execroot/__main__/external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc:23: DeprecationWarning: 'pipes' is deprecated and slated for removal in Python 3.13
  import pipes
clang: warning: argument unused during compilation: '-fcuda-flush-denormals-to-zero' [-Wunused-command-line-argument]
clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library

Is there some way to specify --rocm-device-lib-path for the build? Unfortunately, I am completely unfamiliar with Bazel and don't even know where to start looking.

Thanks!

System info (python version, jaxlib version, accelerator, etc.)

jax: 0.4.30
Python: 3.11
OS: Fedora 40
ROCm: 6.0.2
GPU: 7800 XT

hawkinsp commented 2 weeks ago

Sorry, I missed that this had been assigned to me. Is this still a problem?

richardfoltyn commented 1 week ago

Hi,

sorry for taking a while to get back on this.

I realized I can get around the particular issue of missing device files by creating the symlink

ls -la /usr/amdgcn
lrwxrwxrwx. 1 root root 19 Sep 29 12:42 /usr/amdgcn -> lib/clang/17/amdgcn
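The symlink presumably helps because the compiler looks for the device libraries in a directory relative to the ROCm root passed via --rocm_path. The sketch below checks whether a given root exposes the bitcode files in such a location. The candidate subdirectories are an assumption inferred from the workaround above, not clang's authoritative search list, and the function name `find_device_lib_dir` is hypothetical.

```python
from pathlib import Path

# Assumed layouts relative to the ROCm root; clang's real search logic may differ.
CANDIDATE_SUBDIRS = ("amdgcn/bitcode", "lib/amdgcn/bitcode")
# Two bitcode files from the rocm-device-libs package, used as markers.
MARKER_FILES = ("ocml.bc", "ockl.bc")


def find_device_lib_dir(rocm_root):
    """Return the first candidate directory holding all marker files, else None."""
    root = Path(rocm_root)
    for sub in CANDIDATE_SUBDIRS:
        candidate = root / sub
        # is_file() follows symlinks, so a symlinked amdgcn directory is accepted.
        if all((candidate / name).is_file() for name in MARKER_FILES):
            return candidate
    return None


if __name__ == "__main__":
    # /usr matches the --rocm_path used in this thread.
    print(find_device_lib_dir("/usr"))
```

If this prints `None` for the chosen root, the device libraries are not in any of the assumed locations, which would be consistent with the "cannot find ROCm device library" error.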

However, now the build fails at a later stage:

BUILD_DIR=~/build/rocm
mkdir -p ${BUILD_DIR}

cd ${BUILD_DIR}

git clone -b rocm-jaxlib-v0.4.31 https://github.com/ROCm/jax.git
git clone -b rocm-jaxlib-v0.4.31 https://github.com/ROCm/xla.git

cd jax

python3 ./build/build.py --clang_path=/usr/bin/clang-17  --enable_rocm --rocm_amdgpu_targets=gfx1100  --build_gpu_plugin --gpu_plugin_rocm_version=60  --bazel_options=--override_repository=xla=${BUILD_DIR}/xla --rocm_path=/usr --enable_mkl_dnn=false

build-rocm-jax-0.4.31.log

I also tried with the main branch and the default XLA, but that causes a different error which seems to be related to building zlib. build-jax-main.log

Thanks!

Ruturaj4 commented 6 days ago

@richardfoltyn thanks for notifying us of the issue.

We are currently working on a clang patch. Meanwhile, can you try compiling like this:

rm -rf dist; python3.11 -m pip uninstall jax jaxlib jax-rocm60-pjrt jax-rocm60-plugin -y; python3.11 ./build/build.py --use_clang=false --enable_rocm --build_gpu_plugin --gpu_plugin_rocm_version=60 --rocm_amdgpu_targets=[gfxXXX] --bazel_options=--override_repository=xla=[xla_dir] --rocm_path=/opt/rocm-6.2.1/ && python3.11 setup.py develop --user && python3.11 -m pip install dist/*.whl

richardfoltyn commented 3 days ago

Hi @Ruturaj4 ,

It does not seem to make a difference whether I use clang or not. Running the command you suggested (adapted for the Fedora 40 setup),

BUILD_DIR=~/build/rocm
mkdir -p ${BUILD_DIR}

cd ${BUILD_DIR}

git clone -b rocm-jaxlib-v0.4.33 https://github.com/ROCm/jax.git
git clone -b rocm-jaxlib-v0.4.33 https://github.com/ROCm/xla.git

cd jax
python3.11 ./build/build.py --use_clang=false --enable_rocm --build_gpu_plugin --gpu_plugin_rocm_version=60 --rocm_amdgpu_targets=gfx1100 --bazel_options=--override_repository=xla=/home/richard/build/rocm/xla --rocm_path=/usr

produces the same error, see jax-build-no-clang.txt

I know it works on the Ubuntu-based Docker container that AMD/ROCm provide, so it's probably something specific to Fedora.

As you probably know, Fedora ships ROCm-6.1.2 directly in its repos, and everything is installed right into /usr as opposed to /opt/rocm-6.x.y.

Ruturaj4 commented 2 days ago

@richardfoltyn hmm. I don't have a Fedora container to reproduce this error. Can you add the include path lib/clang/17/include

to the list here: https://github.com/openxla/xla/blob/9e28b002070276a852de6b5508224d35d2547d51/third_party/tsl/third_party/gpus/rocm_configure.bzl#L210

And check if it compiles?