jax-ml / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0
30.46k stars 2.8k forks

Errors when building on AMD GPU #19989

Open markliuchina opened 8 months ago

markliuchina commented 8 months ago

Description

Hi! I am very interested in using STARRED (a Python package for astronomical data processing with GPU support). However, it depends on jax and jaxlib.

I followed the instructions on your documentation website and built jaxlib with the following command:
python build/build.py --enable_rocm --rocm_path=/opt/rocm-6.0.2 --bazel_options=--override_repository=xla=/home/phylmf/lib/xla
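The invocation above can also be written as an explicit argument list, which makes it easier to pin the GPU target list instead of relying on the default (the log shows TF_ROCM_AMDGPU_TARGETS=gfx900,gfx906,gfx908,gfx90a,gfx1030). This is a minimal sketch under stated assumptions: `build_jaxlib_command` is a hypothetical helper, and `--rocm_amdgpu_targets` is assumed to be a build/build.py flag in this checkout, so confirm it with `python build/build.py --help` first. Narrowing the targets is not a confirmed fix for the error reported below.

```python
import shlex


# Hypothetical helper (not part of JAX): builds the build.py invocation as an
# argument list. --enable_rocm, --rocm_path and --bazel_options come from the
# command above; --rocm_amdgpu_targets is an ASSUMED build.py flag, so check
# `python build/build.py --help` in your checkout first.
def build_jaxlib_command(rocm_path, xla_path, amdgpu_targets=None):
    cmd = [
        "python", "build/build.py",
        "--enable_rocm",
        f"--rocm_path={rocm_path}",
        f"--bazel_options=--override_repository=xla={xla_path}",
    ]
    if amdgpu_targets:
        # Comma-separated, matching the TF_ROCM_AMDGPU_TARGETS format in the log.
        cmd.append("--rocm_amdgpu_targets=" + ",".join(amdgpu_targets))
    return cmd


cmd = build_jaxlib_command("/opt/rocm-6.0.2", "/home/phylmf/lib/xla", ["gfx1030"])
# Print a shell-ready string; subprocess.run(cmd, check=True) would execute it.
print(shlex.join(cmd))
```

Keeping the command as a list mirrors how build.py itself hands its Bazel command to `subprocess.check_output` (visible in the traceback), which avoids shell-quoting problems with the `=`-heavy `--bazel_options` value.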

I am confident ROCm is set up correctly: I can run PyTorch and Stable Diffusion without any trouble.

However, the build stopped with the following error:

[637 / 2,481] Compiling xla/hlo/evaluator/hlo_evaluator.cc; 16s local ... (12 actions running)
[640 / 2,481] Compiling xla/hlo/evaluator/hlo_evaluator.cc; 18s local ... (12 actions, 11 running)
[640 / 2,481] Compiling xla/hlo/evaluator/hlo_evaluator.cc; 19s local ... (12 actions running)
ERROR: /home/phylmf/.cache/bazel/_bazel_phylmf/23e2a325a95637686413138cca4e49b3/external/xla/xla/service/gpu/BUILD:1434:23: Compiling xla/service/gpu/cub_sort_kernel.cu.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target @xla//xla/service/gpu:cub_sort_kernel_u64_b64) 
  (cd /home/phylmf/.cache/bazel/_bazel_phylmf/23e2a325a95637686413138cca4e49b3/execroot/__main__ && \
  exec env - \
    LD_LIBRARY_PATH=/home/phylmf/lib/MultiNest/lib/:/home/phylmf/anaconda3/pkgs/mpi-1.0-mpich/lib/ \
    PATH=/home/phylmf/lib/idl_lib/idlutils/bin:/home/phylmf/anaconda3/bin:/home/phylmf/anaconda3/condabin:/home/phylmf/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin \
    PWD=/proc/self/cwd \
    ROCM_PATH=/opt/rocm-6.0.2 \
    TF_ROCM_AMDGPU_TARGETS=gfx900,gfx906,gfx908,gfx90a,gfx1030 \
  external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections -fdata-sections '-std=c++14' -MD -MF bazel-out/k8-opt/bin/external/xla/xla/service/gpu/_objs/cub_sort_kernel_u64_b64/cub_sort_kernel.cu.pic.d '-frandom-seed=bazel-out/k8-opt/bin/external/xla/xla/service/gpu/_objs/cub_sort_kernel_u64_b64/cub_sort_kernel.cu.pic.o' -fPIC '-DEIGEN_MAX_ALIGN_BYTES=64' -DEIGEN_ALLOW_UNALIGNED_SCALARS '-DEIGEN_USE_AVX512_GEMM_KERNELS=0' '-DTENSORFLOW_USE_ROCM=1' -DCUB_TYPE_U64_B64 '-DBAZEL_CURRENT_REPOSITORY="xla"' -iquote external/xla -iquote bazel-out/k8-opt/bin/external/xla -iquote external/eigen_archive -iquote bazel-out/k8-opt/bin/external/eigen_archive -iquote external/tsl -iquote bazel-out/k8-opt/bin/external/tsl -iquote external/local_config_rocm -iquote bazel-out/k8-opt/bin/external/local_config_rocm -isystem external/eigen_archive -isystem bazel-out/k8-opt/bin/external/eigen_archive -isystem external/eigen_archive/mkl_include -isystem bazel-out/k8-opt/bin/external/eigen_archive/mkl_include -isystem external/local_config_rocm/rocm -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm -isystem external/local_config_rocm/rocm/rocm/include/hipcub -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include/hipcub -isystem external/local_config_rocm/rocm/rocm/include/rocprim -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include/rocprim -isystem external/local_config_rocm/rocm/rocm/include -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include -isystem external/local_config_rocm/rocm/rocm/include/rocrand -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include/rocrand -isystem external/local_config_rocm/rocm/rocm/include/roctracer -isystem 
bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include/roctracer '-fvisibility=hidden' -Wno-sign-compare -Wno-unknown-warning-option -Wno-stringop-truncation -Wno-array-parameter '-DMLIR_PYTHON_PACKAGE_PREFIX=jaxlib.mlir.' -mavx '-std=c++17' -x rocm '--amdgpu-target=gfx900' '--amdgpu-target=gfx906' '--amdgpu-target=gfx908' '--amdgpu-target=gfx90a' '--amdgpu-target=gfx1030' -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' '-DTENSORFLOW_USE_ROCM=1' -D__HIP_PLATFORM_AMD__ -DEIGEN_USE_HIP -no-canonical-prefixes -fno-canonical-system-headers -c external/xla/xla/service/gpu/cub_sort_kernel.cu.cc -o bazel-out/k8-opt/bin/external/xla/xla/service/gpu/_objs/cub_sort_kernel_u64_b64/cub_sort_kernel.cu.pic.o)
# Configuration: 2fefc4c2633450a982f0e7bdbf0123ec0a48cb805d729dfac1feb2058e0090a6
# Execution platform: @local_execution_config_platform//:platform
clang: warning: argument unused during compilation: '-fgpu-flush-denormals-to-zero' [-Wunused-command-line-argument]
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr52 = V_MOV_B32_dpp undef $vgpr52(tied-def 0), $vgpr12, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), killed $vgpr3, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr3 = V_MOV_B32_dpp undef $vgpr3(tied-def 0), $vgpr2, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr7 = V_MOV_B32_dpp undef $vgpr7(tied-def 0), killed $vgpr0, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr0 = V_MOV_B32_dpp undef $vgpr0(tied-def 0), $vgpr4, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr44 = V_MOV_B32_dpp undef $vgpr44(tied-def 0), $vgpr42, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr47 = V_MOV_B32_dpp undef $vgpr47(tied-def 0), $vgpr43, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr40 = V_MOV_B32_dpp undef $vgpr40(tied-def 0), $vgpr38, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr43 = V_MOV_B32_dpp undef $vgpr43(tied-def 0), $vgpr39, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr44 = V_MOV_B32_dpp undef $vgpr44(tied-def 0), $vgpr42, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr47 = V_MOV_B32_dpp undef $vgpr47(tied-def 0), $vgpr43, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr40 = V_MOV_B32_dpp undef $vgpr40(tied-def 0), $vgpr38, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr43 = V_MOV_B32_dpp undef $vgpr43(tied-def 0), $vgpr39, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr52 = V_MOV_B32_dpp undef $vgpr52(tied-def 0), $vgpr20, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr44 = V_MOV_B32_dpp undef $vgpr44(tied-def 0), $vgpr42, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr47 = V_MOV_B32_dpp undef $vgpr47(tied-def 0), $vgpr43, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr40 = V_MOV_B32_dpp undef $vgpr40(tied-def 0), $vgpr38, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr43 = V_MOV_B32_dpp undef $vgpr43(tied-def 0), $vgpr39, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr44 = V_MOV_B32_dpp undef $vgpr44(tied-def 0), $vgpr42, 322, 15, 15, 0, implicit $exec
fatal error: too many errors emitted, stopping now [-ferror-limit=]
renamable $vgpr47 = V_MOV_B32_dpp undef $vgpr47(tied-def 0), $vgpr43, 322, 15, 15, 0, implicit $exec
renamable $vgpr40 = V_MOV_B32_dpp undef $vgpr40(tied-def 0), $vgpr38, 322, 15, 15, 0, implicit $exec
renamable $vgpr43 = V_MOV_B32_dpp undef $vgpr43(tied-def 0), $vgpr39, 322, 15, 15, 0, implicit $exec
20 errors generated when compiling for gfx1034.
Target //jaxlib/tools:build_wheel failed to build
INFO: Elapsed time: 307.781s, Critical Path: 46.77s
INFO: 650 processes: 13 internal, 637 local.
FAILED: Build did NOT complete successfully
ERROR: Build failed. Not running target
Traceback (most recent call last):
  File "/home/phylmf/lib/jax/build/build.py", line 706, in <module>
    main()
  File "/home/phylmf/lib/jax/build/build.py", line 674, in main
    shell(command)
  File "/home/phylmf/lib/jax/build/build.py", line 45, in shell
    output = subprocess.check_output(cmd)
  File "/home/phylmf/anaconda3/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/phylmf/anaconda3/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/bazel', 'run', '--verbose_failures=true', '//jaxlib/tools:build_wheel', '--', '--output_path=/home/phylmf/lib/jax/dist', '--jaxlib_git_hash=fab8f6cfdd1065d02bc48c3f7d65e31d0979d3c6', '--cpu=x86_64']' returned non-zero exit status 1.
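Two details in this log are worth comparing: the targets requested through TF_ROCM_AMDGPU_TARGETS (gfx900 through gfx1030) and the target named in clang's summary line (gfx1034), while the error itself says DPP broadcasts are not supported on GFX10+. The sketch below extracts both from the log text. `gfx_major` and `diagnose` are hypothetical helpers, and the generation parsing assumes the usual LLVM AMDGPU naming (`gfx` + major version + one minor digit + one stepping character).

```python
import re


def gfx_major(target):
    # Assumes LLVM AMDGPU naming: "gfx" + major + one minor digit + one stepping
    # character, e.g. "gfx90a" -> 9, "gfx1034" -> 10.
    return int(target[3:-2])


def diagnose(log_text):
    # Targets requested through the environment shown in the Bazel command.
    m = re.search(r"TF_ROCM_AMDGPU_TARGETS=([\w,]+)", log_text)
    requested = m.group(1).split(",") if m else []
    # Targets named in clang's "N errors generated when compiling for gfxNNNN." lines.
    failed = re.findall(r"errors generated when compiling for (gfx\w+)", log_text)
    # Deduplicate while keeping order.
    seen = list(dict.fromkeys(requested + failed))
    return {
        "requested": requested,
        "failed": failed,
        "unrequested_failures": [t for t in failed if t not in requested],
        # The DPP error says broadcasts are not supported on GFX10+.
        "gfx10_plus": [t for t in seen if gfx_major(t) >= 10],
    }


log = (
    "TF_ROCM_AMDGPU_TARGETS=gfx900,gfx906,gfx908,gfx90a,gfx1030 \\\n"
    "20 errors generated when compiling for gfx1034.\n"
)
report = diagnose(log)
print(report["unrequested_failures"])  # ['gfx1034']
print(report["gfx10_plus"])            # ['gfx1030', 'gfx1034']
```

Run against the lines above, this flags gfx1034 as a GFX10+ target that was compiled even though it is not in the requested list; why the toolchain targeted it is not established by this log alone.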

Any help would be appreciated!

Thanks a lot for making jax/jaxlib available!

Mingfeng Liu, Nanjing Normal University

System info (python version, jaxlib version, accelerator, etc.)

(base) phylmf@nnu-astro-Precision-3260:~$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents
==========


Agent 1


Name: 12th Gen Intel(R) Core(TM) i5-12500
Uuid: CPU-XX
Marketing Name: 12th Gen Intel(R) Core(TM) i5-12500
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 49152(0xc000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4600
BDFID: 0
Internal Node ID: 0
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32540432(0x1f08710) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32540432(0x1f08710) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32540432(0x1f08710) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx1030
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 6400
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 1024(0x400) KB
L3: 16384(0x4000) KB
Chip ID: 29759(0x743f)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2320
BDFID: 768
Internal Node ID: 1
Compute Unit: 12
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 116
SDMA engine uCode:: 34
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 4177920(0x3fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 4177920(0x3fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1030
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Done
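When choosing a GPU target list, the device's ISA name can be read from rocminfo. The hypothetical `gpu_agents` helper below is a minimal sketch that keys only on the `Name: gfx...` and `Wavefront Size:` lines seen in the output above; it assumes no other rocminfo structure. Note that rocminfo here reports gfx1030, while the failed compile in the build log was for gfx1034.

```python
import re


def gpu_agents(rocminfo_text):
    # Hypothetical helper: collects GPU agents from `rocminfo` output using only
    # the "Name: gfx..." and "Wavefront Size: N(0x..)" lines.
    agents = []
    for raw in rocminfo_text.splitlines():
        line = raw.strip()
        name = re.match(r"Name:\s+(gfx\w+)$", line)
        if name:
            agents.append({"name": name.group(1), "wavefront": None})
            continue
        wave = re.match(r"Wavefront Size:\s+(\d+)", line)
        if wave and agents:
            # Attach to the most recent GPU agent seen.
            agents[-1]["wavefront"] = int(wave.group(1))
    return agents


sample = """Name: 12th Gen Intel(R) Core(TM) i5-12500
Name: gfx1030
Marketing Name: AMD Radeon RX 6400
Wavefront Size: 32(0x20)
Name: amdgcn-amd-amdhsa--gfx1030
"""
print(gpu_agents(sample))  # [{'name': 'gfx1030', 'wavefront': 32}]
```

The CPU agent and the `amdgcn-amd-amdhsa--...` ISA line are skipped because the pattern requires the value to start with `gfx`.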

rahulbatra85 commented 8 months ago

Hi, JAX on ROCm is currently supported only on AMD Instinct (MI-series) GPUs. We are working on support for Navi/Radeon GPUs in the near future.