rocm-arch / tensorflow-rocm

tensorflow-rocm AUR package
Compiling tensorflow/stream_executor/rocm/rocm_helpers.cu.cc failed #44

Closed BishopWolf closed 9 months ago

BishopWolf commented 2 years ago

The error I get is always the same:

ERROR: /var/tmp/pamac-build-alex/tensorflow-amd/src/tensorflow-2.9.2-amd/tensorflow/stream_executor/rocm/BUILD:416:11: Compiling tensorflow/stream_executor/rocm/rocm_helpers.cu.cc failed: undeclared inclusion(s) in rule '//tensorflow/stream_executor/rocm:rocm_helpers':
this rule is missing dependency declarations for the following files included by 'tensorflow/stream_executor/rocm/rocm_helpers.cu.cc':
  '/opt/rocm/hip/include/hip/hip_version.h'
  '/opt/rocm/hip/include/hip/hip_runtime.h'
  '/opt/rocm/hip/include/hip/hip_common.h'
  '/opt/rocm/hip/include/hip/amd_detail/hip_runtime.h'
  '/opt/rocm/hip/include/hip/amd_detail/hip_common.h'
  '/opt/rocm/hip/include/hip/hip_runtime_api.h'
  '/opt/rocm/hip/include/hip/amd_detail/hip_runtime_api.h'
  '/opt/rocm/hip/include/hip/amd_detail/host_defines.h'
  '/opt/rocm/hip/include/hip/amd_detail/driver_types.h'
  '/opt/rocm/hip/include/hip/amd_detail/hip_texture_types.h'
  '/opt/rocm/hip/include/hip/amd_detail/channel_descriptor.h'
  '/opt/rocm/hip/include/hip/amd_detail/hip_vector_types.h'
  '/opt/rocm/hip/include/hip/amd_detail/texture_types.h'
  '/opt/rocm/hip/include/hip/amd_detail/hip_surface_types.h'
  '/opt/rocm/hip/include/hip/amd_detail/hip_ldg.h'
  '/opt/rocm/hip/include/hip/amd_detail/hip_atomic.h'
  '/opt/rocm/hip/include/hip/amd_detail/device_functions.h'
  '/opt/rocm/hip/include/hip/amd_detail/math_fwd.h'
  '/opt/rocm/hip/include/hip/hip_vector_types.h'
  '/opt/rocm/hip/include/hip/amd_detail/device_library_decls.h'
  '/opt/rocm/hip/include/hip/amd_detail/llvm_intrinsics.h'
  '/opt/rocm/hip/include/hip/amd_detail/surface_functions.h'
  '/opt/rocm/hip/include/hip/amd_detail/texture_fetch_functions.h'
  '/opt/rocm/hip/include/hip/hip_texture_types.h'
  '/opt/rocm/hip/include/hip/amd_detail/ockl_image.h'
  '/opt/rocm/hip/include/hip/amd_detail/texture_indirect_functions.h'
  '/opt/rocm/hip/include/hip/amd_detail/math_functions.h'
  '/opt/rocm/hip/include/hip/amd_detail/hip_fp16_math_fwd.h'
  '/opt/rocm/hip/include/hip/amd_detail/hip_memory.h'
  '/opt/rocm/hip/include/hip/library_types.h'
  '/opt/rocm/hip/include/hip/amd_detail/library_types.h'

petronny commented 2 years ago

After adding the missing dependencies (roctracer and gcc11), I'm also getting this error. Full build log: https://github.com/arch4edu/cactus/actions/runs/3059355491/jobs/4937911118

BishopWolf commented 1 year ago

Currently the package depends on a lot of standalone rocm packages. Fortunately, the opencl-amd-dev package is functional and contains everything you could possibly need.

Depending on it instead would be a step toward avoiding all the rocm dependency problems.

petronny commented 1 year ago

The files listed in the error message are not missing. They exist but are just not declared in the rule.

PS. The rule is located in tensorflow/stream_executor/rocm/BUILD.
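
A quick way to confirm this on your own machine is to check every path from the error message against the filesystem. The following is only a sketch: `report_missing_headers` and `build.log` are hypothetical names, and it assumes the log lists each header on its own line in single quotes, as in the output above.

```shell
# Read a Bazel log on stdin, pull out the quoted absolute paths, and print
# only those that do not exist on disk. No output means every header exists,
# so the problem is the missing declaration, not a missing file.
# (report_missing_headers is a hypothetical helper name.)
report_missing_headers() {
  sed -n "s|^[[:space:]]*'\(/[^']*\)'[[:space:]]*\$|\1|p" |
  while IFS= read -r path; do
    [ -e "$path" ] || printf '%s\n' "$path"
  done
}
```

Usage, assuming the build output was saved to `build.log`: `report_missing_headers < build.log`.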

BishopWolf commented 1 year ago

@petronny Could you please elaborate? How can I fix this issue?

petronny commented 1 year ago

I haven't figured it out either, but building from 2.11.0 would be a good start.

At 2.10.0, the rules declared in tensorflow/stream_executor/rocm/BUILD differ from the rules in the tensorflow-amd upstream, which work. As of 2.11.0 they are the same.

However, just upgrading pkgver to 2.11.0 in PKGBUILD won't fix the issue.

lubosz commented 1 year ago

I'm getting the same error for version 2.12.0-3 of the package, using rocm 5.6.0. The build worked fine before, so I suppose a rocm update broke it.

ERROR: /home/bmonkey/code/aur/tensorflow-rocm/src/tensorflow-2.12.0-opt-rocm/tensorflow/compiler/xla/stream_executor/rocm/BUILD:406:11: Compiling tensorflow/compiler/xla/stream_executor/rocm/rocm_helpers.cu.cc failed: undeclared inclusion(s) in rule '//tensorflow/compiler/xla/stream_executor/rocm:rocm_helpers':
this rule is missing dependency declarations for the following files included by 'tensorflow/compiler/xla/stream_executor/rocm/rocm_helpers.cu.cc':
  '/opt/rocm/llvm/lib/clang/16.0.0/include/__clang_hip_runtime_wrapper.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/cuda_wrappers/cmath'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/stddef.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/__clang_hip_libdevice_declares.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/__clang_hip_math.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/cuda_wrappers/algorithm'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/cuda_wrappers/new'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/limits.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/stdint.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/__clang_hip_stdlib.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/__clang_cuda_math_forward_declares.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/__clang_hip_cmath.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/__clang_cuda_complex_builtins.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/cuda_wrappers/complex'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/__stddef_max_align_t.h'
  '/opt/rocm/llvm/lib/clang/16.0.0/include/stdarg.h'

I should also mention that I needed the following hacks to get this far, as the rocm package paths have apparently changed:

sudo ln -s /opt/rocm/bin/hipcc /opt/rocm/hip/bin/hipcc
sudo ln -s /opt/rocm/bin/hipcc.pl /opt/rocm/hip/bin/hipcc.pl

Otherwise I was getting:

sh: line 1: /opt/rocm/hip/bin/hipcc: No such file or directory
Can't open perl script "/opt/rocm/hip/bin//hipcc.pl": No such file or directory
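
The two symlinks above can also be created by a small helper that leaves existing files alone and is safe to run more than once. This is only a sketch: `link_legacy_hipcc` is a hypothetical name, the root defaults to /opt/rocm as in the commands above, and creating `hip/bin` when it is absent is an assumption about what the build expects.

```shell
# Link <root>/hip/bin/{hipcc,hipcc.pl} to <root>/bin/{hipcc,hipcc.pl}, but
# only when the new location exists and the legacy one does not.
# (link_legacy_hipcc is a hypothetical helper; root defaults to /opt/rocm.)
link_legacy_hipcc() {
  root="${1:-/opt/rocm}"
  # Assumption: create the legacy directory if the package no longer ships it.
  mkdir -p "$root/hip/bin"
  for tool in hipcc hipcc.pl; do
    if [ -e "$root/bin/$tool" ] && [ ! -e "$root/hip/bin/$tool" ]; then
      ln -s "$root/bin/$tool" "$root/hip/bin/$tool"
    fi
  done
}
```

Writing under /opt/rocm needs root privileges, so the function would have to be run from a root shell there.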

mpeschel10 commented 1 year ago

@lubosz, I never ran into this issue. Have you tried building in a clean chroot? If your rocm installation is borked and pacman -Syu doesn't fix it, doing a chroot build may be more practical than removing and reinstalling the entire rocm toolchain. It is surprisingly easy.

I have started an evening build and will report back tomorrow if it succeeds/fails.

Edit: Forgot to mention that, in addition to making the change export GCC_HOST_COMPILER_PATH=/usr/bin/gcc-12, I also replace all instances of gcc with gcc-12 and g++ with g++-12. Might be important idk.

mpeschel10 commented 1 year ago

nvm, I'm getting the hipcc issue in a clean chroot build.

sh: line 1: /opt/rocm/hip/bin/hipcc: No such file or directory

I will confirm that my own tensorflow-amd-git package still works before getting back to this. Edit: dang, tensorflow-upstream is also broken.

lubosz commented 1 year ago

sh: line 1: /opt/rocm/hip/bin/hipcc: No such file or directory

@mpeschel10 Opened a new bug report for this issue: https://github.com/rocm-arch/tensorflow-rocm/issues/57

lubosz commented 1 year ago

So it turns out this error is a bazel feature. It is raised when bazel runs into includes that the rule does not declare. This can happen due to caching issues, as stated above, or, as in my current case, due to actual changes in system includes.

Further reading:
https://stackoverflow.com/questions/43921911/how-to-resolve-bazel-undeclared-inclusions-error
https://github.com/tensorflow/tensorflow/issues/10665#issuecomment-308931453

In my case, tensorflow maintains a list of rocm llvm headers in its build system, version by version. When the llvm version got bumped to 16, the build broke.
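
To see which clang version a failing build is actually picking up, the version can be read straight out of the paths in the error message. A minimal sketch, assuming the headers sit under a `.../lib/clang/<version>/include/` directory as in the log above; `clang_versions_in_log` and `build.log` are hypothetical names.

```shell
# Print each distinct clang resource-directory version that appears in a
# Bazel log read on stdin (paths of the form .../lib/clang/<version>/include/...).
# (clang_versions_in_log is a hypothetical helper name.)
clang_versions_in_log() {
  sed -n "s|.*/lib/clang/\([0-9][0-9.]*\)/include/.*|\1|p" | LC_ALL=C sort -u
}
```

Usage, assuming the build output was saved to `build.log`: `clang_versions_in_log < build.log`. Comparing the result with the contents of /opt/rocm/llvm/lib/clang shows whether the installed version is one the build system knows about.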

I have a fix for rocm 5.6 available on my branch of this package: https://github.com/lubosz/tensorflow-rocm/commit/9d540c96c60d38f73ab374c1194a7efb34160034

It's fixed on the master branch of tensorflow. Up to llvm 17. Future proof.

mpeschel10 commented 1 year ago

I have a fix for rocm 5.6 available on my branch of this package: lubosz@9d540c9

I confirm that this PKGBUILD builds without errors and appears to be as functional as it was before the 5.6 update. I did manually link /opt/rocm/hip/bin/hipcc to /opt/rocm/bin/hipcc, so I can't confirm if this also resolves issue #57, but it's probably cool.

(I can't thoroughly test it; I get a std::bad_variant_access when I call model.fit(), but that was happening before the update.)

acxz commented 1 year ago

It's fixed on the master branch of tensorflow. Up to llvm 17. Future proof.

@lubosz thanks for the detailed investigation! Can you link the exact commit where the change occurs in upstream tensorflow?

Edit: Found it here: https://github.com/tensorflow/tensorflow/commit/c97cec76fc145c25543b0e7545d5ea3ad4f8e764

acxz commented 9 months ago

Closed, since 2.15.0 has the fix.