rocm-arch / tensorflow-rocm

tensorflow-rocm AUR package

Segmentation fault, late stage of build #26

Closed · supermar1010 closed this 1 year ago

supermar1010 commented 3 years ago

I got pretty far with the build, but then a segmentation fault was raised; I'm not sure whether this is Arch-specific.

Any ideas? Or should I post this on the tensorflow repo?

compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f16_i1_kernel_generator_kernel.o [for host]; 2s local
    compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f64_i1_kernel_generator_kernel.o [fERROR: /tmp/trizen-mario/tensorflow-rocm/src/tensorflow-2.5.0-rocm/tensorflow/core/kernels/mlir_generated/BUILD:957:23: compile tensorflow/core/kernels/mlir_generated/is_finite_gpu_f16_i1_kernel_generator_kernel.o [for host] failed: (Segmentation fault): tf_to_kernel failed: error executing command bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel '--unroll_factors=4' '--tile_sizes=256' '--arch=gfx701,gfx702,gfx803,gfx900,gfx904,gfx906,gfx908' ... (remaining 4 argument(s) skipped)
[20,437 / 21,317] 11 actions running
    compile tensorflow/core/kernels/mlir_generated/is_finite_gpu_f64_i1_kernel_generator_kernel.o [for host]; 3s local
    compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f16_i1_kernel_generator_kernel.o [for host]; 2s local
    compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f64_i1_kernel_generator_kernel.o [for host]; 2s local
    compile tensorflow/core/kernels/mlir_generated/is_nan_gpu_f16_i1_kernel_generator_kernel.o [f2021-06-18 15:06:05.299828: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
0x556475d2e878: i1 = FP_CLASS 0x5564779f2e08, Constant:i32<504>TensorFlow crashed, please file a bug on https://github.com/tensorflow/tensorflow/issues with the trace below.
Stack dump:
0.  Program arguments: bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel --unroll_factors=4 --tile_sizes=256 --arch=gfx701,gfx702,gfx803,gfx900,gfx904,gfx906,gfx908 --input=bazel-out/host/bin/tensorflow/core/kernels/mlir_generated/is_finite_gpu_f16_i1.mlir --output=bazel-out/host/bin/tensorflow/core/kernels/mlir_generated/is_finite_gpu_f16_i1_kernel_generator_kernel.o --enable_ftz=False --cpu_codegen=False
1.  2.  Running pass 'CallGraph Pass Manager' on module 'acme'.
3.  Running pass 'AMDGPU DAG->DAG Pattern Instruction Selection' on function '@IsFinite_GPU_DT_HALF_DT_BOOL_kernel'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x408d943)[0x556473344943]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x408bb0d)[0x556473342b0d]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x408bc94)[0x556473342c94]
/usr/lib/libpthread.so.0(+0x13870)[0x7f9763e9a870]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2b070e8)[0x556471dbe0e8]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x18acc23)[0x556470b63c23]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2a9adf2)[0x556471d51df2]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2b6f3b6)[0x556471e263b6]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2bb88c6)[0x556471e6f8c6]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2b6fa1e)[0x556471e26a1e]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2b6fb98)[0x556471e26b98]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2a80a73)[0x556471d37a73]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2a837e0)[0x556471d3a7e0]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2a856a6)[0x556471d3c6a6]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2e0f83f)[0x5564720c683f]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x3f03645)[0x5564731ba645]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x3bc50e7)[0x556472e7c0e7]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x3f030a1)[0x5564731ba0a1]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x169de1d)[0x556470954e1d]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x16a2bef)[0x556470959bef]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0xbe3199)[0x55646fe9a199]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x361381d)[0x5564728ca81d]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x361394a)[0x5564728ca94a]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x361427b)[0x5564728cb27b]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x3612b6f)[0x5564728c9b6f]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x36133ac)[0x5564728ca3ac]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x361394a)[0x5564728ca94a]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x3615c06)[0x5564728ccc06]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x7f78dd)[0x55646faae8dd]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x6b5eb8)[0x55646f96ceb8]
/usr/lib/libc.so.6(__libc_start_main+0xd5)[0x7f9763348b25]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x7f060e)[0x55646faa760e]
[20,437 / 21,317] 11 actions running
    compile tensorflow/core/kernels/mlir_generated/is_finite_gpu_f64_i1_kernel_generator_kernel.o [for host]; 3s local
    compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f16_i1_kernel_generator_kernel.o [for host]; 2s local
    compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f64_i1_kernel_generator_kernel.o [for host]; 2s local
    compile tensorflow/core/kernels/mlir_generated/is_nan_gpu_f16_i1_kernel_generator_kernel.o [fERROR: /tmp/trizen-mario/tensorflow-rocm/src/tensorflow-2.5.0-rocm/tensorflow/tools/pip_package/BUILD:284:10 Middleman _middlemen/tensorflow_Stools_Spip_Upackage_Sbuild_Upip_Upackage-runfiles failed: (Segmentation fault): tf_to_kernel failed: error executing command bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel '--unroll_factors=4' '--tile_sizes=256' '--arch=gfx701,gfx702,gfx803,gfx900,gfx904,gfx906,gfx908' ... (remaining 4 argument(s) skipped)
INFO: Elapsed time: 11598.345s, Critical Path: 268.21s
INFO: 20448 processes: 1439 internal, 19009 local.
FAILED: Build did NOT complete successfully
==> ERROR: A failure occurred in build().
    Aborting...
:: Unable to build tensorflow-rocm - makepkg exited with code: 4
DumbledoreMD commented 3 years ago

Hi. Did you manage to find a workaround for this?

supermar1010 commented 3 years ago

I wouldn't say I found a workaround. I found out there are prebuilt versions in the arch4edu repo; it's linked in the readme :) They didn't work for me though, because I have an RX 570, which is not supported.

astrowave commented 3 years ago

ROCm doesn't support MLIR-generated GPU kernels yet, so we need to include a further build argument: --define=tensorflow_enable_mlir_generated_gpu_kernels=0

EDIT:

The above is not exactly correct. While it gets the build further along, AMD has dropped support for gfx803 (RX 470/570) and earlier. It is possible that it could still build against that target, but it requires at least a rocBLAS workaround that isn't implemented, so it seems any attempt to build this repo as it stands will fail whether or not the user has a gfx803 card.

Also, if Bazel is being run with Java 16, we need to set a JVM flag, since the default behaviour changed in that version: bazel --host_jvm_args=--illegal-access=permit
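
For illustration, here is roughly how both flags could be wired into the build invocation. This is only a sketch: the --config flags and the pip-package target below are assumptions based on the usual upstream TensorFlow build, not the exact command the PKGBUILD runs.

    # Sketch, not the PKGBUILD's exact command. --host_jvm_args is a Bazel startup
    # option (it goes before "build"); the --define disables the MLIR-generated GPU kernels.
    bazel --host_jvm_args=--illegal-access=permit build \
        --define=tensorflow_enable_mlir_generated_gpu_kernels=0 \
        --config=opt --config=rocm \
        //tensorflow/tools/pip_package:build_pip_package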

There are still other issues I'm facing, but I will open a pull request for all of this once (hopefully) I've managed to build tensorflow-rocm.

riaqn commented 2 years ago

@astrowave I believe gfx803 support was reintroduced in ROCm 4.3.0. Search for 'gfx803' here: http://radeonopencompute.github.io/ROCm/

Would you be so kind as to look into this?

I'm building 2.6.0 with ROCm 4.3 and am stuck at the issue in the OP. EDIT: after editing the PKGBUILD and restricting the target list to gfx803 only, the compilation finishes! It might be because my rocBLAS is built for gfx803 only. A rough sketch of the change is below.
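
The change boiled down to narrowing the GPU target list exported during the build. The variable name below is an assumption (I believe it is what TensorFlow's ROCm configure step reads), so check how the PKGBUILD actually passes the target list.

    # Sketch: restrict the ROCm GPU targets to what the installed rocBLAS supports.
    # Variable name assumed; adjust to however the PKGBUILD sets the target list.
    export TF_ROCM_AMDGPU_TARGETS="gfx803"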

riaqn commented 2 years ago

@supermar1010 You might need to rebuild rocBLAS. Try deleting library/src/blas3/Tensile/Logic/asm_full/r9nano*.yaml from rocBLAS, per https://github.com/xuhuisheng/rocm-build/tree/master/gfx803#rocm-41-and-rocm-42-crashed-with-gfx803
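
The linked workaround roughly amounts to dropping the r9nano Tensile logic files and rebuilding rocBLAS from source, along these lines. The repo URL, paths and install script flag are my reading of the rocBLAS source tree at the time, so treat them as assumptions if your setup differs.

    # Sketch of the linked gfx803 workaround: remove the r9nano (Fiji/gfx803) logic
    # files, then rebuild and reinstall rocBLAS so its gfx803 kernels are regenerated.
    git clone https://github.com/ROCmSoftwarePlatform/rocBLAS.git
    cd rocBLAS
    rm library/src/blas3/Tensile/Logic/asm_full/r9nano*.yaml
    ./install.sh -i    # rocBLAS's build-and-install helper script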

acxz commented 1 year ago

Closing this as a stale build issue. If you run into further problems, please open a new issue. Sorry @supermar1010 @DumbledoreMD