rocm-arch / tensorflow-rocm

tensorflow-rocm AUR package

[tensorflow-rocm] does not build #25

Closed t1nux closed 1 year ago

t1nux commented 3 years ago

Hello

I can't get this to build on my up-to-date Arch system (with gcc 11.1.0).

> yay -S tensorflow-opt-rocm
:: Checking for conflicts...
:: Checking for inner conflicts...
[Aur:1]  tensorflow-rocm-2.4.0-4 (tensorflow-opt-rocm)

  1 tensorflow-rocm (tensorflow-opt-rocm) (Build Files Exist)
==> Packages to cleanBuild?
==> [N]one [A]ll [Ab]ort [I]nstalled [No]tInstalled or (1 2 3, 1-3, ^4)
==> A
:: Deleting (1/1): /home/tinux/.cache/yay/tensorflow-rocm
:: Downloaded PKGBUILD (1/1): tensorflow-rocm (tensorflow-opt-rocm)
  1 tensorflow-rocm (tensorflow-opt-rocm) (Build Files Exist)
==> Diffs to show?
==> [N]one [A]ll [Ab]ort [I]nstalled [No]tInstalled or (1 2 3, 1-3, ^4)
==>
:: (1/1) Parsing SRCINFO: tensorflow-rocm (tensorflow-opt-rocm)
==> Making package: tensorflow-rocm 2.4.0-4 (2021-06-09T11:15:47 CEST)
==> Retrieving sources...
  -> Downloading tensorflow-rocm-2.4.0.tar.gz...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   129  100   129    0     0    617      0 --:--:-- --:--:-- --:--:--   620
100 50.7M    0 50.7M    0     0  7169k      0 --:--:--  0:00:07 --:--:-- 8139k
  -> Found fix-h5py3.0.patch
  -> Found build-against-actual-mkl.patch
==> Validating source files with sha512sums...
    tensorflow-rocm-2.4.0.tar.gz ... Passed
    fix-h5py3.0.patch ... Passed
    build-against-actual-mkl.patch ... Passed
==> Making package: tensorflow-rocm 2.4.0-4 (2021-06-09T11:15:57 CEST)
==> Checking runtime dependencies...
==> Checking buildtime dependencies...
==> Retrieving sources...
  -> Found tensorflow-rocm-2.4.0.tar.gz
  -> Found fix-h5py3.0.patch
  -> Found build-against-actual-mkl.patch
==> Validating source files with sha512sums...
    tensorflow-rocm-2.4.0.tar.gz ... Passed
    fix-h5py3.0.patch ... Passed
    build-against-actual-mkl.patch ... Passed
==> Removing existing $srcdir/ directory...
==> Extracting sources...
  -> Extracting tensorflow-rocm-2.4.0.tar.gz with bsdtar
==> Starting prepare()...
patching file tensorflow/python/keras/saving/hdf5_format.py
==> Sources are ready.
==> Making package: tensorflow-rocm 2.4.0-4 (2021-06-09T11:16:03 CEST)
==> Checking runtime dependencies...
==> Checking buildtime dependencies...
==> WARNING: Using existing $srcdir/ tree
==> Starting build()...
/home/tinux/.cache/yay/tensorflow-rocm/PKGBUILD: line 111: /opt/cuda/bin/nvcc: No such file or directory
sed: can't read /usr/include/cudnn_version.h: No such file or directory
Building with rocm and without non-x86-64 optimizations
You have bazel 4.0.0 installed.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
    --config=mkl            # Build with MKL support.
    --config=mkl_aarch64    # Build with oneDNN support for Aarch64.
    --config=monolithic     # Config for mostly static monolithic build.
    --config=ngraph         # Build with Intel nGraph support.
    --config=numa           # Build with NUMA support.
    --config=dynamic_kernels    # (Experimental) Build kernels into separate shared objects.
    --config=v2             # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
    --config=noaws          # Disable AWS S3 filesystem support.
    --config=nogcp          # Disable GCP support.
    --config=nohdfs         # Disable HDFS support.
    --config=nonccl         # Disable NVIDIA NCCL support.
Configuration finished
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=95
INFO: Reading rc options for 'build' from /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/.bazelrc:
  'build' options: --apple_platform_type=macos --define framework_shared_object=true --define open_source_build=true --java_toolchain=//third_party/toolchains/java:tf_java_toolchain --host_java_toolchain=//third_party/toolchains/java:tf_java_toolchain --define=tensorflow_enable_mlir_generated_gpu_kernels=0 --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --noincompatible_prohibit_aapt1 --enable_platform_specific_config --config=short_logs --config=v2
INFO: Reading rc options for 'build' from /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/.tf_configure.bazelrc:
  'build' options: --action_env PYTHON_BIN_PATH=/usr/bin/python --action_env PYTHON_LIB_PATH=/home/tinux/dev/python/modules --python_path=/usr/bin/python --action_env PYTHONPATH=:/home/tinux/dev/python/modules --config=xla --config=rocm --action_env LD_LIBRARY_PATH=/home/tinux/dev/cpp/lib --action_env TF_SYSTEM_LIBS=boringssl,curl,cython,gif,icu,libjpeg_turbo,lmdb,nasm,pcre,png,pybind11,zlib --action_env TF_CONFIGURE_IOS=0
INFO: Found applicable config definition build:short_logs in file /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:xla in file /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/.bazelrc: --define=with_xla_support=true
INFO: Found applicable config definition build:rocm in file /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/.bazelrc: --crosstool_top=@local_config_rocm//crosstool:toolchain --define=using_rocm=true --define=using_rocm_hipcc=true --action_env TF_NEED_ROCM=1
INFO: Found applicable config definition build:mkl in file /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/.bazelrc: --define=build_with_mkl=true --define=enable_mkl=true --define=tensorflow_mkldnn_contraction_kernel=0 --define=build_with_openmp=true -c opt
INFO: Found applicable config definition build:linux in file /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/.bazelrc: --copt=-w --host_copt=-w --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++14 --host_cxxopt=-std=c++14 --config=dynamic_kernels
INFO: Found applicable config definition build:dynamic_kernels in file /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
INFO: Analyzed 4 targets (412 packages loaded, 31130 targets configured).
INFO: Found 4 targets...
ERROR: /home/tinux/.cache/bazel/_bazel_tinux/d139faef51dbe2e3e865f944bca5e716/external/com_google_absl/absl/strings/BUILD.bazel:83:11: Compiling absl/strings/internal/utf8.cc failed: undeclared inclusion(s) in rule '@com_google_absl//absl/strings:internal':
this rule is missing dependency declarations for the following files included by 'absl/strings/internal/utf8.cc':
  '/usr/lib/gcc/x86_64-pc-linux-gnu/11.1.0/include/stddef.h'
  '/usr/lib/gcc/x86_64-pc-linux-gnu/11.1.0/include/stdint.h'
  '/usr/lib/gcc/x86_64-pc-linux-gnu/11.1.0/include-fixed/limits.h'
  '/usr/lib/gcc/x86_64-pc-linux-gnu/11.1.0/include-fixed/syslimits.h'
INFO: Elapsed time: 7.125s, Critical Path: 0.49s
INFO: 18 processes: 18 internal.
FAILED: Build did NOT complete successfully
==> ERROR: A failure occurred in build().
    Aborting...
error making: tensorflow-rocm (tensorflow-opt-rocm)

t1nux commented 3 years ago

Setting this

export GCC_HOST_COMPILER_PATH=/usr/bin/gcc-10
export HOST_C_COMPILER=/usr/bin/gcc-10
export HOST_CXX_COMPILER=/usr/bin/g++-10
export CC=gcc-10
export CXX=g++-10

in the PKGBUILD looks good at first, but the build eventually fails at link time with

...
ERROR: /home/tinux/.cache/yay/tensorflow-rocm/src/tensorflow-2.4.0-rocm/tensorflow/BUILD:786:20: Linking tensorflow/libtensorflow.so.2.4.0 failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/k8-opt/bin/tensorflow/libtensorflow.so.2.4.0-2.params
bazel-out/k8-opt/bin/tensorflow/core/kernels/data/_objs/optional_ops_gpu/optional_ops.cu.pic.o:optional_ops.cu.cc:function tensorflow::Status tensorflow::data::OptionalZerosLike<Eigen::GpuDevice>(tensorflow::OpKernelContext*, tensorflow::data::OptionalVariant const&, tensorflow::data::OptionalVariant*): error: undefined reference to 'std::__throw_bad_array_new_length()'
bazel-out/k8-opt/bin/tensorflow/core/kernels/data/_objs/optional_ops_gpu/optional_ops.cu.pic.o:optional_ops.cu.cc:function tensorflow::Status tensorflow::data::OptionalBinaryAdd<Eigen::GpuDevice>(tensorflow::OpKernelContext*, tensorflow::data::OptionalVariant const&, tensorflow::data::OptionalVariant const&, tensorflow::data::OptionalVariant*): error: undefined reference to 'std::__throw_bad_array_new_length()'
bazel-out/k8-opt/bin/tensorflow/core/kernels/data/_objs/optional_ops_gpu/optional_ops.cu.pic.o:optional_ops.cu.cc:function std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Identity, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::_M_rehash_aux(unsigned long, std::integral_constant<bool, true>): error: undefined reference to 'std::__throw_bad_array_new_length()'
bazel-out/k8-opt/bin/tensorflow/core/kernels/data/_objs/optional_ops_gpu/optional_ops.cu.pic.o:optional_ops.cu.cc:function std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >, std::allocator<std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> > >, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> > const&>(std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*&, std::_Sp_alloc_shared_tag<std::allocator<std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> > > >, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> > const&): error: undefined reference to 'std::__throw_bad_array_new_length()'
collect2: error: ld returned 1 exit status
INFO: Elapsed time: 4006.308s, Critical Path: 358.11s
INFO: 16176 processes: 718 internal, 15458 local.
FAILED: Build did NOT complete successfully
==> ERROR: A failure occurred in build().
    Aborting...
error making: tensorflow-rocm (tensorflow-opt-rocm)
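
A note on the link error above: `std::__throw_bad_array_new_length()` is, as far as I know, a libstdc++ symbol that first appears with GCC 11. If that is right, the failure may mean that some objects (here `optional_ops.cu.pic.o`) were compiled against GCC 11 headers while the final link used an older libstdc++. The sketch below is one way to check which libstdc++ copies export the symbol. `has_symbol` is a hypothetical helper, and the library paths are whatever you pass on the command line, not paths taken from the PKGBUILD.

```shell
#!/usr/bin/env bash
# Sketch only: check whether the given libstdc++ shared objects export the
# symbol that the linker reported as undefined.

# Itanium-ABI mangled name of std::__throw_bad_array_new_length()
SYMBOL='_ZSt28__throw_bad_array_new_lengthv'

# has_symbol LIB  (hypothetical helper, not part of the PKGBUILD)
# Returns 0 if LIB defines $SYMBOL, 1 if it does not, 2 if LIB is unreadable.
has_symbol() {
  local lib="$1"
  [[ -r "$lib" ]] || return 2
  if nm -D --defined-only "$lib" 2>/dev/null | grep -q "$SYMBOL"; then
    return 0
  fi
  return 1
}

# Report on every library path given as an argument.
for lib in "$@"; do
  rc=0
  has_symbol "$lib" || rc=$?
  case "$rc" in
    0) echo "$lib: exports $SYMBOL" ;;
    1) echo "$lib: does not export $SYMBOL" ;;
    *) echo "$lib: not readable" ;;
  esac
done
```

Saved under a name of your choosing, it could be run against the system libstdc++ and the one belonging to gcc-10 to compare the two results.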

supermar1010 commented 3 years ago

Fix these two warnings first:

/home/tinux/.cache/yay/tensorflow-rocm/PKGBUILD: line 111: /opt/cuda/bin/nvcc: No such file or directory
sed: can't read /usr/include/cudnn_version.h: No such file or directory

Installing cudnn should fix both of them. After that I also cleaned the cache dir, but I'm not sure whether that was necessary.

astrowave commented 3 years ago

cuDNN is NVIDIA's deep neural network library for CUDA. CUDA support is disabled in the ROCm build, so cudnn should not be needed for this package.
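
If the two messages really come from unconditional CUDA probes in the PKGBUILD (an assumption, since the thread does not show what line 111 does), a guard along these lines would skip them when the CUDA toolchain is absent. `cuda_available` is a hypothetical helper; its two default paths are the ones from the messages quoted above.

```shell
#!/usr/bin/env bash
# Sketch only: run CUDA version probes only when the CUDA toolchain exists.

# cuda_available [NVCC] [CUDNN_HEADER]  (hypothetical helper)
# Succeeds only if nvcc is executable and the cuDNN version header is readable.
cuda_available() {
  local nvcc="${1:-/opt/cuda/bin/nvcc}"
  local cudnn_header="${2:-/usr/include/cudnn_version.h}"
  [[ -x "$nvcc" && -r "$cudnn_header" ]]
}

if cuda_available; then
  echo "CUDA toolchain found, CUDA version probes can run"
else
  echo "CUDA toolchain not found, skipping CUDA version probes"
fi
```

On a ROCm-only machine this would take the second branch, so the `nvcc` and `sed` calls would never run.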

astrowave commented 3 years ago

@t1nux - Hi, I'm also getting this error and I'm trying to pin down what is causing it. Which GPU were you building tensorflow for?

t1nux commented 2 years ago

@astrowave Sorry for the late reply. I'm trying to build for two slightly different GPUs on two different PCs. One is a Radeon VII and the other is a Radeon Pro VII.

acxz commented 1 year ago

Closing this as a stale build issue. If you run into further problems, please open a new issue. Sorry, @t1nux and @astrowave.