tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Building from source "undefined reference to `cublasGemmStridedBatchedEx' " #34401

Closed NKCSRzChen closed 3 years ago

NKCSRzChen commented 4 years ago

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

Describe the problem

Provide the exact sequence of commands / steps that you executed before running into the problem

You have bazel 0.29.1- (@non-git) installed.
Please specify the location of python. [Default is /usr/bin/python]: /usr/bin/python

Found possible Python library paths:
  /usr/lib/python3/dist-packages
  /usr/local/lib/python3.5/dist-packages
Please input the desired Python library path to use. Default is [/usr/lib/python3/dist-packages]
/usr/local/lib/python3.5/dist-packages

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: y XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: n No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]: n No TensorRT support will be enabled for TensorFlow.

Found CUDA 9.1 in:
  /usr/local/cuda/lib64
  /usr/local/cuda/include
Found cuDNN 7 in:
  /usr/local/cuda/lib64
  /usr/local/cuda/include

Please specify a list of comma-separated CUDA compute capabilities you want to build with. You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus. Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 5.2]: 5.2

Do you want to use clang as CUDA compiler? [y/N]: n nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
  --config=mkl             # Build with MKL support.
  --config=monolithic      # Config for mostly static monolithic build.
  --config=ngraph          # Build with Intel nGraph support.
  --config=numa            # Build with NUMA support.
  --config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
  --config=v2              # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
  --config=noaws           # Disable AWS S3 filesystem support.
  --config=nogcp           # Disable GCP support.
  --config=nohdfs          # Disable HDFS support.
  --config=nonccl          # Disable NVIDIA NCCL support.
Configuration finished

$ bazel build --verbose_failures --action_env=LD_LIBRARY_PATH=/usr/local/cuda/lib64 --config=cuda --config=opt //tensorflow/tools/pip_package:build_pip_package --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --config=nonccl --config=nohdfs --config=mkl --config=monolithic --config=noaws --config=nogcp --config=monolithic

Any other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

ERROR: /home/tensorflow/tensorflow/tensorflow/lite/toco/BUILD:439:1: Linking of rule '//tensorflow/lite/toco:toco' failed (Exit 1)
[19,054 / 21,282] 46 actions running
Compiling tensorflow/core/kernels/slice_op_gpu.cu.cc [for host]; 181s local
Compiling tensorflow/core/kernels/pad_op_gpu.cu.cc [for host]; 154s local
Compiling tensorflow/core/kernels/tile_functor_cpu.cc [for host]; 123s local
Compiling tensorflow/core/kernels/resource_variable_ops.cc [for host]; 106s local
Compiling tensorflow/core/kernels/conv_ops_fused_float.cc [for host]; 94s local
Compiling tensorflow/core/kernels/conv_ops_fused_double.cc [for host]; 91s local
Compiling tensorflow/core/kernels/conv_ops.cc [for host]; 91s local
Compiling tensorflow/core/kernels/conv_grad_ops_3d.cc [for host]; 90s local
...
bazel-out/host/bin/tensorflow/stream_executor/cuda/libcublas_plugin.lo(cuda_blas.o): In function `tensorflow::Status stream_executor::gpu::CUDABlas::DoBlasGemmBatchedInternal<Eigen::half, float, cublasStatus_t (*)(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, float const*, float const**, int, float const**, int, float const*, float**, int, int)>(cublasStatus_t (*)(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, float const*, float const**, int, float const**, int, float const*, float**, int, int), stream_executor::Stream*, stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, float, absl::Span<stream_executor::DeviceMemory<Eigen::half>* const> const&, int, absl::Span<stream_executor::DeviceMemory<Eigen::half>* const> const&, int, float, absl::Span<stream_executor::DeviceMemory<Eigen::half>* const> const&, int, int, stream_executor::ScratchAllocator*)':
cuda_blas.cc:(.text._ZN15stream_executor3gpu8CUDABlas25DoBlasGemmBatchedInternalIN5Eigen4halfEfPF14cublasStatus_tP13cublasContext17cublasOperation_tS8_iiiPKfPSA_iSB_iSA_PPfiiEEEN10tensorflow6StatusET1_PNS_6StreamENS_4blas9TransposeESM_yyyT0_RKN4absl4SpanIKPNS_12DeviceMemoryIT_EEEEiSX_iSN_SX_iiPNS_16ScratchAllocatorE[_ZN15stream_executor3gpu8CUDABlas25DoBlasGemmBatchedInternalIN5Eigen4halfEfPF14cublasStatus_tP13cublasContext17cublasOperation_tS8_iiiPKfPSA_iSB_iSA_PPfiiEEEN10tensorflow6StatusET1_PNS_6StreamENS_4blas9TransposeESM_yyyT0_RKN4absl4SpanIKPNS_12DeviceMemoryIT_EEEEiSX_iSN_SX_iiPNS_16ScratchAllocatorE]+0x8c2): undefined reference to `cublasGemmBatchedEx'
bazel-out/host/bin/tensorflow/stream_executor/cuda/libcublas_plugin.lo(cuda_blas.o): In function `tensorflow::Status stream_executor::gpu::CUDABlas::DoBlasGemmBatchedInternal<float, float, cublasStatus_t (*)(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, float const*, float const**, int, float const**, int, float const*, float**, int, int)>(cublasStatus_t (*)(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, float const*, float const**, int, float const**, int, float const*, float**, int, int), stream_executor::Stream*, stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, float, absl::Span<stream_executor::DeviceMemory<float>* const> const&, int, absl::Span<stream_executor::DeviceMemory<float>* const> const&, int, float, absl::Span<stream_executor::DeviceMemory<float>* const> const&, int, int, stream_executor::ScratchAllocator*)':
cuda_blas.cc:(.text._ZN15stream_executor3gpu8CUDABlas25DoBlasGemmBatchedInternalIffPF14cublasStatus_tP13cublasContext17cublasOperation_tS6_iiiPKfPS8_iS9_iS8_PPfiiEEEN10tensorflow6StatusET1_PNS_6StreamENS_4blas9TransposeESK_yyyT0_RKN4absl4SpanIKPNS_12DeviceMemoryIT_EEEEiSV_iSL_SV_iiPNS_16ScratchAllocatorE[_ZN15stream_executor3gpu8CUDABlas25DoBlasGemmBatchedInternalIffPF14cublasStatus_tP13cublasContext17cublasOperation_tS6_iiiPKfPS8_iS9_iS8_PPfiiEEEN10tensorflow6StatusET1_PNS_6StreamENS_4blas9TransposeESK_yyyT0_RKN4absl4SpanIKPNS_12DeviceMemoryIT_EEEEiSV_iSL_SV_iiPNS_16ScratchAllocatorE]+0x9c7): undefined reference to `cublasGemmBatchedEx'
bazel-out/host/bin/tensorflow/stream_executor/cuda/libcublas_plugin.lo(cuda_blas.o): In function `tensorflow::Status stream_executor::gpu::CUDABlas::DoBlasGemmBatchedInternal<double, double, cublasStatus_t (*)(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, double const*, double const**, int, double const**, int, double const*, double**, int, int)>(cublasStatus_t (*)(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, double const*, double const**, int, double const**, int, double const*, double**, int, int), stream_executor::Stream*, stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, double, absl::Span<stream_executor::DeviceMemory<double>* const> const&, int, absl::Span<stream_executor::DeviceMemory<double>* const> const&, int, double, absl::Span<stream_executor::DeviceMemory<double>* const> const&, int, int, stream_executor::ScratchAllocator*)':
cuda_blas.cc:(.text._ZN15stream_executor3gpu8CUDABlas25DoBlasGemmBatchedInternalIddPF14cublasStatus_tP13cublasContext17cublasOperation_tS6_iiiPKdPS8_iS9_iS8_PPdiiEEEN10tensorflow6StatusET1_PNS_6StreamENS_4blas9TransposeESK_yyyT0_RKN4absl4SpanIKPNS_12DeviceMemoryIT_EEEEiSV_iSL_SV_iiPNS_16ScratchAllocatorE[_ZN15stream_executor3gpu8CUDABlas25DoBlasGemmBatchedInternalIddPF14cublasStatus_tP13cublasContext17cublasOperation_tS6_iiiPKdPS8_iS9_iS8_PPdiiEEEN10tensorflow6StatusET1_PNS_6StreamENS_4blas9TransposeESK_yyyT0_RKN4absl4SpanIKPNS_12DeviceMemoryIT_EEEEiSV_iSL_SV_iiPNS_16ScratchAllocatorE]+0xceb): undefined reference to `cublasGemmBatchedEx'
bazel-out/host/bin/tensorflow/stream_executor/cuda/libcublas_plugin.lo(cuda_blas.o): In function `tensorflow::Status stream_executor::gpu::CUDABlas::DoBlasGemmBatchedInternal<std::complex<float>, std::complex<float>, cublasStatus_t (*)(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, float2 const*, float2 const**, int, float2 const**, int, float2 const*, float2**, int, int)>(cublasStatus_t (*)(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, float2 const*, float2 const**, int, float2 const**, int, float2 const*, float2**, int, int), stream_executor::Stream*, stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, std::complex<float>, absl::Span<stream_executor::DeviceMemory<std::complex<float> >* const> const&, int, absl::Span<stream_executor::DeviceMemory<std::complex<float> >* const> const&, int, std::complex<float>, absl::Span<stream_executor::DeviceMemory<std::complex<float> >* const> const&, int, int, stream_executor::ScratchAllocator*)':
cuda_blas.cc:(.text._ZN15stream_executor3gpu8CUDABlas25DoBlasGemmBatchedInternalISt7complexIfES4_PF14cublasStatus_tP13cublasContext17cublasOperation_tS8_iiiPK6float2PSB_iSC_iSB_PPS9_iiEEEN10tensorflow6StatusET1_PNS_6StreamENS_4blas9TransposeESN_yyyT0_RKN4absl4SpanIKPNS_12DeviceMemoryIT_EEEEiSY_iSO_SY_iiPNS_16ScratchAllocatorE[_ZN15stream_executor3gpu8CUDABlas25DoBlasGemmBatchedInternalISt7complexIfES4_PF14cublasStatus_tP13cublasContext17cublasOperation_tS8_iiiPK6float2PSB_iSC_iSB_PPS9_iiEEEN10tensorflow6StatusET1_PNS_6StreamENS_4blas9TransposeESN_yyyT0_RKN4absl4SpanIKPNS_12DeviceMemoryIT_EEEEiSY_iSO_SY_iiPNS_16ScratchAllocatorE]+0xd2b): undefined reference to `cublasGemmBatchedEx'
bazel-out/host/bin/tensorflow/stream_executor/cuda/libcublas_plugin.lo(cuda_blas.o): In function `tensorflow::Status stream_executor::gpu::CUDABlas::DoBlasGemmBatchedInternal<std::complex<double>, std::complex<double>, cublasStatus_t (*)(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, double2 const*, double2 const**, int, double2 const**, int, double2 const*, double2**, int, int)>(cublasStatus_t (*)(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, double2 const*, double2 const**, int, double2 const**, int, double2 const*, double2**, int, int), stream_executor::Stream*, stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, std::complex<double>, absl::Span<stream_executor::DeviceMemory<std::complex<double> >* const> const&, int, absl::Span<stream_executor::DeviceMemory<std::complex<double> >* const> const&, int, std::complex<double>, absl::Span<stream_executor::DeviceMemory<std::complex<double> >* const> const&, int, int, stream_executor::ScratchAllocator*)':
cuda_blas.cc:(.text._ZN15stream_executor3gpu8CUDABlas25DoBlasGemmBatchedInternalISt7complexIdES4_PF14cublasStatus_tP13cublasContext17cublasOperation_tS8_iiiPK7double2PSB_iSC_iSB_PPS9_iiEEEN10tensorflow6StatusET1_PNS_6StreamENS_4blas9TransposeESN_yyyT0_RKN4absl4SpanIKPNS_12DeviceMemoryIT_EEEEiSY_iSO_SY_iiPNS_16ScratchAllocatorE[_ZN15stream_executor3gpu8CUDABlas25DoBlasGemmBatchedInternalISt7complexIdES4_PF14cublasStatus_tP13cublasContext17cublasOperation_tS8_iiiPK7double2PSB_iSC_iSB_PPS9_iiEEEN10tensorflow6StatusET1_PNS_6StreamENS_4blas9TransposeESN_yyyT0_RKN4absl4SpanIKPNS_12DeviceMemoryIT_EEEEiSY_iSO_SY_iiPNS_16ScratchAllocatorE]+0xcfb): undefined reference to `cublasGemmBatchedEx'
bazel-out/host/bin/tensorflow/stream_executor/cuda/libcublas_plugin.lo(cuda_blas.o): In function `stream_executor::gpu::CUDABlas::DoBlasGemmStridedBatched(stream_executor::Stream*, stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, float, stream_executor::DeviceMemory<Eigen::half> const&, int, long long, stream_executor::DeviceMemory<Eigen::half> const&, int, long long, float, stream_executor::DeviceMemory<Eigen::half>*, int, long long, int)':
cuda_blas.cc:(.text._ZN15stream_executor3gpu8CUDABlas24DoBlasGemmStridedBatchedEPNS_6StreamENS_4blas9TransposeES5_yyyfRKNS_12DeviceMemoryIN5Eigen4halfEEEixSB_ixfPS9_ixi+0x2e3): undefined reference to `cublasGemmStridedBatchedEx'
collect2: error: ld returned 1 exit status
[19,054 / 21,282] 46 actions running
Compiling tensorflow/core/kernels/slice_op_gpu.cu.cc [for host]; 181s local
Compiling tensorflow/core/kernels/pad_op_gpu.cu.cc [for host]; 154s local
Compiling tensorflow/core/kernels/tile_functor_cpu.cc [for host]; 123s local
Compiling tensorflow/core/kernels/resource_variable_ops.cc [for host]; 106s local
Compiling tensorflow/core/kernels/conv_ops_fused_float.cc [for host]; 94s local
Compiling tensorflow/core/kernels/conv_ops_fused_double.cc [for host]; 91s local
Compiling tensorflow/core/kernels/conv_ops.cc [for host]; 91s local
Compiling tensorflow/core/kernels/conv_grad_ops_3d.cc [for host]; 90s local
...
Target //tensorflow/tools/pip_package:build_pip_package failed to build
[19,101 / 21,282] checking cached actions
INFO: Elapsed time: 1270.016s, Critical Path: 336.18s
[19,101 / 21,282] checking cached actions
INFO: 10667 processes: 10667 local.
[19,101 / 21,282] checking cached actions
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully

I searched all the existing issues about "undefined reference to `cublasGemmStridedBatchedEx'", but I believe mine is different.
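A long linker log like this one is easier to read once it is reduced to the distinct symbols that are missing. The following is only a sketch: `undefined_symbols` is a hypothetical helper name (not part of TensorFlow or bazel), and it assumes the log uses the GNU ld message format seen in the issue title, with the symbol between a backtick and a single quote.

```shell
# Hypothetical helper: reads linker output on stdin and prints each distinct
# undefined symbol, preceded by the number of times it was reported.
undefined_symbols() {
  grep -o "undefined reference to \`[^']*'" |
    sed "s/.*\`\(.*\)'/\1/" |
    sort | uniq -c | sort -rn |
    awk '{ print $1, $2 }'
}

# Example usage, assuming the build output was saved to a file named build.log:
#   undefined_symbols < build.log
```

If the summary shows only a handful of names that all belong to one library, that points at the library being linked rather than at the many object files that reference it.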

NKCSRzChen commented 4 years ago

Hello, I think the cause of this issue has been found. My system has CUDA 9.1, but TensorFlow v1.15 requires at least CUDA 10. More details can be found at https://www.tensorflow.org/install/source#common_installation_problems

What I still don't understand is why I cannot build TensorFlow 1.15 from source in this environment, yet I can install TensorFlow 1.15 or 2.0 with pip.
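One way to test that explanation is to ask the cuBLAS library itself whether it exports the two symbols named in the errors. This is a minimal sketch, assuming GNU binutils `nm` is installed. `lib_exports` is a hypothetical helper, and the library file name in the usage comment is an assumption based on the /usr/local/cuda/lib64 directory reported by the configure step.

```shell
# Hypothetical helper: exit 0 if shared library $1 defines dynamic symbol $2,
# non-zero otherwise (including when the file cannot be read).
lib_exports() {
  nm -D --defined-only "$1" 2>/dev/null |
    awk -v sym="$2" '
      { name = $NF; sub(/@.*/, "", name)   # drop any @VERSION suffix
        if (name == sym) found = 1 }
      END { exit !found }'
}

# Example usage (the path is an assumption; adjust it to the local CUDA install):
#   for s in cublasGemmBatchedEx cublasGemmStridedBatchedEx; do
#     if lib_exports /usr/local/cuda/lib64/libcublas.so "$s"; then
#       echo "$s: exported"
#     else
#       echo "$s: missing"
#     fi
#   done
```

A "missing" result for the library that the linker actually picks up would be consistent with the undefined references in the log. An "exported" result would suggest the linker is resolving a different copy of the library.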

grzegorzk commented 4 years ago

Today I tried to build TensorFlow 1.14 on Ubuntu 16.04 with CUDA 9.1 and cuDNN 7 (using the nvidia/cuda:9.1-cudnn7-devel-ubuntu16.04 Docker image), and my build failed with the same error:

ERROR: /opt/tensorflow/tensorflow/tensorflow/contrib/reduce_slice_ops/BUILD:50:1: Linking of rule '//tensorflow/contrib/reduce_slice_ops:gen_reduce_slice_ops_py_wrappers_cc' failed (Exit 1)
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Sreduce_Uslice_Uops_Cgen_Ureduce_Uslice_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `cublasGemmBatchedEx'
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Sreduce_Uslice_Uops_Cgen_Ureduce_Uslice_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined reference to `cublasGemmStridedBatchedEx'
collect2: error: ld returned 1 exit status
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 813.405s, Critical Path: 96.34s
INFO: 4817 processes: 4817 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully

grzegorzk commented 4 years ago

For anyone facing the same issue: I was using the NVIDIA Docker container with CUDA 9.1 and cuDNN 7. I had to add the following step before building TensorFlow:

echo "/usr/local/cuda-9.1/targets/x86_64-linux/lib" > /etc/ld.so.conf.d/cuda_targets.conf
ldconfig

For whatever reason, /usr/local/cuda-9.1/targets/x86_64-linux/lib was not on the dynamic linker's search path.
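To confirm that the step above took effect, the dynamic linker cache can be inspected with `ldconfig -p`. A minimal sketch follows. `cache_paths` is a hypothetical helper that filters `ldconfig -p` output read from stdin, and it assumes the usual glibc output format of `name (flags) => path`.

```shell
# Hypothetical helper: read `ldconfig -p` output on stdin and print the resolved
# path of every cached library whose name starts with the prefix given as $1.
cache_paths() {
  awk -v prefix="$1" 'index($1, prefix) == 1 { print $NF }'
}

# Example usage after running ldconfig:
#   ldconfig -p | cache_paths libcublas
# An empty result suggests the directory holding libcublas is still not cached.
```

Filtering on the library name rather than on the directory also shows whether more than one copy of libcublas is visible to the linker.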

eLvErDe commented 3 years ago

Hello,

I can confirm the same behavior while attempting to build 1.14 with CUDA coming from either Ubuntu 18.04 or Debian 9 with backports.

Any help would be greatly appreciated,

Regards, Adam.

tensorflowbutler commented 3 years ago

Hi There,

We are checking to see if you still need help on this, as you are using an older version of TensorFlow which is officially considered end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists in newer versions. Please open a new issue for any help you need against 2.x, and we will get you the right help.

This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information.
