
Build fails when building tensorflow in Docker. Lubuntu 22.04.3 #66983

Open cuchio opened 1 year ago

cuchio commented 1 year ago

I want to compile the latest build of tensorflow using the clang 17 libraries.

The latest tensorflow pull has the latest changes, and so should be OK with cuda-toolkit 12.2, libcudnn8 8.9.4 and clang 17.

Therefore, following the git pull in docker, I installed cuda-toolkit 12.2 and libcudnn8 8.9.4.25-1+cuda12.2. cuda-toolkit-12.1 is also installed for tensorRT and cuda-toolkit-11.2 is installed by default.

The automatic configuration script (configure.py) works OK, except that it does not detect the 12.2 library. So I changed the versions and paths manually to suit. See below.

build --action_env PYTHON_BIN_PATH="/usr/bin/python3.9"
build --action_env PYTHON_LIB_PATH="/usr/lib/python3/dist-packages"
build --python_path="/usr/bin/python3.9"
build --config=tensorrt
build --action_env TF_CUDA_VERSION="12.2"
build --action_env TF_CUDNN_VERSION="8"
build --action_env CUDA_TOOLKIT_PATH="/usr/local/cuda-12.2"
build --action_env TF_CUDA_COMPUTE_CAPABILITIES="7.5"
build --action_env LD_LIBRARY_PATH="/usr/local/cuda-12.2/lib64:/usr/local/cuda-12.2/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/include/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs"
build --config=cuda_clang
build --action_env CLANG_CUDA_COMPILER_PATH="/usr/lib/llvm-17/bin/clang"
build --config=cuda_clang
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-Wno-sign-compare
test --test_size_filters=small,medium
test --test_env=LD_LIBRARY_PATH
test:v1 --test_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu,-oss_serial
test:v1 --build_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu
test:v2 --test_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu,-oss_serial,-v1only
test:v2 --build_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu,-v1only

The bazel build then fails. Error note is:

ERROR: /root/.cache/bazel/bazel_root/43801f1e35f242fb634ebbc6079cf6c5/external/com_google_protobuf/BUILD.bazel:364:11: Compiling src/google/protobuf/compiler/cpp/service.cc [for tool] failed: (Segmentation fault): clang failed: error executing command (from target @com_google_protobuf//:protoc_lib) /usr/lib/llvm-17/bin/clang -MD -MF bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf/objs/protoc_lib/0/service.d ... (remaining 49 arguments skipped)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:

  1. Program arguments: /usr/lib/llvm-17/bin/clang -MD -MF bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf/objs/protoc_lib/0/service.d -frandom-seed=bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf/objs/protoc_lib/0/service.o -DBAZEL_CURRENT_REPOSITORY="com_google_protobuf" -iquote external/com_google_protobuf -iquote bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf -iquote external/zlib -iquote bazel-out/k8-opt-exec-50AE0418/bin/external/zlib -isystem external/com_google_protobuf/src -isystem bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf/src -isystem external/zlib -isystem bazel-out/k8-opt-exec-50AE0418/bin/external/zlib -fmerge-all-constants -Wno-builtin-macro-redefined -DDATE="redacted" -DTIMESTAMP="redacted" -DTIME="redacted" -fPIE -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=1 -fstack-protector -Wall -Wno-invalid-partial-specialization -fno-omit-frame-pointer -no-canonical-prefixes -DNDEBUG -g0 -O2 -ffunction-sections -fdata-sections --cuda-path=/usr/local/cuda-12.2 -g0 -w -Wno-sign-compare -g0 -std=c++17 -DHAVE_ZLIB -Woverloaded-virtual -Wno-sign-compare -c external/com_google_protobuf/src/google/protobuf/compiler/cpp/service.cc -o bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf/_objs/protoc_lib/0/service.o

    parser at end of file
    Optimizer

Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.

Thanks for your help.

Artem-B commented 1 year ago

It would help if you could provide the exact clang --version output and attach the shell script and the preprocessed input file clang should've mentioned at the end of the error message.
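Something along these lines usually captures what is needed (a sketch; the /tmp file names are hypothetical, clang prints the real paths at the end of its crash output):

/usr/lib/llvm-17/bin/clang --version
# on a crash, clang writes a preprocessed source and a run script,
# e.g. /tmp/service-123abc.cpp and /tmp/service-123abc.sh (names will differ);
# re-running the script should reproduce the crash outside of bazel
sh /tmp/service-123abc.sh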

cuchio commented 1 year ago

O.K. Will do.

Firstly though, using the non-docker method I have tried to build in a Python venv. This worked well during setup, as it detected the cuda-toolkit 12.2 libraries without problem, as well as clang 17 (Ubuntu 22.04 version).

libcudnn8 was the same as before, i.e. as required for the latest TensorRT (which, for its dependencies, also installed cuda-toolkit 12).

The build went OK for a while before crashing. Clang diagnostic files attached. idl_gen_php-3f43d3.cpp.txt idl_gen_php-3f43d3.sh.txt

Artem-B commented 1 year ago

Clang built from recent sources appears to compile the file above w/o problems for me.

> if you could provide the exact clang --version

Please do provide the version.

cuchio commented 1 year ago

Version installed is:

Ubuntu clang version 17.0.1 (++20230919093346+e19b7dc36bc0-1~exp1~20230919093406.44)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

I am currently checking all the required packages for llvm 17. I think some are not installed.

What I have installed are the packages listed on the llvm site.

1) All required packages

apt-get install clang-format clang-tidy clang-tools clang clangd libc++-dev libc++1 libc++abi-dev libc++abi1 libclang-dev libclang1 liblldb-dev libllvm-ocaml-dev libomp-dev libomp5 lld lldb llvm-dev llvm-runtime llvm python3-clang

with lldb left uninstalled due to the error:

lldb-18 : Depends: python3-lldb-18 but it is not installable

2) llvm 17.

apt-get install clang-17 lldb-17 lld-17
apt-get install libllvm-17-ocaml-dev libllvm17 llvm-17 llvm-17-dev llvm-17-doc llvm-17-examples llvm-17-runtime
apt-get install clang-17 clang-tools-17 clang-17-doc libclang-common-17-dev libclang-17-dev libclang1-17 clang-format-17 python3-clang-17

These installed some new packages as well as updated other packages.
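(For reference, the same versioned packages can also be installed with the automatic script from apt.llvm.org; a sketch, assuming the container has network access:)

wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
./llvm.sh 17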

Will do and re-run build shortly.

cuchio commented 1 year ago

No luck, error files attached.

code_generator-9a7c14.sh.txt code_generator-9a7c14.cpp.txt

Endilll commented 1 year ago

I tried both preprocessed sources with Clang 17.0.1 with assertions enabled, but both compiled just fine for me.

Endilll commented 1 year ago

@cuchio Since we are not able to reproduce your crash locally, it would be helpful if you could reduce the preprocessed sources yourself using creduce, and post a link to Compiler Explorer where the crash reproduces. It doesn't have to be a nice and short reproducer: just something we can pick up on our side.
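A minimal workflow could look like this (a sketch; the clang path, flags, and file name should be taken from your generated .sh/.cpp reproducer, and the grep just checks that clang still reports a crash):

cat > interesting.sh <<'EOF'
#!/bin/sh
# exit 0 only while the reduced file still crashes the compiler
/usr/lib/llvm-17/bin/clang -std=c++17 -O2 -c code_generator-9a7c14.cpp 2>&1 \
  | grep -q "PLEASE submit a bug report"
EOF
chmod +x interesting.sh
creduce ./interesting.sh code_generator-9a7c14.cpp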

cuchio commented 1 year ago

OK, will do.

A quick question first though as I tried again today with all updates. It failed and I noted the following error:

clang: warning: argument unused during compilation: '--cuda-path=/usr/local/cuda-12.2' [-Wunused-command-line-argument]

Does this give a clue to the problem?

Also, I should note my CPU is old, pre-AVX. I assume this is not a problem (Core 2 Quad Q9650).

Thanks in advance.

Endilll commented 1 year ago

> A quick question first though as I tried again today with all updates. It failed and I noted the following error:
> clang: warning: argument unused during compilation: '--cuda-path=/usr/local/cuda-12.2' [-Wunused-command-line-argument]
> Does this give a clue to the problem?

It doesn't ring a bell for me, unfortunately. We need something reproducible, so we can step through the crash locally in the debugger.

> Also, I should note my CPU is old, pre-AVX. I assume this is not a problem (Core 2 Quad Q9650).

This might be significant if the crash occurs in the x86 backend. Thank you for sharing this bit.
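A quick way to tell the two cases apart is the exit status of the generated reproducer script (a sketch; the .sh name is taken from your earlier attachment):

sh code_generator-9a7c14.sh; echo "exit status: $?"   # 132 = SIGILL (illegal instruction), 139 = SIGSEGV
dmesg | tail   # a SIGILL from the compiler usually shows up as a "trap invalid opcode" line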

Artem-B commented 1 year ago

> clang: warning: argument unused during compilation: '--cuda-path=/usr/local/cuda-12.2' [-Wunused-command-line-argument]
> Does this give a clue to the problem?

This is expected -- you have a CUDA path specified, but the failure happens in a C++ compilation, which indeed has no use for the CUDA path.
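The warning can be reproduced on any plain C++ compile, e.g. (a sketch):

echo 'int main() {}' > t.cc
/usr/lib/llvm-17/bin/clang --cuda-path=/usr/local/cuda-12.2 -c t.cc
# clang: warning: argument unused during compilation: '--cuda-path=/usr/local/cuda-12.2' [-Wunused-command-line-argument]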

cuchio commented 11 months ago

Hi, I have decided to try to compile and install using docker and the simplest method described at https://www.tensorflow.org/install/source.

"GPU support"

I have the nvidia drivers installed (nvidia-driver-545) and the nvidia-container-toolkit (https://github.com/NVIDIA/nvidia-container-toolkit).

Appears to be working fine. Command:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        On  | 00000000:01:00.0  On |                  N/A |
| 26%   36C    P8               9W /  75W |      89MiB / 4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

So, appears to be OK.

I then run the docker build steps

  1. docker pull tensorflow/tensorflow:devel-gpu

  2. docker run --gpus all -it -w /tensorflow -v $PWD:/mnt -e HOST_PERMS="$(id -u):$(id -g)" tensorflow/tensorflow:devel-gpu bash

I cd into the tensorflow_src directory (as this is where .git is) and

  1. git pull # within the container, download the latest source code

Appears to update version

  1. Then run ./configure, which produces the following:

root@42d447a8c780:/tensorflow_src# ./configure
You have bazel 6.1.0 installed.
Please specify the location of python. [Default is /usr/bin/python3]:

Found possible Python library paths:
  /usr/lib/python3/dist-packages
  /usr/local/lib/python3.8/dist-packages
Please input the desired Python library path to use. Default is [/usr/lib/python3/dist-packages]

Do you wish to build TensorFlow with ROCm support? [y/N]: y
ROCm support will be enabled for TensorFlow.

Could not find any nvml.h in any subdirectory: '' 'include' 'include/cuda' 'include/*-linux-gnu' 'extras/CUPTI/include' 'include/cuda/CUPTI' 'local/cuda/extras/CUPTI/include' 'targets/x86_64-linux/include' of: '/lib/x86_64-linux-gnu' '/usr' '/usr/local/cuda' '/usr/local/cuda-11.2' '/usr/local/cuda/lib64/stubs' '/usr/local/cuda/targets/x86_64-linux/lib'

Asking for detailed CUDA configuration...

Please specify the TensorRT version you want to use. [Leave empty to default to TensorRT 6]:


So, it appears it can't find nvml.h. I have tried to find it myself without luck, so I am wondering if it is installed at all.
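For reference, a couple of checks inside the container should tell (a sketch; the cuda-nvml-dev package name is an assumption based on NVIDIA's usual per-version package naming):

dpkg -S nvml.h || find / -name nvml.h 2>/dev/null   # does any installed package ship the header?
apt-get install cuda-nvml-dev-11-2                  # if not, it normally comes from the toolkit's NVML dev package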

I found a discussion similar to this at https://stackoverflow.com/questions/63030087/tensorflow-source-build-configuration-fails-could-not-find-any-cuda-h-matching

The final comments say that, in their case, the cuda directories were not found, and a suggestion was made to modify configure.py. Does this give a clue to my problem?

Please note that the image pulled contains python3.8 and cuda-toolkit-11.2.

Thanks in advance.