Open cuchio opened 1 year ago
It would help if you could provide the exact clang --version
output and attach the shell script and the preprocessed input file clang should've mentioned at the end of the error message.
O.K. Will do.
First, though, using the non-Docker method I have tried to build in a Python venv. Setup worked well: it detected the cuda-toolkit 12.2 libraries without problems, as well as clang 17 (the Ubuntu 22.04 version).
libcudnn8 was the same as before, i.e. the version required for the latest TensorRT (which, as a dependency, also installed cuda-toolkit 12).
The build went OK for a while before crashing. The clang diagnostic files are attached: idl_gen_php-3f43d3.cpp.txt idl_gen_php-3f43d3.sh.txt
Clang built from recent sources appears to compile the file above w/o problems for me.
if you could provide the exact clang --version

Please do provide the version.
Version installed is:
Ubuntu clang version 17.0.1 (++20230919093346+e19b7dc36bc0-1~exp1~20230919093406.44)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
I am currently checking all the required packages for llvm 17. I think some are not installed.
What I have installed is the packages listed on the llvm site.
1) All required packages
apt-get install clang-format clang-tidy clang-tools clang clangd libc++-dev libc++1 libc++abi-dev libc++abi1 libclang-dev libclang1 liblldb-dev libllvm-ocaml-dev libomp-dev libomp5 lld lldb llvm-dev llvm-runtime llvm python3-clang
with lldb left uninstalled due to error
lldb-18 : Depends: python3-lldb-18 but it is not installable
2) llvm 17.
apt-get install clang-17 lldb-17 lld-17
apt-get install libllvm-17-ocaml-dev libllvm17 llvm-17 llvm-17-dev llvm-17-doc llvm-17-examples llvm-17-runtime
apt-get install clang-17 clang-tools-17 clang-17-doc libclang-common-17-dev libclang-17-dev libclang1-17 clang-format-17 python3-clang-17
These installed some new packages as well as updated other packages.
Will do and re-run build shortly.
No luck; error files attached.
I tried both preprocessed sources with Clang 17.0.1 with assertions enabled, but both compiled just fine for me.
@cuchio Since we are not able to reproduce your crash locally, it would be helpful if you could reduce the preprocessed sources yourself using creduce, and post here a link to Compiler Explorer where the crash reproduces. It doesn't have to be a nice, short reproducer: just something we can pick up on our side.
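A possible shape for that creduce run, as a sketch only: the file name is taken from the attachments above, the clang flags should be copied from the crash run script, and the grep pattern is an assumption about what the crashing driver prints.

```shell
# Sketch of a creduce setup for the attached crash (assumes creduce is
# installed and idl_gen_php-3f43d3.cpp is the preprocessed source).
# The "interestingness" test must exit 0 iff the crash still reproduces.
cat > interesting.sh <<'EOF'
#!/bin/sh
# Flags should be copied from the idl_gen_php-3f43d3.sh run script.
/usr/lib/llvm-17/bin/clang -std=c++17 -O2 -c idl_gen_php-3f43d3.cpp \
    -o /dev/null 2>&1 | grep -q 'PLEASE submit a bug report'
EOF
chmod +x interesting.sh
# Then run: creduce ./interesting.sh idl_gen_php-3f43d3.cpp
```

creduce repeatedly shrinks the source and keeps only variants for which interesting.sh still exits 0, so the end result is a small file that still crashes the compiler.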
OK, will do.
A quick question first though as I tried again today with all updates. It failed and I noted the following error:
clang: warning: argument unused during compilation: '--cuda-path=/usr/local/cuda-12.2' [-Wunused-command-line-argument]
Does this give a clue to the problem?
Also, I should note my CPU is old, pre-AVX (a Core 2 Quad Q9650). I assume this is not a problem.
Thanks in advance.
A quick question first though as I tried again today with all updates. It failed and I noted the following error: clang: warning: argument unused during compilation: '--cuda-path=/usr/local/cuda-12.2' [-Wunused-command-line-argument] Does this give a clue to the problem?
It doesn't ring a bell for me, unfortunately. We need something reproducible, so we can step through the crash locally in the debugger.
Also, I should note my CPU is old, pre-AVX (a Core 2 Quad Q9650). I assume this is not a problem.
This might be significant if the crash occurs in the x86 backend. Thank you for sharing this bit.
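To confirm what the kernel reports about the CPU's feature flags, a quick check like this could be used (pre-AVX parts such as the Q9650 will not list the flag):

```shell
# Report whether the CPU advertises the AVX feature flag.
if grep -qm1 '\bavx\b' /proc/cpuinfo 2>/dev/null; then
    echo "AVX present"
else
    echo "no AVX"
fi
```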
clang: warning: argument unused during compilation: '--cuda-path=/usr/local/cuda-12.2' [-Wunused-command-line-argument]
Does this give a clue to the problem?
This is expected -- you have a CUDA path specified, but the failure happens in a C++ compilation, which indeed has no use for the CUDA path.
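This is easy to see in isolation: compiling a plain C++ file with --cuda-path set produces the same harmless warning (a sketch; the CUDA path here is a placeholder and the guard simply skips the demo if clang is not on PATH).

```shell
# The driver accepts --cuda-path but has no use for it when the input
# is plain C++ rather than CUDA, hence -Wunused-command-line-argument.
cat > plain.cpp <<'EOF'
int main() { return 0; }
EOF
if command -v clang >/dev/null 2>&1; then
    clang --cuda-path=/usr/local/cuda-12.2 -c plain.cpp -o plain.o 2>&1 |
        grep 'unused' || true
fi
```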
Hi, I have decided to try to compile and install using Docker and the simplest method described at https://www.tensorflow.org/install/source under "GPU support".
I have the NVIDIA drivers installed (nvidia-driver-545) and the nvidia-container-toolkit (https://github.com/NVIDIA/nvidia-container-toolkit).
It appears to be working fine. Command:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        On  | 00000000:01:00.0  On |                  N/A |
| 26%   36C    P8               9W /  75W |      89MiB / 4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
So, appears to be OK.
I then run the docker build steps
docker pull tensorflow/tensorflow:devel-gpu
docker run --gpus all -it -w /tensorflow -v $PWD:/mnt -e HOST_PERMS="$(id -u):$(id -g)" \
    tensorflow/tensorflow:devel-gpu bash
I cd into the tensorflow_src directory (as this is where .git is) and run git pull, which appears to update the version.
root@42d447a8c780:/tensorflow_src# ./configure
You have bazel 6.1.0 installed.
Please specify the location of python. [Default is /usr/bin/python3]:
Found possible Python library paths:
  /usr/lib/python3/dist-packages
  /usr/local/lib/python3.8/dist-packages
Please input the desired Python library path to use. Default is [/usr/lib/python3/dist-packages]
Do you wish to build TensorFlow with ROCm support? [y/N]: y
ROCm support will be enabled for TensorFlow.
Could not find any nvml.h in any subdirectory:
  ''
  'include'
  'include/cuda'
  'include/*-linux-gnu'
  'extras/CUPTI/include'
  'include/cuda/CUPTI'
  'local/cuda/extras/CUPTI/include'
  'targets/x86_64-linux/include'
of:
  '/lib/x86_64-linux-gnu'
  '/usr'
  '/usr/local/cuda'
  '/usr/local/cuda-11.2'
  '/usr/local/cuda/lib64/stubs'
  '/usr/local/cuda/targets/x86_64-linux/lib'
Asking for detailed CUDA configuration...
Please specify the TensorRT version you want to use. [Leave empty to default to TensorRT 6]:
So it appears it can't find nvml.h. I have searched for it without luck, so I am wondering whether it is installed at all.
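One way to check whether the header exists anywhere in the container (a sketch; nvml.h normally ships with the CUDA toolkit under its targets/&lt;arch&gt;/include directory, and the package name in the comment is an assumption):

```shell
# Search the usual CUDA install roots for nvml.h; no output means the
# header is genuinely absent from the image.
find /usr/local/cuda* /usr/include /usr/lib/x86_64-linux-gnu \
    -name nvml.h 2>/dev/null || true
# If absent, installing the matching dev package may help, e.g.:
#   apt-get install cuda-nvml-dev-11-2
```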
I found a discussion similar to this at https://stackoverflow.com/questions/63030087/tensorflow-source-build-configuration-fails-could-not-find-any-cuda-h-matching
The final comments say that, in their case, the CUDA directories were not found, and a suggestion was made to modify configure.py. Does this give a clue to my problem?
Please note that the image pulled contains python3.8 and cuda-toolkit-11.2.
Thanks in advance.
I want to compile the latest build of TensorFlow using clang 17.
The latest TensorFlow pull has the latest changes, and so should be OK with cuda-toolkit 12.2, libcudnn8 8.9.4, and clang 17.
Therefore, following the git pull in Docker, I installed cuda-toolkit 12.2 and libcudnn8 8.9.4.25-1+cuda12.2. cuda-toolkit-12.1 is also installed for TensorRT, and cuda-toolkit-11.2 is installed by default.
The automatic configuration script (configure.py) works OK, except that it does not detect the 12.2 toolkit, so I changed the versions and paths manually to suit. The result is as follows.
build --action_env PYTHON_BIN_PATH="/usr/bin/python3.9"
build --action_env PYTHON_LIB_PATH="/usr/lib/python3/dist-packages"
build --python_path="/usr/bin/python3.9"
build --config=tensorrt
build --action_env TF_CUDA_VERSION="12.2"
build --action_env TF_CUDNN_VERSION="8"
build --action_env CUDA_TOOLKIT_PATH="/usr/local/cuda-12.2"
build --action_env TF_CUDA_COMPUTE_CAPABILITIES="7.5"
build --action_env LD_LIBRARY_PATH="/usr/local/cuda-12.2/lib64:/usr/local/cuda-12.2/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/include/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs"
build --config=cuda_clang
build --action_env CLANG_CUDA_COMPILER_PATH="/usr/lib/llvm-17/bin/clang"
build --config=cuda_clang
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-Wno-sign-compare
test --test_size_filters=small,medium
test --test_env=LD_LIBRARY_PATH
test:v1 --test_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu,-oss_serial
test:v1 --build_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu
test:v2 --test_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu,-oss_serial,-v1only
test:v2 --build_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu,-v1only
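Since the versions and paths were edited by hand, it may be worth sanity-checking that they actually exist in the container before invoking bazel. A minimal sketch, using the paths from the fragment above:

```shell
# Verify the hand-edited toolchain paths before a long bazel build;
# any "MISSING" line points at a path that configure will not find.
for p in /usr/local/cuda-12.2 /usr/lib/llvm-17/bin/clang /usr/bin/python3.9; do
    if [ -e "$p" ]; then
        echo "ok: $p"
    else
        echo "MISSING: $p"
    fi
done
```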
The bazel build then fails. Error note is:
ERROR: /root/.cache/bazel/bazel_root/43801f1e35f242fb634ebbc6079cf6c5/external/com_google_protobuf/BUILD.bazel:364:11: Compiling src/google/protobuf/compiler/cpp/service.cc [for tool] failed: (Segmentation fault): clang failed: error executing command (from target @com_google_protobuf//:protoc_lib) /usr/lib/llvm-17/bin/clang -MD -MF bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf/objs/protoc_lib/0/service.d ... (remaining 49 arguments skipped)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
Program arguments: /usr/lib/llvm-17/bin/clang -MD -MF bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf/objs/protoc_lib/0/service.d -frandom-seed=bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf/objs/protoc_lib/0/service.o -DBAZEL_CURRENT_REPOSITORY="com_google_protobuf" -iquote external/com_google_protobuf -iquote bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf -iquote external/zlib -iquote bazel-out/k8-opt-exec-50AE0418/bin/external/zlib -isystem external/com_google_protobuf/src -isystem bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf/src -isystem external/zlib -isystem bazel-out/k8-opt-exec-50AE0418/bin/external/zlib -fmerge-all-constants -Wno-builtin-macro-redefined -DDATE="redacted" -DTIMESTAMP="redacted" -DTIME="redacted" -fPIE -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=1 -fstack-protector -Wall -Wno-invalid-partial-specialization -fno-omit-frame-pointer -no-canonical-prefixes -DNDEBUG -g0 -O2 -ffunction-sections -fdata-sections --cuda-path=/usr/local/cuda-12.2 -g0 -w -Wno-sign-compare -g0 -std=c++17 -DHAVE_ZLIB -Woverloaded-virtual -Wno-sign-compare -c external/com_google_protobuf/src/google/protobuf/compiler/cpp/service.cc -o bazel-out/k8-opt-exec-50AE0418/bin/external/com_google_protobuf/_objs/protoc_lib/0/service.o
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
Thanks for your help.