llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

LLVM stage 3 builds on Alpine fail with OOM even though it's not OOM #60499

Open Gelbpunkt opened 1 year ago

Gelbpunkt commented 1 year ago

This is a very weird issue that I can consistently reproduce in an Alpine Linux environment, for example in a container.

On Alpine Linux, stage 3 builds always fail with an "out of memory" error, like so:

ninja: job failed: : && /build/build/llvm/stage1/bin/clang++ -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fprofile-instr-use="/build/build/llvm/profdata.prof" -flto=thin -fno-common -Woverloaded-virtual -Wno-nested-anon-types -O2 -g -DNDEBUG -Wl,--emit-relocs -fuse-ld=/build/build/llvm/stage1/bin/ld.lld -Wl,--color-diagnostics -fprofile-instr-use="/build/build/llvm/profdata.prof" -flto=thin    -Wl,--gc-sections tools/clang/tools/clang-import-test/CMakeFiles/clang-import-test.dir/clang-import-test.cpp.o -o bin/clang-import-test  -Wl,-rpath,"\$ORIGIN/../lib"  lib/libLLVMCore.a  lib/libLLVMSupport.a  lib/libLLVMTargetParser.a  lib/libclangAST.a  lib/libclangBasic.a  lib/libclangCodeGen.a  lib/libclangDriver.a  lib/libclangFrontend.a  lib/libclangLex.a  lib/libclangParse.a  lib/libclangSerialization.a  lib/libclangDriver.a  lib/libLLVMWindowsDriver.a  lib/libLLVMOption.a  lib/libclangSema.a  lib/libclangEdit.a  lib/libclangSupport.a  lib/libclangAnalysis.a  lib/libclangASTMatchers.a  lib/libclangAST.a  lib/libclangLex.a  lib/libclangBasic.a  lib/libLLVMCoverage.a  lib/libLLVMFrontendHLSL.a  lib/libLLVMLTO.a  lib/libLLVMExtensions.a  lib/libPolly.a  lib/libPollyISL.a  lib/libLLVMPasses.a  lib/libLLVMCoroutines.a  lib/libLLVMipo.a  lib/libLLVMFrontendOpenMP.a  lib/libLLVMLinker.a  lib/libLLVMIRPrinter.a  lib/libLLVMInstrumentation.a  lib/libLLVMVectorize.a  lib/libLLVMCodeGen.a  lib/libLLVMBitWriter.a  lib/libLLVMObjCARCOpts.a  lib/libLLVMScalarOpts.a  lib/libLLVMAggressiveInstCombine.a  
lib/libLLVMInstCombine.a  lib/libLLVMTarget.a  lib/libLLVMTransformUtils.a  lib/libLLVMAnalysis.a  lib/libLLVMProfileData.a  lib/libLLVMSymbolize.a  lib/libLLVMDebugInfoPDB.a  lib/libLLVMDebugInfoMSF.a  lib/libLLVMDebugInfoDWARF.a  lib/libLLVMObject.a  lib/libLLVMIRReader.a  lib/libLLVMBitReader.a  lib/libLLVMAsmParser.a  lib/libLLVMCore.a  lib/libLLVMRemarks.a  lib/libLLVMBitstreamReader.a  lib/libLLVMMCParser.a  lib/libLLVMMC.a  lib/libLLVMDebugInfoCodeView.a  lib/libLLVMTextAPI.a  lib/libLLVMBinaryFormat.a  lib/libLLVMTargetParser.a  lib/libLLVMSupport.a  lib/libLLVMDemangle.a  -lrt  -ldl  -lm  /lib/libz.so && :
LLVM ERROR: out of memory
Allocation failed
clang-17: error: unable to execute command: Aborted (core dumped)

The issue is that the system is not out of memory at all. I'm building on a system with an AMD Epyc 7402P (24c/48t) and 256GB of RAM. I wrote a very simple Python script to ensure that this is NOT an OOM and my eyes looking at top did not fool me:

import humanize
import psutil
import time

way_too_big_integer = 1 * (10 ** 12) # 1TB
smallest_memory_free = way_too_big_integer
smallest_memory_available = way_too_big_integer
highest_memory_used = 0
highest_percent = 0

while True:
    memory = psutil.virtual_memory()

    smallest_memory_free = min(smallest_memory_free, memory.free)
    smallest_memory_available = min(smallest_memory_available, memory.available)
    highest_memory_used = max(highest_memory_used, memory.used)
    highest_percent = max(highest_percent, memory.percent)

    print(f"Highest Used: {humanize.naturalsize(highest_memory_used)} ({highest_percent}%), Smallest Free: {humanize.naturalsize(smallest_memory_free)}, Smallest Available: {humanize.naturalsize(smallest_memory_available)}")

    time.sleep(0.1)

All it does is sample system-wide memory usage every 0.1s and keep track of the extreme values.

At the end of the build, it shows:

Highest Used: 58.7 GB (36.3%), Smallest Free: 132.6 GB, Smallest Available: 150.6 GB

So there is enough memory for LLVM!
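The script above tracks system-wide totals, so it cannot show whether one process hit a limit of its own. As a complementary sketch (not part of the original report), the per-process peaks that Linux exposes in /proc/<pid>/status could be sampled as well. The helper names `parse_vm_fields` and `read_vm_fields` are hypothetical, and the code assumes a Linux /proc filesystem:

```python
import re

# Fields of interest in /proc/<pid>/status; the kernel reports them in kB.
VM_FIELDS = ("VmPeak", "VmSize", "VmHWM", "VmRSS")


def parse_vm_fields(status_text):
    """Return {field: kilobytes} for the Vm* fields found in a status dump."""
    result = {}
    for line in status_text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+)\s+kB$", line.strip())
        if match and match.group(1) in VM_FIELDS:
            result[match.group(1)] = int(match.group(2))
    return result


def read_vm_fields(pid="self"):
    """Read and parse /proc/<pid>/status; return {} if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status") as handle:
            return parse_vm_fields(handle.read())
    except OSError:
        # The process has exited, or /proc is not available on this system.
        return {}
```

Calling `read_vm_fields(pid)` in a loop for the PID of the failing link step would show how much virtual address space (VmPeak) and resident memory (VmHWM) that single process reached before it aborted.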

Here's my container setup:

podman run \
    --rm \
    -it \
    --name clang \
    --privileged \
    --pids-limit=-1 \
    --ulimit=host \
    --ipc=host \
    --cgroups=disabled \
    --security-opt label=disable \
    alpine:edge ash

The flags ensure there are zero limitations to CPU and memory usage in the container. I can build AOSP just fine with the same flags, so the allocation failure is definitely not due to the container setup.
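One way to double-check that claim from inside the container is to print the resource limits the build processes inherit. This is a minimal sketch using Python's standard `resource` module; the choice of limits and the helper names `describe_limit` and `collect_limits` are assumptions made for illustration, and it assumes a Linux system:

```python
import resource

# Limits that can make an allocation or thread creation fail even when
# the machine as a whole still has free RAM.
LIMITS = {
    "RLIMIT_AS": resource.RLIMIT_AS,        # total virtual address space
    "RLIMIT_DATA": resource.RLIMIT_DATA,    # data segment (heap)
    "RLIMIT_STACK": resource.RLIMIT_STACK,  # main-thread stack
    "RLIMIT_NPROC": resource.RLIMIT_NPROC,  # processes/threads per user
}


def describe_limit(value):
    """Render a raw rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def collect_limits():
    """Return {name: (soft, hard)} for the limits listed above."""
    report = {}
    for name, key in LIMITS.items():
        soft, hard = resource.getrlimit(key)
        report[name] = (describe_limit(soft), describe_limit(hard))
    return report
```

Printing `collect_limits()` in the same shell that runs ninja shows whether any soft limit is lower than expected. If every entry reads "unlimited" (or a large value for the stack), the limits can be ruled out as the cause.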

These commands reproduce the error in this container:

$ mkdir /build && cd /build
# System upgrade
$ apk upgrade
# LLVM dependencies
$ apk add clang cmake git linux-headers lld llvm musl-dev ninja python3 zlib-dev
# LLVM sources
$ git clone https://github.com/llvm/llvm-project.git --depth 1
$ mkdir -p build/llvm/stage1 && cd build/llvm/stage1

# Stage 1
$ cmake \
    -G Ninja \
    -Wno-dev \
    -DCLANG_ENABLE_ARCMT=OFF \
    -DCLANG_ENABLE_STATIC_ANALYZER=OFF \
    -DCLANG_PLUGIN_SUPPORT=OFF \
    -DLLVM_ENABLE_BINDINGS=OFF \
    -DLLVM_ENABLE_OCAMLDOC=OFF \
    -DLLVM_INCLUDE_DOCS=OFF \
    -DLLVM_INCLUDE_EXAMPLES=OFF \
    -DCMAKE_C_COMPILER=/usr/bin/clang-15 \
    -DCMAKE_CXX_COMPILER=/usr/bin/clang++ \
    -DLLVM_USE_LINKER=/usr/bin/ld.lld \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-alpine-linux-musl \
    -DLLVM_ENABLE_PROJECTS="clang;lld;bolt;compiler-rt" \
    -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \
    -DCOMPILER_RT_BUILD_CRT=OFF \
    -DCOMPILER_RT_BUILD_XRAY=OFF \
    -DCOMPILER_RT_BUILD_SANITIZERS=OFF \
    -DCOMPILER_RT_BUILD_GWP_ASAN=OFF \
    -DLLVM_TARGETS_TO_BUILD=host \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_BUILD_UTILS=OFF \
    -DLLVM_ENABLE_BACKTRACES=OFF \
    -DLLVM_ENABLE_WARNINGS=OFF \
    -DLLVM_INCLUDE_TESTS=OFF \
    -DLLVM_ENABLE_TERMINFO=OFF \
    -DLLVM_PARALLEL_LINK_JOBS=14 \
    /build/llvm-project/llvm
$ ninja

$ mkdir ../stage2 && cd ../stage2

# Stage 2
$ cmake \
    -G Ninja \
    -Wno-dev \
    -DCLANG_ENABLE_ARCMT=OFF \
    -DCLANG_ENABLE_STATIC_ANALYZER=OFF \
    -DCLANG_PLUGIN_SUPPORT=OFF \
    -DLLVM_ENABLE_BINDINGS=OFF \
    -DLLVM_ENABLE_OCAMLDOC=OFF \
    -DLLVM_INCLUDE_DOCS=OFF \
    -DLLVM_INCLUDE_EXAMPLES=OFF \
    -DCMAKE_AR=/build/build/llvm/stage1/bin/llvm-ar \
    -DCMAKE_C_COMPILER=/build/build/llvm/stage1/bin/clang \
    -DCLANG_TABLEGEN=/build/build/llvm/stage1/bin/clang-tblgen \
    -DCMAKE_CXX_COMPILER=/build/build/llvm/stage1/bin/clang++ \
    -DLLVM_USE_LINKER=/build/build/llvm/stage1/bin/ld.lld \
    -DLLVM_TABLEGEN=/build/build/llvm/stage1/bin/llvm-tblgen \
    -DCMAKE_RANLIB=/build/build/llvm/stage1/bin/llvm-ranlib \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-alpine-linux-musl \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_TARGETS_TO_BUILD="ARM;AArch64;X86" \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_INSTALL_PREFIX=/build/install \
    -DLLVM_BUILD_INSTRUMENTED=IR \
    -DLLVM_BUILD_RUNTIME=OFF \
    -DLLVM_LINK_LLVM_DYLIB=ON \
    -DLLVM_VP_COUNTERS_PER_SITE=6 \
    -DLLVM_ENABLE_TERMINFO=OFF \
    -DLLVM_PARALLEL_LINK_JOBS=14 \
    /build/llvm-project/llvm
$ ninja

# Here I do some PGO profiles with Linux kernels
$ apk add bash bison curl coreutils diffutils elfutils-dev findutils flex make openssl-dev patch perl xz
$ git clone https://github.com/ClangBuiltLinux/tc-build.git /build/tc-build
$ /build/tc-build/kernel/build.sh --pgo -t "ARM;AArch64;X86" -b /build/build/llvm

$ ../stage1/bin/llvm-profdata merge -output /build/build/llvm/profdata.prof profiles/*.profraw

# Stage 3

$ mkdir ../stage3 && cd ../stage3

$ cmake \
    -G Ninja \
    -Wno-dev \
    -DCLANG_ENABLE_ARCMT=OFF \
    -DCLANG_ENABLE_STATIC_ANALYZER=OFF \
    -DCLANG_PLUGIN_SUPPORT=OFF \
    -DLLVM_ENABLE_BINDINGS=OFF \
    -DLLVM_ENABLE_OCAMLDOC=OFF \
    -DLLVM_INCLUDE_DOCS=OFF \
    -DLLVM_INCLUDE_EXAMPLES=OFF \
    -DCMAKE_AR=/build/build/llvm/stage1/bin/llvm-ar \
    -DCMAKE_C_COMPILER=/build/build/llvm/stage1/bin/clang \
    -DCLANG_TABLEGEN=/build/build/llvm/stage1/bin/clang-tblgen \
    -DCMAKE_CXX_COMPILER=/build/build/llvm/stage1/bin/clang++ \
    -DLLVM_USE_LINKER=/build/build/llvm/stage1/bin/ld.lld \
    -DLLVM_TABLEGEN=/build/build/llvm/stage1/bin/llvm-tblgen \
    -DCMAKE_RANLIB=/build/build/llvm/stage1/bin/llvm-ranlib \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-alpine-linux-musl \
    -DLLVM_ENABLE_PROJECTS="clang;compiler-rt;lld;polly;bolt" \
    -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \
    -DCOMPILER_RT_BUILD_CRT=OFF \
    -DCOMPILER_RT_BUILD_XRAY=OFF \
    -DCOMPILER_RT_BUILD_GWP_ASAN=OFF \
    -DLLVM_TARGETS_TO_BUILD="ARM;AArch64;X86" \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_INSTALL_PREFIX=/build/install \
    -DLLVM_PROFDATA_FILE=/build/build/llvm/profdata.prof \
    -DLLVM_ENABLE_LTO=Thin \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,--emit-relocs" \
    -DLLVM_ENABLE_TERMINFO=OFF \
    -DLLVM_PARALLEL_LINK_JOBS=14 \
    /build/llvm-project/llvm
$ ninja

I've been using 14 parallel link jobs instead of 48 to try to work around this, but it makes no difference.

I am guessing that this is somehow related to musl. I vaguely remember that it has a smaller default thread stack size than glibc. Could that be the reason for the error? Adding -DCMAKE_EXE_LINKER_FLAGS="-Wl,-z,stack-size=2097152" doesn't help, but maybe I misunderstand how to raise the stack size.

efriedma-quic commented 1 year ago

An "out of memory" error means malloc failed. On Linux with overcommit enabled, malloc can't really fail unless you try to allocate more memory than the system has in one allocation. So probably a miscompile or something like that, nothing to do with the amount of memory your system has.
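Whether overcommit is actually enabled can be read from /proc/sys/vm/overcommit_memory. The following is a small sketch for checking it, assuming a Linux system. The helper names `describe_overcommit` and `read_overcommit` are hypothetical, and the mode descriptions paraphrase the kernel's documented values 0, 1 and 2:

```python
# Meaning of the values the kernel accepts for vm.overcommit_memory.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(raw):
    """Translate the raw file contents into a human-readable mode."""
    try:
        mode = int(raw.strip())
    except ValueError:
        return "unrecognized value"
    return OVERCOMMIT_MODES.get(mode, "unrecognized value")


def read_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Read the current overcommit mode; return 'unavailable' on error."""
    try:
        with open(path) as handle:
            return describe_overcommit(handle.read())
    except OSError:
        return "unavailable"
```

If `read_overcommit()` reports strict accounting, an allocation can fail while plenty of memory is still free, which would be worth ruling out before suspecting a miscompile.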

DimitryAndric commented 1 year ago

Alternatively, because musl is different, it could be that its malloc does fail when there is some general lack of memory. :)