llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[flang] Possible regression: fatal internal error: CHECK(range_.Contains(at)) #102495

Open BwL1289 opened 1 month ago

BwL1289 commented 1 month ago
fatal internal error: CHECK(range_.Contains(at)) failed at /tmp/<redacted>/staging/llvm_toolchain/source/llvm-project-82f52d9c42d926e23955b42128abff064825d6c8/flang/lib/Parser/provenance.cpp(474)

This looks to be a regression of #77791.

We are building scipy from commit: 76bf366486b0583dca789823878b80d610eb571b. Same behavior on tagged release 1.14.0.

The platform is Amazon Linux 2023 (Fedora based). Checked on both x86_64 (Intel Sapphire Rapids) and aarch64 (AWS Graviton4, neoverse-v2).

See attached log: flang_llvm_scipy.txt

Tagging @h-vetinari as you're on the front lines with flang and scipy :)

llvmbot commented 1 month ago

@llvm/issue-subscribers-flang-frontend

Author: Benjamin Leff (BwL1289)

```
fatal internal error: CHECK(range_.Contains(at)) failed at /tmp/<redacted>/staging/llvm_toolchain/source/llvm-project-82f52d9c42d926e23955b42128abff064825d6c8/flang/lib/Parser/provenance.cpp(474)
```

This looks to be a regression of #77791.

We are building scipy from commit: `76bf366486b0583dca789823878b80d610eb571b`. Same behavior on tagged release `1.14.0`.

The platform is Amazon Linux 2023 (Fedora based). Checked on both x86_64 (Intel Sapphire Rapids) and aarch64 (AWS Graviton4, neoverse-v2).

See attached log: [flang_llvm_scipy.txt](https://github.com/user-attachments/files/16550460/flang_llvm_scipy.txt)

Tagging @h-vetinari as you're on the front lines with flang and scipy :)

kkwli commented 1 month ago

We encounter a similar problem when building on AIX with our downstream clang as the build compiler. However, if I switch to the upstream clang (18.1.8), the problem goes away. I haven't had much success pinpointing where the problem is. I suspect the file is miscompiled, but I may be wrong.

BwL1289 commented 1 month ago

@kkwli flang is broken on llvm 18 (see comment).

This means rolling back is not an option for us.

BwL1289 commented 1 month ago

FWIW, it appears that the regression was reintroduced in this commit.

See flang/include/flang/Parser/provenance.h and flang/lib/Parser/provenance.cpp.

klausler commented 1 month ago

@clementval

klausler commented 1 month ago

that commit doesn't have anything to do with the driver, though

clementval commented 1 month ago

Do you have a way to reproduce this?

BwL1289 commented 1 month ago

@clementval is that a question for me or @klausler?

Let me know how I can best support.

clementval commented 1 month ago

> @clementval is that a question for me or @klausler?
>
> Let me know how I can best support.

For you @BwL1289. Do you have a reproducer or a way to reproduce the build failure if I want to do it locally?

h-vetinari commented 1 month ago

> Tagging @h-vetinari as you're on the front lines with flang and scipy :)

Thanks for the ping. I've been building scipy with the flang 19 release candidates, and that works. You seem to be building flang 20 from main? If so, more power to you - the earlier we catch potential regressions, the better!

To that effect, @kiranchandramohan had mentioned on discourse:

> Ideally if we can set up a CI that always tests main development branch with scipy on Windows then we can guard against regressions.

I've been working on getting support for (llvm-)flang into meson, but unfortunately https://github.com/mesonbuild/meson/pull/13323 narrowly missed the 1.5 release. It'll be in 1.6, but that'll still take a few months AFAIU. Once meson 1.6 is out, it should be relatively easy to set up the mentioned CI.

BwL1289 commented 1 month ago

@clementval here're the steps to reproduce:

  1. We run our builds in Docker, based on the amazonlinux:2023 image (Fedora-based OS).
  2. We're using the "Development Tools" group from dnf, which installs GCC (v11.4.1 20230605 (Red Hat 11.4.1-2)) and GNU binutils (2.39.6).
  3. Then we build GCC v14.2 and GNU binutils v2.43 (from release branches, commits 897cd794d341a3bdd3195e90ebeea054ac80bf65 and 8659b9b492124d7c45282b578c3279fb00c433ee, respectively).
  4. We build LLVM stage 1 with the following flags (w/o flang at this point, as GCC fails to build it). See stage_1.txt.
  5. We build LLVM stage 2 using the just-built stage 1 with these flags. See stage_2.txt.
  6. Next we build OpenBLAS. The same error happens with both of the following configurations, so we don't think it's an OpenBLAS problem: commit 5bdd3a05f020af83d9e5f943c233e4ca510e87fd, and tagged release 0.3.28.
  7. Now we install the buildtime dependencies required for numpy and scipy. We use commits for meson and meson_python as tagged releases don't yet have the flang-new support (thanks to @h-vetinari for making it possible!).
    pip3 install --no-binary :all: cython pythran pybind11 meson_python@git+https://github.com/mesonbuild/meson-python.git@d93d4de2d56bacf6fd32f3ee3f18494ea38d05f0 meson@git+https://github.com/mesonbuild/meson.git@43b80e02ce0e87dfcf069111e62ad8eff4435d6e
  8. Next, we build NumPy (successfully)
    
    export CC="clang"
    export CXX="clang++"
    export CPP="clang-cpp"
    export FC="flang-new" # unused here
    export LD=ld.lld

    pip3 install \
       numpy==2.0.1 \
       -Csetup-args="-Db_colorout=always" \
       `# 1. General:` \
       --no-build-isolation \
       --no-deps numpy \
       -Csetup-args="-Dbuildtype=release" \
       -Csetup-args="-Dc_std=gnu17" \
       -Csetup-args="-Dcpp_std=gnu++23" \
       `# 4. Dependencies:` \
       -Csetup-args="-Dwrap_mode=nofallback" \
       -Csetup-args="-Dblas=openblas" \
       -Csetup-args="-Dblas-order=openblas" \
       -Csetup-args="-Dallow-noblas=false" \
       -Csetup-args="-Dlapack=openblas" \
       -Csetup-args="-Dlapack-order=openblas"


9. To build scipy we use the following command:
```bash
export CC="clang"
export CXX="clang++"
export CPP="clang-cpp"
export FC="flang-new"
export LD=ld.lld

pip3 install \
   scipy@git+https://github.com/scipy/scipy.git@76bf366486b0583dca789823878b80d610eb571b \
   -Csetup-args="-Db_colorout=always" \
   `# 1. General:` \
   --no-build-isolation \
   -Csetup-args="-Dbuildtype=release" \
   -Csetup-args="-Dc_std=gnu17" \
   -Csetup-args="-Dcpp_std=gnu++17" \
   `# 4. Dependencies:` \
   -Csetup-args="-Dwrap_mode=nofallback" \
   -Csetup-args="-Dblas=openblas" \
   -Csetup-args="-Dlapack=openblas" \
   -Csetup-args="-Duse-pythran=true" \
   -Csetup-args="-Dfortran_std=none" `# Required for flang-new`
```

This should reproduce the error.

clementval commented 1 month ago

Wow. I'll try to look at this next week. On which file exactly is the error triggered?

BwL1289 commented 1 month ago

provenance.cpp. The full log is in the original post here.

We are blocked on this so the support is appreciated!

clementval commented 1 month ago

> provenance.cpp. The full log is in the original post here.
>
> We are blocked on this so the support is appreciated!

I meant which file of scipy is failing?

clementval commented 1 month ago

Looks like it fails on this file from PROPACK

https://github.com/scipy/PROPACK/blob/300f803c5ac3372ceb65ba446c71c90f128814b1/double/dlanbpro.F

Just compiling this file with top of tree I cannot reproduce the error. I'll need to look deeper.

BwL1289 commented 1 month ago

> Looks like it fails on this file from PROPACK
>
> https://github.com/scipy/PROPACK/blob/300f803c5ac3372ceb65ba446c71c90f128814b1/double/dlanbpro.F
>
> Just compiling this file with top of tree I cannot reproduce the error. I'll need to look deeper.

That's correct and thank you again for the help.

clementval commented 1 month ago

Can you provide your version of this file?

../scipy/sparse/linalg/_propack/PROPACK/double/dlanbpro.F

BwL1289 commented 1 month ago

> Can you provide your version of this file?
>
> ../scipy/sparse/linalg/_propack/PROPACK/double/dlanbpro.F

It's going to take some time to regenerate it, but I am noticing that it's actually happening in many different places in PROPACK not just that file.
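To narrow down which PROPACK sources trigger the failure, a loop over the files can help. The following is a minimal sketch and not something taken from this thread: the function name `find_failing_sources` is hypothetical, and it assumes the compiler is selected through `FC` (defaulting to `flang-new`) and accepts `-c` and `-o`.

```bash
# Print every source file that the Fortran compiler fails on.
# FC defaults to flang-new; set FC to point at a different build to compare.
find_failing_sources() {
    # Scratch object file, so that nothing is written next to the sources.
    obj=$(mktemp) || return 2
    for src in "$@"; do
        # Compile only (-c) and discard diagnostics; only the exit status matters here.
        if ! "${FC:-flang-new}" -c "$src" -o "$obj" >/dev/null 2>&1; then
            printf '%s\n' "$src"
        fi
    done
    rm -f "$obj"
}
```

Run from a PROPACK checkout, `find_failing_sources double/*.F` would print one failing path per line.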

clementval commented 1 month ago

> > Can you provide your version of this file? ../scipy/sparse/linalg/_propack/PROPACK/double/dlanbpro.F
>
> It's going to take some time to regenerate it, but I am noticing that it's actually happening in many different places in PROPACK not just that file.

If you can provide one of the files that is failing, that would be nice. I tried to build PROPACK locally at the revision specified by scipy and I did not see any error. scipy is probably doing something to the sources.

BwL1289 commented 1 month ago

Will do. Thanks. LLVM introduced a regression in a recent commit that's preventing us from getting to this stage in the build.

I am going to submit a bug report and get back to you with the versioned files.

BwL1289 commented 1 month ago

@clementval here is our version of ../scipy/sparse/linalg/_propack/PROPACK/double/dlanbpro.F.

I ran a diffcheck between this file and the file in the git submod in scipy and it appears the only differences are trailing whitespaces in many places throughout the file.
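A comparison like this can be scripted. Below is a minimal sketch, assuming both copies of the file are available locally; the function name `same_modulo_trailing_ws` is hypothetical and not part of any tool mentioned here.

```bash
# Succeed (exit status 0) when two files are identical once trailing
# whitespace has been stripped from every line; exit status 1 otherwise.
same_modulo_trailing_ws() {
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || return 2
    # Remove trailing spaces, tabs and carriage returns from each line.
    sed 's/[[:space:]]*$//' "$1" > "$tmp_a"
    sed 's/[[:space:]]*$//' "$2" > "$tmp_b"
    # cmp -s is silent and reports the result through its exit status.
    cmp -s "$tmp_a" "$tmp_b"
    status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

If the function succeeds for the local copy and the submodule copy, trailing whitespace is the only difference between them.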

BwL1289 commented 1 month ago

And here are the flang recommended versions (crash reproducer) to report:

  1. dlanbpro_1.txt
  2. dlanbpro_sh.txt

We are able to reproduce the error trying to build this file directly on this llvm commit: 20b2c9f10fe09f2c5cbd3da7f0af8df24f62e899

Also note: we are able to compile arbitrary fortran files without error, so the error seems related to this file.

Here are the steps to reproduce:

git clone <PROPACK git link>
cd <into PROPACK>
flang-new double/dlanbpro.F
bash-5.2# flang-new -v -v -v double/dlanbpro.F
flang-new version 20.0.0git
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/llvm_toolchain/bin
Found candidate GCC installation: /opt/gnu_toolchain/lib/gcc/x86_64-unknown-linux-gnu/14.2.1
Selected GCC installation: /opt/gnu_toolchain/lib/gcc/x86_64-unknown-linux-gnu/14.2.1
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda-12.6, version 
 "/opt/llvm_toolchain/bin/flang-new" -fc1 -triple x86_64-unknown-linux-gnu -emit-obj -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -target-cpu x86-64 -resource-dir /opt/llvm_toolchain/lib/clang/20 -mframe-pointer=all -o /tmp/dlanbpro-2a03d4.o -x f95-cpp-input double/dlanbpro.F

fatal internal error: CHECK(range_.Contains(at)) failed at /tmp/<redacted>/staging/llvm_toolchain/source/llvm-project-20b2c9f10fe09f2c5cbd3da7f0af8df24f62e899/flang/lib/Parser/provenance.cpp(474)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /opt/llvm_toolchain/bin/flang-new -fc1 -triple x86_64-unknown-linux-gnu -emit-obj -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -target-cpu x86-64 -resource-dir /opt/llvm_toolchain/lib/clang/20 -mframe-pointer=all -o /tmp/dlanbpro-2a03d4.o -x f95-cpp-input double/dlanbpro.F
 #0 0x0000560b17110668 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/opt/llvm_toolchain/bin/flang-new+0x2635668)
 #1 0x0000560b1710d95e llvm::sys::RunSignalHandlers() (/opt/llvm_toolchain/bin/flang-new+0x263295e)
 #2 0x0000560b1711123a SignalHandler(int) Signals.cpp:0:0
 #3 0x00007fcf0fdfedd0 __restore_rt (/lib64/libc.so.6+0x54dd0)
 #4 0x00007fcf0fe4b53c __pthread_kill_implementation (/lib64/libc.so.6+0xa153c)
 #5 0x00007fcf0fdfed26 gsignal (/lib64/libc.so.6+0x54d26)
 #6 0x00007fcf0fdd27f3 abort (/lib64/libc.so.6+0x287f3)
 #7 0x0000560b18bb55d6 (/opt/llvm_toolchain/bin/flang-new+0x40da5d6)
 #8 0x0000560b18ab642b Fortran::parser::AllSources::MapToOrigin(Fortran::parser::Provenance) const (/opt/llvm_toolchain/bin/flang-new+0x3fdb42b)
 #9 0x0000560b18ab6fc2 Fortran::parser::AllSources::GetInclusionInfo(std::__1::optional<Fortran::common::Interval<Fortran::parser::Provenance>> const&) const (/opt/llvm_toolchain/bin/flang-new+0x3fdbfc2)
#10 0x0000560b1759ba2c (anonymous namespace)::FirConverter::genLocation(Fortran::parser::CharBlock const&) Bridge.cpp:0:0
#11 0x0000560b17997a67 Fortran::lower::defineCommonBlocks(Fortran::lower::AbstractConverter&, std::__1::vector<std::__1::pair<Fortran::common::Reference<Fortran::semantics::Symbol const>, unsigned long>, std::__1::allocator<std::__1::pair<Fortran::common::Reference<Fortran::semantics::Symbol const>, unsigned long>>> const&) (/opt/llvm_toolchain/bin/flang-new+0x2ebca67)
#12 0x0000560b176330d0 (anonymous namespace)::FirConverter::createGlobalOutsideOfFunctionLowering(std::__1::function<void ()> const&) Bridge.cpp:0:0
#13 0x0000560b1759550b Fortran::lower::LoweringBridge::lower(Fortran::parser::Program const&, Fortran::semantics::SemanticsContext const&) (/opt/llvm_toolchain/bin/flang-new+0x2aba50b)
#14 0x0000560b17153b52 Fortran::frontend::CodeGenAction::beginSourceFileAction() (/opt/llvm_toolchain/bin/flang-new+0x2678b52)
#15 0x0000560b1714d681 Fortran::frontend::FrontendAction::beginSourceFile(Fortran::frontend::CompilerInstance&, Fortran::frontend::FrontendInputFile const&) (/opt/llvm_toolchain/bin/flang-new+0x2672681)
#16 0x0000560b1712b25b Fortran::frontend::CompilerInstance::executeAction(Fortran::frontend::FrontendAction&) (/opt/llvm_toolchain/bin/flang-new+0x265025b)
#17 0x0000560b1715279a Fortran::frontend::executeCompilerInvocation(Fortran::frontend::CompilerInstance*) (/opt/llvm_toolchain/bin/flang-new+0x267779a)
#18 0x0000560b16c16186 fc1_main(llvm::ArrayRef<char const*>, char const*) (/opt/llvm_toolchain/bin/flang-new+0x213b186)
#19 0x0000560b16c13d57 main (/opt/llvm_toolchain/bin/flang-new+0x2138d57)
#20 0x00007fcf0fde9eb0 __libc_start_call_main (/lib64/libc.so.6+0x3feb0)
#21 0x00007fcf0fde9f60 __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x3ff60)
#22 0x0000560b16bf2ea5 _start (/opt/llvm_toolchain/bin/flang-new+0x2117ea5)
flang-new: error: unable to execute command: Aborted (core dumped)
flang-new: error: flang frontend command failed due to signal (use -v to see invocation)
flang-new version 20.0.0git
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/llvm_toolchain/bin
flang-new: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
flang-new: note: diagnostic msg: /tmp/dlanbpro-004a7f
flang-new: note: diagnostic msg: /tmp/dlanbpro-004a7f.sh
flang-new: note: diagnostic msg: 

********************

BwL1289 commented 1 month ago

@clementval we've traced the error to this section.

Based on this part of the stack dump, it seems to be related to the aforementioned commit.

#8 0x0000560b18ab642b Fortran::parser::AllSources::MapToOrigin(Fortran::parser::Provenance) const (/opt/llvm_toolchain/bin/flang-new+0x3fdb42b)
#9 0x0000560b18ab6fc2 Fortran::parser::AllSources::GetInclusionInfo(std::__1::optional<Fortran::common::Interval<Fortran::parser::Provenance>> const&) const (/opt/llvm_toolchain/bin/flang-new+0x3fdbfc2)
#10 0x0000560b1759ba2c (anonymous namespace)::FirConverter::genLocation(Fortran::parser::CharBlock const&) Bridge.cpp:0:0

Let me know how I can best help to resolve.

clementval commented 1 month ago

I tried to reproduce the error at the commit you pointed me to, but it works fine for me. Are you using a different version of the stat.h that these files include? The error is coming from the inclusion of a file that we follow for the location information. Does one of the include directories on the command line have a stat.h?

BwL1289 commented 1 month ago

No, we are not using a different version of stat.h.

Can you provide a reproducible example so we can check how you're able to compile successfully?

clementval commented 1 month ago

I'm trying to compile the file with the same command line arguments:

flang-new -Iscipy/sparse/linalg/_propack/liblib__cpropack.a.p -I../scipy/sparse/linalg/_propack -DNDEBUG -D_FILE_OFFSET_BITS=64 -Wall -O3 -mno-outline-atomics -flto=full  -fcolor-diagnostics -fPIC -U_OPENMP -module-dir scipy/sparse/linalg/_propack/liblib__cpropack.a.p -c dlanbpro.F 

I had to remove some arguments as they are unknown to flang

flang (LLVM option parsing): Unknown command line argument '-polly'.  Try: 'flang (LLVM option parsing) --help'
flang (LLVM option parsing): Did you mean '--color'?
flang (LLVM option parsing): Unknown command line argument '-polly-vectorizer=stripmine'.  Try: 'flang (LLVM option parsing) --help'
flang (LLVM option parsing): Did you mean '--slp-vectorize-hor=stripmine'?

clementval commented 1 month ago

Are you able to give me the preprocessed file?

You can use -E to run only the preprocessor.

flang-new -E ...
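
When capturing the preprocessed source, it helps to pass the same `-I` and `-D` options as the real build so that the same headers are picked up. A minimal sketch, assuming `flang-new -E` writes the preprocessed text to standard output; the function name and the output file name in the usage note are hypothetical.

```bash
# Run only the preprocessor and save its output.
# Usage: preprocess_to_file OUTPUT [compiler options...] SOURCE
preprocess_to_file() {
    out="$1"
    shift
    # -E stops after preprocessing; all remaining arguments are passed through unchanged.
    "${FC:-flang-new}" -E "$@" > "$out"
}
```

For example: `preprocess_to_file dlanbpro.i -I../scipy/sparse/linalg/_propack -DNDEBUG -D_FILE_OFFSET_BITS=64 dlanbpro.F`, reusing the include and define options from the command line quoted earlier.
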
gorloffslava commented 1 month ago

@clementval, I'll work on providing these files ASAP.

Meanwhile - I was able to further trace down this issue. If LLVM stage 2 is built w/o LLVM_ENABLE_LTO, flang-new works w/o any problems for scipy, PROPACK, and in general.
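To compare the two situations side by side, the stage 2 configure step can be parameterized on the LTO mode. This is only a sketch under assumptions: the real flags used here are in stage_2.txt and are not reproduced, the function name, generator, project list and build directory names are illustrative, and it assumes the command is run from the root of an llvm-project checkout with the stage 1 `clang`/`clang++` on `PATH`. `LLVM_ENABLE_LTO` is the LLVM CMake option named above and accepts `Off`, `On`, `Thin` or `Full`.

```bash
# Configure a stage 2 build tree with a chosen LTO mode.
# Usage: configure_stage2 <Off|Thin|Full> <build-dir>
configure_stage2() {
    lto_mode="$1"
    build_dir="$2"
    # CMAKE can be overridden, e.g. to point at a specific cmake binary.
    "${CMAKE:-cmake}" -G Ninja -S llvm -B "$build_dir" \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_COMPILER=clang \
        -DCMAKE_CXX_COMPILER=clang++ \
        -DLLVM_ENABLE_PROJECTS="clang;mlir;flang" \
        -DLLVM_ENABLE_LTO="$lto_mode"
}
```

Calling `configure_stage2 Off build-nolto` and `configure_stage2 Full build-lto` would give two trees whose `flang-new` binaries can be tested against the same source file.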

clementval commented 1 month ago

I can provide a fix that will likely remove the issue on your side but it would be cool to be able to reproduce it to make sure.

gorloffslava commented 1 month ago

@clementval, we'll return to you with more details and files within a day. Currently running another build with final checks (LTO + fat lto objects).

Currently, the problem is reproducible with flang built with both Full and Thin LTO w/o fat objects. It makes me think that the problem is a miscompilation of stage 2 flang by stage 1 clang.

clementval commented 1 month ago

You might want to try with this patch

https://github.com/llvm/llvm-project/pull/104281

BwL1289 commented 1 month ago

Will give this a try and report back. Thank you for the support.

gorloffslava commented 1 month ago

@clementval This fix works! Really big thanks for your help.

With this fix, all combinations of Full and Thin LTO, and of fat and regular objects, work. Without this fix, it fails if flang is built with any LTO, whether or not fat objects are enabled.

Whether SciPy itself is built w/ or w/o LTO doesn't matter in either case.

I've also tried to build the samples and tests from LLVM (and beyond). W/o this fix, flang fails on any include directive if built w/ LTO. Not sure why it happens, but there is probably some problem in LTO that causes flang to be miscompiled.

Will rebuild w/o patch again to gather requested artifacts.

kkwli commented 1 month ago

Just another piece of information ... on AIX, the patch makes the assert go away while compiling Lower/location.f90 but the output verification fails. The current output (with the patch) is:

%7 = fir.call @_FortranAioOutputAscii(%2, %5, %6) fastmath<contract> : (!fir.ref<i8>, !fir.ref<i8>, i64) -> i1 loc(fused<#fir<loc_kind_array[ base,  inclusion]>>["llvm-project/flang/test/Lower/location1.inc":1:10, "llvm-project/flang/test/Lower/location0.inc":1:1])

and it is expected:

! CHECK: fir.call @_FortranAioOutputAscii(%{{.*}}, %{{.*}}, %{{.*}}) fastmath<contract> : (!fir.ref<i8>, !fir.ref<i8>, i64) -> i1 loc(fused<#fir<loc_kind_array[ base,  inclusion,  inclusion]>>["{{.*}}location1.inc":1:10, "{{.*}}location0.inc":1:1, "{{.*}}location.f90":4:1])

It looks like the second "inclusion" is missing. The build and test work fine on Linux.
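The difference between the two lines can be checked mechanically by counting the `inclusion` entries inside `loc_kind_array[...]`. A minimal sketch; the helper name `count_inclusions` is hypothetical and is not part of the flang test suite.

```bash
# Print the number of "inclusion" entries in the loc_kind_array[...] part
# of a fused location attribute passed as the first argument.
count_inclusions() {
    printf '%s\n' "$1" |
        # Keep only the text between "loc_kind_array[" and the next "]".
        sed -n 's/.*loc_kind_array\[\([^]]*\)].*/\1/p' |
        # Emit one line per match, then count the lines.
        grep -o 'inclusion' |
        wc -l |
        tr -d ' '
}
```

Applied to the AIX output above this prints 1, while the CHECK line expects a location with 2 such entries.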

clementval commented 1 month ago

Mmm that's weird. I would not expect the provenance information to be different on a different system. @kkwli Do you mean that the test is currently failing on AIX?

kkwli commented 1 month ago

> Mmm that's weird. I would not expect the provenance information to be different on a different system. @kkwli Do you mean that the test is currently failing on AIX?

@clementval Yes, on AIX. Without the patch, it asserts; with the patch, it has different output.

clementval commented 1 month ago

@gorloffslava @BwL1289 The fix has been merged. There is something weird happening with the provenance information on some systems, but you should not get an assertion anymore.

BwL1289 commented 1 month ago

Thank you @clementval. Appreciate your help!

clementval commented 1 month ago

@BwL1289 I'll let you close the issue if the current main works for you.

BwL1289 commented 1 month ago

@clementval thanks. Do you believe it's best to also communicate this issue to the maintainers responsible for LTO? Without the fix, this problem was only observed if flang was built with LTO (discussed above).

BwL1289 commented 3 weeks ago

> @clementval thanks. Do you believe it's best to also communicate this issue to the maintainers responsible for LTO? Without the fix, this problem was only observed if flang was built with LTO (discussed above).

@clementval following up ^. Not sure what's best here. Appreciate the guidance.

clementval commented 2 weeks ago

> > @clementval thanks. Do you believe it's best to also communicate this issue to the maintainers responsible for LTO? Without the fix, this problem was only observed if flang was built with LTO (discussed above).
>
> @clementval following up ^. Not sure what's best here. Appreciate the guidance.

Yeah, it might be good to get their insight on the issue.