Allow building stage 1+ with the LLVM backend

bollu commented 10 months ago

This PR changes the build system and outlines uses of extern inline to enable LLVM stage2 builds. Draft PR till this branch actually succeeds.

Kha commented 10 months ago

The, hopefully, last remaining error is quite puzzling: this works

"D:/a/lean4/lean4/build/stage1/leanc.sh" D:/a/lean4/lean4/build/stage1/shell/CMakeFiles/shell.dir/lean.cpp.obj D:/a/lean4/lean4/build/stage1/shell/CMakeFiles/shell.dir/manifest.rc.obj -lleanshared    -O3 -DNDEBUG -o D:/a/lean4/lean4/build/stage1/bin/lean.exe

but this fails:

D:/a/lean4/lean4/build/stage1/leanc.sh -o "D:/a/lean4/lean4/build/stage1/bin/leanc.exe" D:/a/lean4/lean4/build/stage1/temp/Leanc.o -O3 -DNDEBUG -lleanshared  -v
 "D:/a/lean4/lean4/build/stage1/bin/ld.lld" --sysroot=D:/a/lean4/lean4/build/stage1 -m i386pep -Bdynamic -o D:/a/lean4/lean4/build/stage1/bin/leanc.exe D:/a/lean4/lean4/build/stage1/lib/crt2.o D:/a/lean4/lean4/build/stage1/lib/crtbegin.o -LD:/a/lean4/lean4/build/stage1/lib -LD:/a/lean4/lean4/build/stage1/lib/lean -LD:/a/lean4/lean4/build/stage1/x86_64-w64-mingw32/lib -LD:/a/lean4/lean4/build/stage1/x86_64-w64-mingw32/mingw/lib -LD:/a/lean4/lean4/build/stage1/lib -LD:/a/lean4/lean4/build/stage1/x86_64-w64-mingw32/sys-root/mingw/lib -LD:/a/lean4/lean4/build/stage1/lib/clang/15.0.1/lib/windows D:/a/lean4/lean4/build/stage1/temp/Leanc.o -lleanshared -Bstatic -lgmp -lunwind -Bdynamic -lgmp -lucrtbase --stack 104857600 -lm -lbcrypt -lmingw32 D:/a/lean4/lean4/build/stage1/lib/clang/15.0.1/lib/windows/libclang_rt.builtins-x86_64.a -lmoldname -lmingwex -lpthread -ladvapi32 -lshell32 -luser32 -lkernel32 -lmingw32 D:/a/lean4/lean4/build/stage1/lib/clang/15.0.1/lib/windows/libclang_rt.builtins-x86_64.a -lmoldname -lmingwex -lkernel32 D:/a/lean4/lean4/build/stage1/lib/crtend.o
ld.lld: error: undefined symbol: l_Lean_Compiler_FFI_getCFlags
>>> referenced by D:/a/lean4/lean4/build/stage1/temp/Leanc.o:(l_main___lambda__4)

...but the command lines at least should be identical on master

Kha commented 10 months ago

!bench

Kha commented 10 months ago

Due to new commits on master, we have a new test failure because of new warnings:

[06:57 (00.000005)] lean-test-stage1> > /nix/store/h1sl8hi6wp2i435d06cwpz9y3dhmdiil-lean-bin-tools/include/lean/lean.h:1942:79: warning: omitting the parameter name in a function definition is a C2x extension [-Wc2x-extensions]
[06:57 (00.000006)] lean-test-stage1> > static inline lean_object* lean_expr_checker_get_max_ctor_fields(lean_object *) {
[06:57 (00.000005)] lean-test-stage1> >                                                                               ^
...

We could work on filtering out these warnings but we should probably instead make sure the header is valid C99(?) /cc @bollu

Kha commented 10 months ago

Let's please also use the correct lean_object * typedefs in these functions

leanprover-bot commented 10 months ago

Here are the benchmark results for commit 74fe86fb9079d1c0046700146c4a202ba55156bb. There were significant changes against commit 367b38701fb9cca80328624a016c3b3acfd5e2cd:

  Benchmark                  Metric         Change
  ============================================================
+ binarytrees                instructions    -1.1% (-1289.9 σ)
+ binarytrees.st             instructions    -1.1% (-3141.3 σ)
- liasolver                  maxrss           3.6%   (180.1 σ)
- qsort                      maxrss           3.6%   (180.1 σ)
- qsort                      task-clock       1.4%    (10.4 σ)
- qsort                      wall-clock       1.4%    (10.1 σ)
- rbmap                      task-clock       7.9%    (14.7 σ)
- rbmap                      wall-clock       7.8%    (14.7 σ)
+ rbmap_fbip                 branches        -1.7% (-3896.7 σ)
- tests/bench/ interpreted   instructions     2.9%  (6167.1 σ)
- unionfind                  task-clock       4.1%    (12.5 σ)
- unionfind                  wall-clock       4.1%    (12.4 σ)

Kha commented 10 months ago

Interesting, this is the diff with LLVM disabled. But since especially rbmap should not use any of the outlined functions, it is hard to see any causal relationship with this PR. I will turn LLVM back on for now and do a run with it.

Kha commented 10 months ago

!bench

leanprover-bot commented 10 months ago

Here are the benchmark results for commit da47e71de1a0d744cb4c21ab115a179c6f6196cf. There were significant changes against commit 367b38701fb9cca80328624a016c3b3acfd5e2cd:

  Benchmark                  Metric         Change
  ============================================================
+ binarytrees                instructions    -1.1%  (-756.7 σ)
+ binarytrees.st             instructions    -1.1%  (-296.2 σ)
- liasolver                  maxrss           3.6%   (178.1 σ)
- qsort                      maxrss           3.6%   (185.1 σ)
- rbmap                      task-clock       7.6%    (26.7 σ)
- rbmap                      wall-clock       7.6%    (27.1 σ)
+ rbmap_fbip                 branches        -1.7% (-1542.6 σ)
- tests/bench/ interpreted   instructions     2.9%  (9562.3 σ)

Kha commented 10 months ago

My bad, the benchmarking setup does not actually use LLVM yet...

Kha commented 10 months ago

Nice, looks like it's only missing visibility modifiers on initializers (https://github.com/leanprover/lean4/actions/runs/5642004685/job/15281100057?pr=2340). @bollu Could you fix that and then put the Emit changes in a separate PR so we can do an update-stage0 on master in between?

bollu commented 10 months ago

@Kha I assume I need an update-stage0 for the windows build? I’m not sure what’s up with the Linux release lake failure.

Kha commented 10 months ago

Hmm, the test might be flaky but twice in a row and never before is a bit concerning. Anyways, if Windows passes, please go ahead with the Emit PR

hargoniX commented 10 months ago

I cannot reproduce the failures on my local setup :(

bollu commented 10 months ago

@Kha I asked @hargoniX to rerun the linux tests on their machine, where it did succeed. Could we rerun the linux release CI? I would be troubled if leanlaketest_116 fails once more.

bollu commented 10 months ago

@Kha I squash-merged this branch against the latest upstream/master. I resolved the stage0 changes by using the versions of the files in upstream/master . This build fails on aarch64: https://github.com/bollu/lean4/actions/runs/5660012113/job/15334806801.

I guess we need a second stage0 update with our CMake changes to propagate the CMake changes which correctly setup --target?

bollu commented 10 months ago

(To be clear, I would like to understand how to structure this PR so we can start reviewing it!)

Kha commented 10 months ago

I squash-merged this branch against the latest upstream/master

I would prefer independent parts to stay as separate commits; it is fine to squash away incremental work internal to the PR. You can manually copy src/CMakeLists.txt to stage0/ which I believe should fix CI, it is only the standard library that we always copy with the script.

bollu commented 10 months ago

I don't have a good enough mental model of the CMake build to successfully disambiguate all the changes we've made :D I can split this up into three unrelated commits: (1) nix changes, (2) outlining, (3) "build system changes" [all the changes to CMake plus the backporting into stage0]. Hope this works for you!

tobiasgrosser commented 10 months ago

Maybe the easiest is to see if merging master (while kipping the stage0 from master), plus copying src/CMakeLists.txt to stage0 already reduces the size of this PR quite a bit?

bollu commented 10 months ago

@Kha can we have a bench to make sure we do not regress performance?

I am hoping that https://github.com/leanprover/lean4/pull/2358 will alleviate the flakiness of the Linux release build.

Kha commented 10 months ago

!bench

leanprover-bot commented 10 months ago

Here are the benchmark results for commit c8a169a4df683b6f47bd0e10d26b15f3944b9a16. There were significant changes against commit 53477089fe255280d7a816371ec315c6e2e94220:

  Benchmark                  Metric             Change
  ================================================================
+ binarytrees                instructions        -1.1%   (-55.4 σ)
+ binarytrees.st             instructions        -1.1% (-3280.0 σ)
+ rbmap_fbip                 branches            -1.7% (-3162.0 σ)
+ rbmap_fbip                 task-clock          -6.8%   (-10.1 σ)
+ rbmap_fbip                 wall-clock          -6.8%   (-10.1 σ)
+ stdlib                     tactic execution    -1.6%   (-13.7 σ)
+ stdlib                     task-clock          -1.4%   (-17.1 σ)
- tests/bench/ interpreted   instructions         2.9%  (5962.0 σ)
- unionfind                  task-clock           3.9%    (78.2 σ)
- unionfind                  wall-clock           3.8%    (75.1 σ)

Kha commented 10 months ago

Oh jeez, the larger timeout actually fixed it... that points to a significant performance regression compared to the C backend then though

tydeu commented 10 months ago

@Kha / @bollu

Oh jeez, the larger timeout actually fixed it... that points to a significant performance regression compared to the C backend then though

Hopefully, the timeout only really needed a 1 second bump and, if so, the performance impact is not so extreme (and there is a good chance that 3 seconds may have already been just barely enough).

tydeu commented 10 months ago

Also, for the return values of the extern functions. you may wish to use lean_obj_res rather than lean_object* to follow the usual style.

Kha commented 10 months ago

Before we merge, we will have to switch back to the C backend anyway. I would suggest creating a copy of the "Linux release" CI job and set LLVM=ON for it and only it for now. We can disable the test for that specific CI job then and consider it part of future performance investigations of the LLVM backend.

Kha commented 10 months ago

But also we should fix that test to not require a timeout :)

bollu commented 10 months ago

@Kha I would like to cover the following triples:

Vanilla LLVM build on linux release.
One cross compile / darwin build (MacOS aarch64).
One windows build for windows linker things (Windows).

Let us haggle on which of these we should have a LLVM=ON build copy 😄

Kha commented 10 months ago

We already have too many CI jobs that we run on each PR. If you want to see the backend be tested for each configuration, let's work on making it the default :) .

tobiasgrosser commented 10 months ago

That's a good incentive!

bollu commented 10 months ago

We already have too many CI jobs that we run on each PR. If you want to see the backend be tested for each configuration, let's work on making it the default :) .

@Kha that sounds good to me. May I know what would be necessary to flip on LLVM by default across all backends? From the milestone list (https://github.com/leanprover/lean4/milestone/3) , I presume that what is missing is:

(1) pulling in the patch that correctly initializes size_t. We can test this on a non 64-bit target in a VM after pulling in https://github.com/leanprover/lean4/pull/2090. This should also fix https://github.com/leanprover/lean4/issues/2085.

(2) making the LLVM backend have performance parity across all tests. I would like to know how to run the benchmark suite locally, and how I could gather a list of tests where we see a performance drop? From the current !bench, I can see that tests/bench/ and unionfind need to be investigated:

Here are the benchmark results for commit c8a169a. There were significant changes against commit 5347708:

  Benchmark                  Metric             Change
  ================================================================
- tests/bench/ interpreted   instructions         2.9%  (5962.0 σ)
- unionfind                  task-clock           3.9%    (78.2 σ)
- unionfind                  wall-clock           3.8%    (75.1 σ)

Is that an exhaustive list of milestones before we can flip on LLVM by default?

Kha commented 10 months ago

May I know what would be necessary to flip on LLVM by default across all backends?

Ideally it should be strictly better than the C backend! I.e. no known regressions and at least some notable benefits. One minor benefit would be that we shouldn't have to distribute libclang anymore, which is about 7% of the unpacked toolchain. Unless, that is, the ecosystem started depending on there being a C compiler, which I haven't checked.

I would like to know how to run the benchmark suite locally, and how I could gather a list of tests where we see a performance drop?

Yes, running the Speedcenter suite locally with and without LLVM is the way to go for now I believe. Please tell me if there is anything missing from https://github.com/leanprover/lean4/blob/master/tests/bench/README.md. Note that, as I said, !bench is not yet set up to use LLVM at all.

Kha commented 10 months ago

Is that an exhaustive list of milestones before we can flip on LLVM by default?

I added everything that directly came to mind to the milestone

hargoniX commented 10 months ago

The thing that is making the CI fail on aarch64 is fixed in the current shell.cpp, I guess we need an update-stage0 for this?

Kha commented 10 months ago

The weak workaround has yet to be reinstated. While doing that, we should document what it does and why it is necessary. I would also welcome opinions on whether there are alternative approaches to it, it is quite a departure from the C backend after all and the repercussions are not clear to me.

edit: for the sake of completeness, my original message on this topic:

We removed the static from these functions in lean.h.bc, but in order to sucessfully link together resulting object files, we must not expose the functions as global symbols (i.e. make sure they are eliminated or make them hidden or weak)
/nix/store/fx1h3mmzl2hjci3nn15n26r57mc3cs3z-binutils-2.40/bin/ld: /home/sebastian/lean/lean/build/release/stage2/lib/lean/libLean.a(Widget.o): in function `lean_set_st_header':
Lean.Widget:(.text+0x4b0): multiple definition of `lean_set_st_header'; /home/sebastian/lean/lean/build/release/stage2/lib/lean/libInit.a(EState.o):Init.Control.EState:(.text+0xdf0): first defined here

tobiasgrosser commented 10 months ago

We removed the static from these functions in lean.h.bc, but in order to sucessfully link together resulting object files, we must not expose the functions as global symbols (i.e. make sure they are eliminated or make them hidden or weak)
/nix/store/fx1h3mmzl2hjci3nn15n26r57mc3cs3z-binutils-2.40/bin/ld: /home/sebastian/lean/lean/build/release/stage2/lib/lean/libLean.a(Widget.o): in function `lean_set_st_header':
Lean.Widget:(.text+0x4b0): multiple definition of `lean_set_st_header'; /home/sebastian/lean/lean/build/release/stage2/lib/lean/libInit.a(EState.o):Init.Control.EState:(.text+0xdf0): first defined here
I guess we want to set the function linkage type to internal then:

https://llvm.org/docs/LangRef.html#linkage-types

Similar to private, but the value shows as a local symbol (STB_LOCAL in the case of ELF) in the object file. This corresponds to the notion of the ‘static’ keyword in C.

These are the relevant functions: https://llvm.org/doxygen/GlobalValue_8h_source.html#l00532

Kha commented 10 months ago

Sounds good. At which step of the pipeline should we do that, after loading the clang-generated bitcode but before linking it with the Lean backend output? I assume the overhead of redoing this (in memory) for every Lean file should be negligible.

tobiasgrosser commented 10 months ago

Sounds good. At which step of the pipeline should we do that, after loading the clang-generated bitcode but before linking it with the Lean backend output? I assume the overhead of redoing this (in memory) for every Lean file should be negligible.

I am not sure I have enough context to answer this. Why did we remove the static in the first place?

Kha commented 10 months ago

The unfortunate problem is that clang will not generate any bitcode at all if we give it lean.h as is with static inlines. So we remove the static for generating lean.h.bc and now just have to revert that change later somewhere on the bitcode level.

tobiasgrosser commented 10 months ago

The unfortunate problem is that clang will not generate any bitcode at all if we give it lean.h as is with static inlines. So we remove the static for generating lean.h.bc and now just have to revert that change later somewhere on the bitcode level.

I see. As a workaround, we could run 'opt -internalize' on the resulting .bc file, assuming all functions should be internalized.

Kha commented 10 months ago

That sounds handy, but the docs seem to imply it doesn't do anything if there is no main? https://llvm.org/docs/Passes.html#internalize-internalize-global-symbols

tobiasgrosser commented 10 months ago

That sounds handy, but the docs seem to imply it doesn't do anything if there is no main? https://llvm.org/docs/Passes.html#internalize-internalize-global-symbols

I teste it and it works even if there is no main function.

Kha commented 10 months ago

Awesome, let me try that right now

Kha commented 10 months ago

Heh, perhaps not unexpectedly that already breaks linking in the header with the backend output. So we would have to selectively internalize these symbols after this linking step.

tobiasgrosser commented 10 months ago

There is a clang option named -femit-all-decls that should force clang to also emit static inline functions. Is this maybe what we want?

Kha commented 10 months ago

That's also good to know but it probably won't solve the issue of LLVMLinkModules2 ignoring internal symbols, no?

tobiasgrosser commented 10 months ago

I thought about this a little more. LLVMLinkModules2 does not expose functionality to internalize definitions while linking and the high-level C++ interface used in llvm-linker also requires a callback that does the actual internalization:

L.linkInModule(
          std::move(M), ApplicableFlags, [](Module &M, const StringSet<> &GVS) {
            internalizeModule(M, [&GVS](const GlobalValue &GV) {
              return !GV.hasName() || (GVS.count(GV.getName()) == 0);
            });
          });

Hence, our best bet would probably after calling LLVMLinkModule2 to iterate over the resulting module and call void LLVMSetLinkage(LLVMValueRef Global, LLVMLinkage Linkage); with LLVMExternalLinkage on any function that is also contained in the llvm.h.bc file?

tobiasgrosser commented 10 months ago

@bollu is on it, so let's see if that works.

bollu commented 10 months ago

Here is the lean.h.bc file: https://gist.github.com/bollu/5700b2b693bfe1de2addaa2957699f6d

I must admit, I find the failure quite puzzling. My best guess is that I'm using the API wrong, so I'm going to write a small test C++/python script to check that we are indeed using the API correctly to grab functions in the module.

tobiasgrosser commented 10 months ago

Hey @Siddharth, I feel that we first need to revert @Kha's patch that internalizes the functions before linking. We meanwhile learned that this will prevent any of them to be linked into the module. Second, .str is private so likely won't be linked. Did you try to just drop the exception and continue even if some symbols are not found?

leanprover / lean4

Allow building stage 1+ with the LLVM backend #2340