Baltoli opened this issue 4 months ago
@JuanCoRo asks:
In the Using a Local Build section, the point is to get away from Nix builds. Does that mean it's best not to use `kup` as well? That is, we shouldn't do

```
kup install k --version v$(cat kasmer-multiversx/deps/k_release)
```

but always resort to:

```
$ cd ../k
$ git checkout v$(cat ../kasmer-multiversx/deps/k_release)
$ git submodule update --init --recursive
$ mvn package -DskipTests
$ cd ../kasmer-multiversx
$ export PATH="/Users/brucecollie/code/k/k-distribution/target/release/k/bin:$PATH"
```

?
The point is to get away from Nix builds of K if you think you might want to tinker with K internals, or if the bug is likely to be in K itself. If the bug is actually more likely to be in your semantics / ecosystem project, then using `kup` is likely to be easier. I realise in hindsight that I didn't end up really using the relevant features of a local K build in this case, and might as well have used `kup` to perform the local build of the semantics. This is muscle memory of mine from a pre-`kup` era!
To help me understand better, could you give an example (even if hand-wavy) of the kind of K tinkering that would make building K with `kup` inappropriate compared with the method described in the issue? I'm trying to understand where `kup` would fall short for debugging/tinkering purposes.
Fundamentally, `kup` with local checkouts and overrides is just as powerful as a local build of K - you should always be able to assemble the composite version that you want using `kup`, direct Nix overrides, or a local checkout.
Where I find the local checkout works a bit better is when I'm changing multiple components and want to iterate quickly - `kup` and `nix build` have a tendency to recompile everything from scratch if you make a change, whereas `mvn package` will generally run a lot faster. For example, imagine I make a change to the LLVM backend, frontend and Haskell backend in parallel (as I did during this debugging process!) - incremental compilation means that's a 1-minute iteration versus 5 minutes to rebuild with Nix.
The other case is using Poetry to point your semantics at an ad-hoc local checkout of Pyk, to add logging or make changes to the underlying Pyk code while you're working:

```toml
kframework = { path = "/your/local/dir/k/pyk", develop = true }
```

If you're in a deeply nested context like Kontrol or Kasmer, this is annoying to do via Nix but trivial using a local build.
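A rough command-line equivalent of that edit, assuming a reasonably recent Poetry (the path below is a placeholder for wherever your K checkout lives):

```sh
# Sketch: add the local Pyk checkout as an editable path dependency instead of
# editing pyproject.toml by hand.
poetry add --editable /your/local/dir/k/pyk
```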
If you had to pass `-tailcallopt` to the code generator to get a `tailcc` function to emit a tail call when calling another `musttail` `tailcc` function, that is a bug in LLVM and should be minimally reproduced in a `.ll` file and reported upstream to LLVM (a rough sketch of such a file is below). The fix you describe is probably fine as a workaround solution if it indeed solves the problem, though.

Alternatively, if the call site isn't `musttail`, maybe that's the problem?
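A minimal reproduction of that shape might look roughly like this - a sketch, not taken from this issue: two `tailcc` functions where the inner call is marked `musttail`, so the code generator should emit a real tail call without `-tailcallopt`:

```sh
# Sketch: write a tiny .ll file and inspect the generated assembly to check
# whether the musttail call becomes a jump (a real tail call) or a plain call.
cat > tail-repro.ll <<'EOF'
define tailcc i64 @callee(i64 %n) {
entry:
  %r = add i64 %n, 1
  ret i64 %r
}

define tailcc i64 @caller(i64 %n) {
entry:
  %r = musttail call tailcc i64 @callee(i64 %n)
  ret i64 %r
}
EOF
llc -O2 tail-repro.ll -o -
```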
> The fix you describe is probably fine as a workaround solution if it indeed solves the problem though.
Yeah, I was aware when writing up that this fix isn't totally in the spirit of your TCO PR, but in the interests of getting my fix merged upstream I opted for the concise band-aid fix.
> Alternatively, if the call site isn't musttail, maybe that's the problem?
Almost certainly so: this case would probably be pretty easy to track down the call site for and fix in the style of your PR. You'd probably also want to run through the codebase to see if there are any other places that morally should be `musttail`, but aren't.
Introduction
This issue is an informal walkthrough of how I solved a tricky issue in K recently. I'm writing this from the position of the issue in question being fixed, so while I'm confident that everything I've written here is correct, I may be slightly paraphrasing or abbreviating my thought process from the time that I actually fixed the bug.
The issue in question touches pretty much the entire stack of tools we use at RV, from an ecosystem-specific semantics project all the way down to the internal details of the LLVM backend. It should therefore (hopefully!) be a good resource to consult when something's broken and it's not immediately clear how to proceed.
Preliminaries
I intend for this guide to be useful to anyone at RV. Even if you've never worked directly on K, almost all of our ecosystem projects touch it in some way, and having some more context on what might be breaking when you see an error should at least help you to do better initial triage and referral of your issues. If you do work directly on K, then hopefully you'll be able to take away some useful tips for working and debugging across the entire project in practice, rather than just on small examples.
Where possible, I will include details of how you can reproduce the individual steps, or at least a pointer to the code I wrote to try and do so. Please read this walkthrough in detail if you have any interest at all, and if there's anything that's not clear, or if you have follow-up questions for me, please ask!
Thanks to @goodlyrottenapple and @jberthold for their help pinning this issue down originally.
Walkthrough
Initial Report
The first evidence of this issue is a somewhat innocuous message from @jberthold when we were debugging a separate issue in the Kasmer project:
Originally, we'd been debugging an issue that manifested during concrete execution, where the LLVM backend would fail to rewrite a program correctly. After diagnosing and fixing that issue, the Kasmer CI was able to get further through its test suite, but still failed on one configuration during a smoke test of symbolic execution and proving (example CI failure is here):
From the CI error message, it's not clear precisely what's going wrong, or even what part of the stack is responsible: the Pyk code responsible for orchestrating a proof in Kasmer is receiving an empty response from the Haskell Backend, but exactly what's causing that isn't obvious.
Local Reproduction
Nix CI
The first step to take when investigating any kind of issue is to reproduce it locally on your own machine by recreating as similar an environment as possible to the failing CI run.
In the case of this issue, the failing CI interface looks something like this:
which lets us pin down the relevant section of workflow YAML:
and therefore to a command that we can run locally to see if the failure happens:
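In practice, that amounts to entering the same Nix development shell that CI uses and running the same command. As a sketch only - the shell name and test script here are invented placeholders, with the real values coming from the workflow step above:

```sh
# Sketch with made-up names: enter the dev shell used by the failing CI job,
# then run the smoke test it runs.
nix develop .#kasmer --command ./run-smoke-test.sh
```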
In this case, the failure reproduces easily on my machine with this exact command and I can begin narrowing down exactly what causes it.
Using a Local Build
Nix builds have their advantages (i.e. that dependencies are completely pinned, and the build environment is hermetic and reproducible). However, they can be tricky to use for local development because of how the build system works.[^2] I therefore suggest trying to extract a build that uses non-Nix dependencies that also reproduces the original issue; this will generally make it a lot easier to tinker with K / Pyk code as you narrow down on what the problem is.
To do so, what I usually do is check out the appropriate version of K locally and set the PATH variable in your project directory to point at it:[^3]
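For this issue, that looked something like the following (the paths are from my machine; adjust them to wherever your checkouts live):

```sh
# Check out the K version pinned by the project, build it with Maven, and put
# the resulting k-distribution on PATH ahead of any other K installation.
cd ../k
git checkout v$(cat ../kasmer-multiversx/deps/k_release)
git submodule update --init --recursive
mvn package -DskipTests
cd ../kasmer-multiversx
export PATH="/Users/brucecollie/code/k/k-distribution/target/release/k/bin:$PATH"
```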
We can see from the project's `flake.nix` that Kasmer is built using `kdist`, and so armed with that local version of K we should be able to build it with:[^4]

Once this finishes, we can then run the same test command that we were previously running inside the Nix development shell:
On my machine, this happily also reproduced the issue, and so we can leave the Nix configuration behind for the time being.
Bug Reporting
At this point, it's worth taking a step back and looking at what the error messages from the Kasmer smoke test are telling us is happening. We can see that there's a Python stack trace:
which suggests that something is wrong during symbolic execution; specifically that Pyk is getting an empty response back from the KORE RPC server when it makes an execute request for some configuration.
The issue here looks like something crashing in the Haskell backend, so I assembled a bug report from the failing execution of Kasmer.[^5] A bug report here is a set of files zipped together that can be used to replay the exact request-response cycle that triggers the bug. It takes a bit of setup to get a bug report replaying in the backend, so I asked @goodlyrottenapple to take a look at the example I'd collected while I looked at some other work.
LLVM Backend
Sam came back to me with the assessment that the requests in my bug report were not triggering a problem in the Haskell backend per se, but were in fact causing the LLVM backend shared library module to crash when the problematic execute request was sent. I verified this by running a slightly different version of the Kasmer command in the smoke test script:
This non-booster version runs slowly but successfully, which suggests that the LLVM backend is indeed the issue.[^6]
The Python stack trace doesn't give us any helpful information about why the backend crashed, so I began to wonder at this point what the easiest way to get an LLDB shell for the problematic booster process was. I weighed up a couple of different ideas, each of which would have required a reasonable amount of investigation and inconvenient setup code to get working properly:
- Work out how to get Pyk to spawn `lldb kore-rpc-booster` for me, with interactivity working properly.
- Run `lldb kore-rpc-booster` myself in a shell, and have Kasmer connect to it via a known port rather than spawning a server itself.

Fortunately, I found a much easier way to get a debugger shell that involved minimal setup code. This Stack Overflow answer suggests that it's possible to get LLDB to wait for a process with a particular name to launch, then attach to it:
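A sketch of that trick, using LLDB's standard attach-by-name options:

```sh
# In one terminal: wait for the next process named kore-rpc-booster to launch,
# then attach to it immediately. (Equivalently, from inside an lldb session:
#   process attach --name kore-rpc-booster --waitfor)
lldb -n kore-rpc-booster -w
```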
That way, when `kasmer` eventually starts a booster server after running its preprocessing steps, we'll get dropped into an LLDB shell for the booster process, from where we can try to debug the issue.[^7] To summarise: in one shell, start LLDB and wait for a process as above; then, in another shell, start `kasmer`:

When we run to the point where the server crashes, we finally have an informative stack trace that shows us what's happening:
Suspicious Symbols
Let's walk through this stack trace to try and get a better picture of what's actually going on in the LLVM backend. At the very bottom (#14) is the Haskell function that's calling into the LLVM backend; the API entrypoint that it reaches first to do so is the `kore_simplify` function, which takes a KORE term as input and performs concrete simplifications on it to produce a result term (essentially, this is a question of evaluating all the function symbols in the input term).

Eventually, the backend needs to evaluate a rewrite rule (the one with ordinal 16290[^8]), and in doing so calls the hook `BYTES.int2bytes`. We can look up the C++ source code for this hook in the LLVM backend repo and confirm that it does in fact call `mpz_clear`.[^9]

This immediately looks suspicious to me; I know from previous experience that we replace the default memory management functions in the GMP library with our own GC-enabled ones. If for some reason the default free is called on a GC-allocated integer, we'll get a segfault much like this one. The previous way I'd seen this happen was if we didn't call `kllvm_init` after loading a shared bindings library. Using LLDB, I quickly verified that we were calling this, and so the underlying reason must be different here.

After scratching my head for a while and not getting anywhere, I noticed something odd about the stack trace:
In these lines, the `__gmp_*` symbols are coming from the `kore-rpc-booster` executable, rather than from the GMP shared library. I confirmed that these symbols are in fact in the binary (see the `nm` sketch below), and found a dependency that looks plausibly like it might statically link GMP.[^10]

Another old PR came to mind at this point: previously, we'd fixed an issue where a function called `step` in the LLVM backend would get mixed up with an identically-named function in libc. Perhaps the same is happening here.
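For reference, the check that the `__gmp_*` symbols really are defined in the booster executable itself looks roughly like this (a sketch; the binary is located via `which` rather than being a literal path):

```sh
# Sketch: list the external symbols defined in the kore-rpc-booster binary and
# look for GMP ones; if they appear here, GMP is statically linked into the
# executable rather than coming from the GMP shared library.
nm -g "$(which kore-rpc-booster)" | grep __gmp
```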
Minimal Example

At this point, I've got a reasonable hypothesis about what might be happening:
We can now make a minimal example using the LLVM backend's testing framework for the C shared library bindings. This K code:
should crash if compiled into a bindings library and executed on macOS. Happily, this is exactly what does happen, and we can be reasonably sure that the example is a faithful distillation of the original issue.
Research: macOS Dynamic Linking
Having validated this hypothesis, we now have to put the legwork in and do some research. I don't fully understand why the issue happens, but need to try and figure it out in order to fix things. I begin by looking at the `llvm-kompile-clang` script, which is an internal part of the backend's compiler that deals with invoking Clang properly to link the backend's runtime with a module generated from a semantics. A quick `grep` for `darwin` identifies a few places where we specialise behaviour for macOS, one of which looks relevant to the linker and symbol resolution issues we're investigating:

After some time googling, it becomes clear that perhaps the `-flat_namespace` option is problematic; this issue suggests that enabling it causes symbols to be mixed up between different dynamically-linked libraries. This sounds very plausibly like what we're dealing with. I removed the option, rebuilt the backend, and ran the minimal test case from before: no crash! It appears that this option is indeed the problem.[^11]
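As an aside, a quick way to check whether a given library was linked with `-flat_namespace` (a sketch, not a step from the original debugging session) is to dump its Mach-O header flags; binaries linked with the default two-level namespace carry the `TWOLEVEL` flag, and flat-namespace ones don't:

```sh
# Sketch: print the Mach-O header (verbosely, with symbolic flag names) for a
# dylib. The path is a placeholder; look for TWOLEVEL in the flags column.
otool -hv /path/to/your/library.dylib
```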
Fixing the Issue

With the problematic flag removed, I re-ran the LLVM backend test suite and found a few breakages:
Some debugging reveals that all three of these are instances of the same abstract problem: with flat namespaces enabled, we were accidentally linking two incompatible copies of the same library and relying on the dynamic loader coincidentally resolving the symbol we needed; once symbols are resolved properly, the incompatibility is exposed. The fix in each case is to link only one canonical copy of each library.[^12]
After doing so, we can re-run the test suite:
The LLVM backend test suite is now working, so I push my branch up to GitHub and watch the CI runs to get a sense of how it's doing on other platforms.
Stack Overflow
Everything seems fine from the perspective of the LLVM backend's CI, so the next step to take is to retry the original (slow) reproduction from the Kasmer project.[^13] Because we've modified the LLVM backend, we need to tell Nix to point at that version rather than the version locked by Kasmer's default flake inputs:
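A sketch of what that override looks like - the flake input path and the checkout location below are placeholders, not the real input names from Kasmer's `flake.nix`:

```sh
# Sketch with placeholder names: point the locked llvm-backend flake input at
# a local checkout containing the fix, then re-enter the development shell.
nix develop . --override-input k-framework/llvm-backend ~/code/llvm-backend
```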
We're getting the same error again despite our fix being applied and the entire K toolchain having been rebuilt.[^14] It seems like we're back at square one on our debugging journey despite having fixed the issue in the minimal reproduction of the original problem. My first thought at this point is that there must be some property of the original Kasmer issue that I failed to capture properly in the minimal reproduction. However, using the same remote LLDB shell trick from earlier reveals a stack trace that looks a bit different to before:
This trace looks suspiciously like a stack overflow (that is, we've made too many recursive calls, and the process has run over the end of the stack space that the operating system has given it - doing so produces `EXC_BAD_ACCESS` on macOS). However, it does suggest that the execution is getting further than it was when we were crashing because of `__gmpz_clear`.

@jberthold helps me identify (on Slack) that this isn't anything to do with GHC's worker threads or Haskell compiler settings, and so I poke a bit more at the call stack and see that each frame in the backtrace is trying to print a symbol like `Lbl'UndsUndsUnds'WASM-COMMON-SYNTAX'Unds'Instrs'Unds'Instr'Unds'Instrs{}`. This suggests that it's a legitimate stack overflow, encountered when printing out a large K cons list.

Research: Backend TCO Changes
I suspect at this point that there may be something else that's changed in the LLVM backend, so I look at the range of K versions that the original problematic PR is changing:
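Concretely, something along these lines shows that range (a sketch: the branch names are placeholders, but `deps/k_release` is the version pin mentioned earlier):

```sh
# Sketch: in the Kasmer repository, show how the pinned K version changes on
# the problematic PR's branch relative to its base branch (names invented).
git diff origin/master...problematic-pr-branch -- deps/k_release
```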
Then, you can use GitHub's compare tool to see what's different between these K versions: https://github.com/runtimeverification/k/compare/v7.1.30...v7.1.70. We're interested in the LLVM backend, and can see that it's been bumped by a few versions:
Looking at the commits in that range, there's one smoking gun that jumps out as an obvious candidate: https://github.com/runtimeverification/llvm-backend/pull/1097. This PR makes some changes to how tail-call optimisation in the LLVM backend is applied, which seems very likely to be involved in a new stack overflow appearing when there wasn't one before.
Fixing the Issue (Part 2)
The fix for this issue turned out to be reasonably straightforward. Per Dwight's PR description:
The PR diff identifies two such cases where we add the `musttail` attribute to function calls, neither of which corresponds to the `visit_children` call from the stack trace. The PR suggests that it removes the need to apply `-tailcallopt` from LLVM, and so I do a quick grep through the LLVM source code to identify what that option actually does internally. I realise that it (as well as having IR-level effects on the generated LLVM) sets a flag for the low-level code generator that we currently aren't setting in the backend:

I enable this flag and set the default optimisation level for Kasmer to `-O1`,[^15] then push my changes and retry the original issue reproduction again:

The test passes!
Success and Final Steps
At this point, the issue in the original reproduction is fixed, and I can compile my changes into a coherent PR, ready for review: https://github.com/runtimeverification/llvm-backend/pull/1109.
Summary
This is a long read, but I'd be really grateful for any feedback, comments or thoughts. If there's anything at all where I've glossed over a detail, or you'd like some more insight into how I reached a conclusion or solved a problem, please comment. I want to leave this document as a useful resource for RVers, and so if there's anything I can possibly add to it I'd be really pleased.
[^1]: This is tempting fate, obviously.
[^2]: We can largely leave Nix behind soon in this guide, but if you'd like to know more about how our Nix infrastructure works then @goodlyrottenapple is the right person to talk to.
[^3]: Using direnv is a nice way to automate this.
[^4]: This will take a very long time (20+ minutes on some machines); it's a known issue with this project.
[^5]: I did have to patch the Kasmer project to get this to work properly; I would say that for any project using the symbolic backend in earnest, it's critical that the bug reporting feature is wired up to the command-line interface properly.
[^6]: The non-booster version doesn't load an LLVM backend shared library; it executes using pure Haskell.
[^7]: Note that if you end up using this technique in practice, you'll need to disable bug reporting. When Pyk populates a bug report, it runs `kore-rpc-booster --version` before spawning the actual server, so you'll end up attached to the version process. Getting LLDB and a bug report is left as an exercise to the reader and will require some modification to Pyk.
[^8]: The LLVM backend assigns every rewrite rule (axiom, really) in a KORE definition a unique integer identifier (ordinal). I didn't ever actually need to do this to debug the issue, but you can use `llvm-kompile-compute-loc` to tell you what rule has a given ordinal.
[^9]: The details aren't critical, but I believe `mpz_clear` is a macro wrapper around the internal function `__gmpz_clear`. I didn't need to go further into the weeds than this, but they're in 1:1 correspondence as far as I can tell.
[^10]: @goodlyrottenapple made a heroic effort at this point to try and excise the static dependency from the booster, but it was too tightly baked into GHC to be viable.
[^11]: Feel free to follow up with me in more detail if you're interested in the internals of the macOS dynamic linker; this description is really under-selling how much research it took to convince myself of what the morally correct solution was here! It's just not directly relevant to the K discussion, so I'm omitting the goriest parts.
[^12]: The debugging here is standard C++ debugging; I'm again glossing over some details to stick to the "K stack" story. I'm happy to explain more!
[^13]: Remember that? Feels like a long time ago, right?
[^14]: At this point, by reading this, you get to save approximately an entire day of developer time.
[^15]: Per Dwight's PR: "Convert TailCallElimination and Mem2Reg from required to optional passes on `-O0`."; at `-O0` we'll still get the huge stack trace because TCO won't be enabled everywhere.