
Guide: Debugging an issue across the RV stack #4542

Baltoli opened this issue 4 months ago

Baltoli commented 4 months ago

Introduction

This issue is an informal walkthrough of how I solved a tricky issue in K recently. I'm writing this after the issue in question was fixed, so while I'm confident that everything I've written here is correct, I may be slightly paraphrasing or abbreviating my thought process from the time I actually fixed the bug.

The issue in question touches pretty much the entire stack of tools we use at RV, from an ecosystem-specific semantics project all the way down to the internal details of the LLVM backend. It should therefore (hopefully!) be a good resource to consult when something's broken and it's not immediately clear how to proceed.

Preliminaries

I intend for this guide to be useful to anyone at RV. Even if you've never worked directly on K, almost all of our ecosystem projects touch it in some way, and having some more context on what might be breaking when you see an error should at least help you to do better initial triage and referral of your issues. If you do work directly on K, then hopefully you'll be able to take away some useful tips for working and debugging across the entire project in practice, rather than just on small examples.

[!IMPORTANT] Callouts like this will highlight the most important general principles that become apparent from this walkthrough.

Where possible, I will include details of how you can reproduce the individual steps, or at least a pointer to the code I wrote to try and do so. Please read this walkthrough in detail if you have any interest at all, and if there's anything that's not clear, or if you have follow-up questions for me, please ask!

Thanks to @goodlyrottenapple and @jberthold for their help pinning this issue down originally.

Walkthrough

Initial Report

The first evidence of this issue is a somewhat innocuous message from @jberthold when we were debugging a separate issue in the Kasmer project:

... This changes the outcome of the smoke test, but unfortunately still causes a failure on MacOS ... (_@jberthold on Slack_)

Originally, we'd been debugging an issue that manifested during concrete execution, where the LLVM backend would fail to rewrite a program correctly. After diagnosing and fixing that issue, the Kasmer CI was able to get further through its test suite, but still failed on one configuration during a smoke test of symbolic execution and proving (example CI failure is here):

... truncated ...
  File "/nix/store/2wzcb0rmjkhi9wbwii3rzkpgplhrw45g-python3.10-kframework-7.1.59/lib/python3.10/site-packages/pyk/kore/rpc.py", line 356, in request
    raise RuntimeError('Empty response received')
RuntimeError: Empty response received
</generatedTop>

{ true #Equals 0 <=Int ARG_0:Int }
KApply(label=KLabel(name='#Equals', params=(KSort(name='Bool'), KSort(name='GeneratedTopCell'))), args=(KToken(token='true', sort=KSort(name='Bool')), KApply(label=KLabel(name='_<=Int_', params=()), args=(KToken(token='0', sort=KSort(name='Int')), KVariable(name='ARG_0', sort=KSort(name='Int'))))))

requires: 0 <=Int ARG_0:Int

Error: Process completed with exit code 1.

From the CI error message, it's not clear precisely what's going wrong, or even what part of the stack is responsible: the Pyk code responsible for orchestrating a proof in Kasmer is receiving an empty response from the Haskell Backend, but exactly what's causing that isn't obvious.

[!IMPORTANT] It's really useful to have a general picture of our active projects, and the dependency graph between them, in your head, for a couple of reasons:

  • Generally, it gives you a rough idea of where the root cause of an issue might be coming from, and therefore who might know about a potential fix.
  • If you're a K maintainer, it's useful to occasionally have a quick skim through active projects to see if anything's been broken without an issue being raised.

In this case, the project in question is Kasmer-MultiversX. This is a semantics project that has a deeply-nested dependency chain (Kasmer depends on the core MultiversX semantics, which themselves depend on the WebAssembly semantics, and from there on K).

Local Reproduction

Nix CI

The first step to take when investigating any kind of issue is to reproduce it locally on your own machine by recreating as similar an environment as possible to the failing CI run.

[!IMPORTANT] Being able to interpret the source code of our GitHub Actions setup is important; knowing what commands are actually being run in CI makes it much easier to pin down a reproducible example. Every project we maintain has a .github/workflows directory containing YAML files that define what runs in CI; if you see something failing in CI, searching those workflow files for a string you see in the GitHub UI is usually a good place to start.
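For example (illustrative; substitute whatever label you actually see in the failing job):

$ grep -rn "smoke" .github/workflows/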

In the case of this issue, the failing CI interface looks something like this:

[Screenshot: the failing Nix CI job in the GitHub Actions UI]

which lets us pin down the relevant section of workflow YAML:

  nix:
    name: 'Nix'
    strategy:
      fail-fast: false
      matrix:
        include:
          - runner: [self-hosted, linux, normal]
          - runner: self-macos-14
    needs: [python-code-quality-checks]
    runs-on: ${{ matrix.runner }}
    steps:
      - name: 'Check out code'
        uses: actions/checkout@v3
        with:
          submodules: recursive
      - name: 'Build Kasmer'
        run: GC_DONT_GC=1 nix build .#kmxwasm --extra-experimental-features 'nix-command flakes' --print-build-logs
      - name: 'Run smoke tests'
        run: GC_DONT_GC=1 nix develop .#kmxwasm-test-shell --extra-experimental-features 'nix-command flakes' --print-build-logs --command './package/smoke-test.sh'
      - name: 'Run claim generation tests'
        run: GC_DONT_GC=1 nix develop .#kmxwasm-test-shell --extra-experimental-features 'nix-command flakes' --print-build-logs --command './generate-claims.sh'

and therefore to a command that we can run locally to see if the failure happens:

$ GC_DONT_GC=1 nix develop .#kmxwasm-test-shell --extra-experimental-features 'nix-command flakes' --print-build-logs --command './package/smoke-test.sh'

In this case, the failure reproduces easily on my machine with this exact command and I can begin narrowing down exactly what causes it.

[!NOTE] We got a bit lucky with this issue in that the failure is exhibited in a Nix CI job. Generally, these examples are really easy to reproduce on a local machine because of how Nix isolates build dependencies and environments. A problem would need to be extremely esoteric before it fails to reproduce in a Nix environment on the same OS.[^1]

If your issue isn't in a Nix build on CI, you might need to do some additional setup to get a local environment that reproduces it. With luck, you can just run the same commands that CI runs on your local machine, but in some cases you might need to set up a Docker container that reproduces the image used in CI. I won't go into that here as it's not on the critical path, but if you're interested in knowing more about this, please let me know.

Using a Local Build

Nix builds have their advantages (i.e. that dependencies are completely pinned, and the build environment is hermetic and reproducible). However, they can be tricky to use for local development because of how the build system works.[^2] I therefore suggest trying to extract a build that uses non-Nix dependencies that also reproduces the original issue; this will generally make it a lot easier to tinker with K / Pyk code as you narrow down on what the problem is.

To do so, what I usually do is check out the appropriate version of K locally and set the PATH variable in your project directory to point at it:[^3]

$ cd ../k
$ git checkout v$(cat ../kasmer-multiversx/deps/k_release)
$ git submodule update --init --recursive
$ mvn package -DskipTests
$ cd ../kasmer-multiversx
$ export PATH="/Users/brucecollie/code/k/k-distribution/target/release/k/bin:$PATH"
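As footnote 3 hints, direnv can automate that last export. A minimal .envrc sketch (the path is just my local checkout and will differ on your machine; run direnv allow once to approve it):

# .envrc in the semantics project: prepend the local K build to PATH on cd
PATH_add /Users/brucecollie/code/k/k-distribution/target/release/k/bin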

We can see from the project's flake.nix that Kasmer is built using kdist, and so armed with that local version of K we should be able to build it with:[^4]

$ poetry -C kmxwasm run kdist -v build -j$(nproc) "mxwasm-semantics.*" "mx-semantics.llvm-kasmer"
/Users/brucecollie/.cache/kdist-02f0c24/wasm-semantics/source
/Users/brucecollie/.cache/kdist-02f0c24/mx-semantics/plugin
/Users/brucecollie/.cache/kdist-02f0c24/mx-semantics/source
/Users/brucecollie/.cache/kdist-02f0c24/mxwasm-semantics/source
/Users/brucecollie/.cache/kdist-02f0c24/mxwasm-semantics/lemma-proofs
/Users/brucecollie/.cache/kdist-02f0c24/mxwasm-semantics/lemma-tests
/Users/brucecollie/.cache/kdist-02f0c24/mxwasm-semantics/haskell
/Users/brucecollie/.cache/kdist-02f0c24/mxwasm-semantics/llvm
/Users/brucecollie/.cache/kdist-02f0c24/mx-semantics/llvm-kasmer
/Users/brucecollie/.cache/kdist-02f0c24/mxwasm-semantics/llvm-library

Once this finishes, we can then run the same test command that we were previously running inside the Nix development shell:

$ poetry -C kmxwasm shell
$ ./package/smoke-test.sh
... truncated ...
  File "/Users/brucecollie/code/k/pyk/src/pyk/kore/rpc.py", line 982, in _request
    return self._client.request(method, **params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brucecollie/code/k/pyk/src/pyk/kore/rpc.py", line 302, in request
    return self._default_client.request(method, **params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brucecollie/code/k/pyk/src/pyk/kore/rpc.py", line 356, in request
    raise RuntimeError('Empty response received')
RuntimeError: Empty response received

On my machine, this happily also reproduced the issue, and so we can leave the Nix configuration behind for the time being.

Bug Reporting

At this point, it's worth taking a step back and looking at what the error messages from the Kasmer smoke test are telling us is happening. We can see that there's a Python stack trace:

Traceback (most recent call last):
  File "/Users/brucecollie/Library/Caches/pypoetry/virtualenvs/kmxwasm-RLFYbBcZ-py3.12/bin/kasmer", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/brucecollie/code/mx-backend/kmxwasm/src/kmxwasm/kasmer/__main__.py", line 55, in main
    kasmer(sys.argv[1:])
  File "/Users/brucecollie/code/mx-backend/kmxwasm/src/kmxwasm/kasmer/__main__.py", line 68, in kasmer
    return _kasmer_verify(opts)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/brucecollie/code/mx-backend/kmxwasm/src/kmxwasm/kasmer/__main__.py", line 90, in _kasmer_verify
    kasmer_verify(
  File "/Users/brucecollie/code/mx-backend/kmxwasm/src/kmxwasm/kasmer/verify.py", line 47, in kasmer_verify
    action.run()
  File "/Users/brucecollie/code/mx-backend/kmxwasm/src/kmxwasm/property.py", line 158, in run
    raise result.exception
  File "/Users/brucecollie/code/mx-backend/kmxwasm/src/kmxwasm/property_testing/running.py", line 266, in run_claim
    extend_result = tools.explorer.extend_cterm(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brucecollie/code/k/pyk/src/pyk/kcfg/explore.py", line 231, in extend_cterm
    cterm, next_states, depth, vacuous, next_node_logs = self.cterm_symbolic.execute(
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brucecollie/code/k/pyk/src/pyk/cterm/symbolic.py", line 129, in execute
    response = self._kore_client.execute(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brucecollie/code/k/pyk/src/pyk/kore/rpc.py", line 1045, in execute
    result = self._request('execute', **params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brucecollie/code/k/pyk/src/pyk/kore/rpc.py", line 982, in _request
    return self._client.request(method, **params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brucecollie/code/k/pyk/src/pyk/kore/rpc.py", line 302, in request
    return self._default_client.request(method, **params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brucecollie/code/k/pyk/src/pyk/kore/rpc.py", line 356, in request
    raise RuntimeError('Empty response received')
RuntimeError: Empty response received

which suggests that something is wrong during symbolic execution; specifically that Pyk is getting an empty response back from the KORE RPC server when it makes an execute request for some configuration.

The issue here looks like something crashing in the Haskell backend, so I assembled a bug report from the failing execution of Kasmer.[^5] A bug report here is a set of files zipped together that can be used to replay the exact request-response cycle that triggers the bug. It takes a bit of setup to get a bug report replaying in the backend, so I asked @goodlyrottenapple to take a look at the example I'd collected while I looked at some other work.

[!IMPORTANT] If you're working with K it's important that (even if you don't understand in depth how each component works) you understand the broad architecture of how the different pieces fit together. For modern K projects, this usually entails an ecosystem-specific layer of Pyk code that orchestrates calls to one of the K backends (the Haskell backend via an RPC layer, and the LLVM backend via a set of native-code Python bindings). There are a lot of different projects involved when a semantics runs, so it's useful to get a sense of where the data is going.

LLVM Backend

Sam came back to me with the assessment that the requests in my bug report were not triggering a problem in the Haskell backend per se, but were in fact causing the LLVM backend shared library module to crash when the problematic execute request was sent. I verified this by running a slightly different version of the Kasmer command in the smoke test script:

$ kasmer verify test_call_add

This non-booster version runs slowly but successfully, which suggests that the LLVM backend is indeed the issue.[^6]

[!NOTE] The Haskell backend uses the LLVM backend heavily to perform concrete simplification and function evaluation. To do so, we compile a definition into a shared library that the Haskell backend can load and make calls to via dlopen / dlsym.
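For anyone who hasn't seen that mechanism before, the sketch below shows the shape of a dlopen / dlsym call sequence. It is not the booster's actual code, and the symbol name and signature (kllvm_init taking no arguments) are assumptions for illustration only; the real entry points live in the backend's C bindings.

#include <dlfcn.h>
#include <cstdio>

int main() {
  // Load the compiled definition, as the booster does with interpreter.dylib
  void *lib = dlopen("./interpreter.dylib", RTLD_NOW);
  if (!lib) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }

  // Resolve an entry point by name; signature assumed for illustration
  using init_fn = void (*)();
  auto init = reinterpret_cast<init_fn>(dlsym(lib, "kllvm_init"));
  if (init) {
    init();
  }

  dlclose(lib);
  return 0;
}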

The Python stack trace doesn't give us any helpful information about why the backend crashed, so at this point I began to wonder what the easiest way to get an LLDB shell for the problematic booster process was. I weighed up a couple of different ideas, each of which would have required a reasonable amount of investigation and inconvenient setup code to get working properly.

Fortunately, I found a much easier way to get a debugger shell that involved minimal setup code. This Stack Overflow answer suggests that it's possible to get LLDB to wait for a process with a particular name to launch, then attach to it:

(lldb) process attach --name kore-rpc-booster --waitfor

That way, when kasmer eventually starts a booster server after running its preprocessing steps, we'll get dropped into an LLDB shell for the booster process, from where we can try to debug the issue.[^7] To summarise: in one shell, start LLDB and wait for the process as above; then, in another shell, start kasmer:

$ kasmer -C tests/contracts/test_adder_passing verify test_call_add --booster
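One wrinkle: the --waitfor attach leaves the booster stopped as soon as it appears, so once LLDB reports that it has attached, resume the process and let it run until the crash:

(lldb) continue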

When we run to the point where the server crashes, we finally have an informative stack trace that shows us what's happening:

Process 98079 resuming
Process 98079 stopped
* thread #3, name = 'ghc_worker', stop reason = signal SIGABRT
    frame #0: 0x000000019a3bea60 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
->  0x19a3bea60 <+8>:  b.lo   0x19a3bea80    ; <+40>
    0x19a3bea64 <+12>: pacibsp
    0x19a3bea68 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x19a3bea6c <+20>: mov    x29, sp
(lldb) bt
* thread #3, name = 'ghc_worker', stop reason = signal SIGABRT
  * frame #0: 0x000000019a3bea60 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x000000019a3f6c20 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x000000019a303a30 libsystem_c.dylib`abort + 180
    frame #3: 0x000000019a213dc4 libsystem_malloc.dylib`malloc_vreport + 896
    frame #4: 0x000000019a217430 libsystem_malloc.dylib`malloc_report + 64
    frame #5: 0x000000019a231494 libsystem_malloc.dylib`find_zone_and_free + 528
    frame #6: 0x0000000104fcd174 kore-rpc-booster`__gmp_default_free + 28
    frame #7: 0x0000000104fc6044 kore-rpc-booster`__gmpz_clear + 76
    frame #8: 0x000000010ab82d1c interpreter.dylib`hook_BYTES_int2bytes(len=<unavailable>, i=<unavailable>, endianness_ptr=0x000003d500000001) at bytes.cpp:105:3 [opt]
    frame #9: 0x000000010a9db7dc interpreter.dylib`apply_rule_16290 + 80
    frame #10: 0x000000010aa46100 interpreter.dylib`evaluate_function_symbol + 60820
    frame #11: 0x000000010ab6efdc interpreter.dylib`construct_initial_configuration(initial=<unavailable>) at ConfigurationParser.cpp:156:24 [opt]
    frame #12: 0x000000010abbab14 interpreter.dylib`kllvm::bindings::simplify_to_term(std::__1::shared_ptr<kllvm::kore_pattern> const&, std::__1::shared_ptr<kllvm::kore_sort> const&) + 264
    frame #13: 0x0000000109f20688 interpreter.dylib`kore_simplify + 44
    frame #14: 0x0000000102477504 kore-rpc-booster`hszmbackendzmboosterzm0zi1zi29zm2n2C0cuusOcJL9DCRzzjtvB_BoosterziLLVMziInternal_zdfCustomUserEventLlvmCallzuzdceventType_info + 3868

Suspicious Symbols

Let's walk through this stack trace to try and get a better picture of what's actually going on in the LLVM backend. At the very bottom (#14) is the Haskell function that's calling into the LLVM backend; the API entrypoint that it reaches first to do so is the kore_simplify function, which takes a KORE term as input and performs concrete simplifications on it to produce a result term (essentially, this is a question of evaluating all the function symbols in the input term).

Eventually, the backend needs to evaluate a rewrite rule (the one with ordinal 16290[^8]), and in doing so calls the hook BYTES.int2bytes. We can look up the C++ source code for this hook in the LLVM backend repo and confirm that it does in fact call mpz_clear.[^9]

[!WARNING] At this point, we've reached one of the critical moments in the debugging process that relied on my accumulated time spent working on this code. I'm afraid there isn't a better way to explain my train of thought other than "I've seen something like this before", but hopefully the post-hoc explanation is good enough to justify the leap.

This immediately looks suspicious to me; I know from previous experience that we replace the default memory management functions in the GMP library with our own GC-enabled ones. If for some reason the default free is called on a GC-allocated integer, we'll get a crash much like this one. The previous way I'd seen this happen was when we didn't call kllvm_init after loading a shared bindings library. Using LLDB, I quickly verified that we were calling it, and so the underlying reason must be different here.
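For context, the replacement mechanism here is GMP's allocator hook. A minimal sketch of the idea (the functions below are placeholders standing in for the backend's real GC-aware allocators, not its actual code):

#include <gmp.h>
#include <cstdlib>

// Placeholders for the backend's GC-aware allocators; malloc/realloc/free are
// used here only so that the sketch compiles and runs.
static void *gc_alloc(size_t size) { return std::malloc(size); }
static void *gc_realloc(void *ptr, size_t, size_t new_size) {
  return std::realloc(ptr, new_size);
}
static void gc_free(void *ptr, size_t) { std::free(ptr); }

// The backend's runtime installs hooks like these at startup. A second,
// statically linked copy of GMP that never has this called on it keeps
// __gmp_default_free, i.e. the plain system free(), which aborts when handed
// a pointer into the GC heap.
void install_gc_allocators() {
  mp_set_memory_functions(gc_alloc, gc_realloc, gc_free);
}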

After scratching my head for a while and not getting anywhere, I noticed something odd about the stack trace:

    frame #6: 0x0000000104fcd174 kore-rpc-booster`__gmp_default_free + 28
    frame #7: 0x0000000104fc6044 kore-rpc-booster`__gmpz_clear + 76

In these lines, the __gmp_* symbols are coming from the kore-rpc-booster executable, rather than from the GMP shared library. I confirmed that these symbols are in fact in the binary:

$ nm -C $(which kore-rpc-booster) | grep __gmpz_clear
00000001045adff8 T ___gmpz_clear

and found a dependency that looks plausibly like it might statically link GMP.[^10] Another old PR came to mind at this point: previously, we'd fixed an issue where a function called step in the LLVM backend would get mixed up with an identically-named function in libc. Perhaps the same thing is happening here.

Minimal Example

At this point, I've got a reasonable hypothesis about what might be happening:

If the kore-rpc-booster executable contains a symbol called __gmpz_clear, then it's possible for the operating system's dynamic loader to resolve calls to __gmpz_clear from the bindings shared library to that symbol, rather than to the one from libgmp. This causes a crash because the statically linked copy of GMP inside the booster hasn't had its memory management functions replaced with the LLVM backend's GC-aware ones.

We can now make a minimal example using the LLVM backend's testing framework for the C shared library bindings. This K code:

module FLAT-NAMESPACE-SYNTAX
  imports BYTES-SYNTAX

  syntax Foo ::= foo()      [symbol(foo)]
               | bar(Bytes) [symbol(bar)]
endmodule

module FLAT-NAMESPACE
  imports DOMAINS
  imports BYTES
  imports FLAT-NAMESPACE-SYNTAX

  rule foo() => bar(Int2Bytes(10, 3489, BE))
endmodule

should crash if compiled into a bindings library and executed on macOS. Happily, this is exactly what does happen, and we can be reasonably sure that the example is a faithful distillation of the original issue.

[!IMPORTANT] There are lots of different setups for running tests across K and its various subprojects, and it can be tricky to figure out the exact idioms for each one. After years of working on K I still occasionally find myself asking people how to run tests in a particular place, but it's worth having a general idea of how to run tests in the different projects.

This section also highlights a really important principle: it's substantially easier to debug an issue if you've minimised it and extracted the key parts into a smaller reproduction. Here, instead of a 20-minute semantics recompile and remote LLDB session, I've turned the original issue into a single C+K combination that runs (and fails, importantly!) in the backend's own test harness. This makes my iteration time seconds rather than half an hour when experimenting.

Research: macOS Dynamic Linking

Having validated this hypothesis, we now have to put the legwork in and do some research. I don't fully understand why the issue happens, but I need to figure it out in order to fix things. I begin by looking at the llvm-kompile-clang script, an internal part of the backend's compiler that deals with invoking Clang correctly to link the backend's runtime with a module generated from a semantics. A quick grep for darwin identifies a few places where we specialise behaviour for macOS, one of which looks relevant to the linker and symbol resolution issues we're investigating:

if [[ "$OSTYPE" == "darwin"* ]]; then
  start_whole_archive="-force_load"
  end_whole_archive=""

  flags+=("-Wl,-flat_namespace" "-Wl,-undefined" "-Wl,dynamic_lookup")
else
  start_whole_archive="-Wl,--whole-archive"
  end_whole_archive="-Wl,--no-whole-archive"
fi

After some time googling, it becomes clear that perhaps the -flat_namespace option is problematic; this issue suggests that enabling it causes symbols to be mixed up between different dynamically-linked libraries. This sounds very plausibly like what we're dealing with. I removed the option, rebuilt the backend, and ran the minimal test case from before: no crash! It appears that this option is indeed the problem.[^11]
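As a sanity check after the rebuild (optional, and the library name here is illustrative), you can confirm that a dylib was linked with the default two-level namespace by looking for the TWOLEVEL flag in its Mach-O header; a flat-namespace build won't have it:

$ otool -hv interpreter.dylib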

Fixing the Issue

With the problematic flag removed, I re-ran the LLVM backend test suite and found a few breakages:

$ lit test
... truncated ...
Failed Tests (3):
  llvm-backend :: c/div_by_zero.kore
  llvm-backend :: c/safe-partial.kore
  llvm-backend :: k-rule-apply/definition.kore

Some debugging reveals that all three of these are instances of the same abstract problem: with flat namespaces enabled, we were accidentally linking two incompatible copies of the same library and relying on the dynamic loader coincidentally resolving every reference to the correct copy; once symbols are resolved properly, the incompatibility is exposed. The fix in each case is to link only one canonical copy of each library.[^12]
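If you need to hunt for this kind of duplicate yourself, one low-tech approach (library name illustrative) is to list what an artefact links dynamically and compare that with symbols already baked in statically, as we did above with nm for kore-rpc-booster:

$ otool -L interpreter.dylib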

After doing so, we can re-run the test suite:

$ lit test
... truncated ...
Testing Time: 69.97s

Total Discovered Tests: 246
  Passed: 246 (100.00%)

The LLVM backend test suite is now working, so I push my branch up to GitHub and watch the CI runs to get a sense of how it's doing on other platforms.

Stack Overflow

Everything seems fine from the perspective of the LLVM backend's CI, so the next step to take is to retry the original (slow) reproduction from the Kasmer project.[^13] Because we've modified the LLVM backend, we need to tell Nix to point at that version rather than the version locked by Kasmer's default flake inputs:

$ GC_DONT_GC=1 nix develop .#kmxwasm-test-shell --extra-experimental-features 'nix-command flakes' --print-build-logs --override-input k-framework/llvm-backend 'github:runtimeverification/llvm-backend/no-flat' --command './package/smoke-test.sh'
... truncated ...
RuntimeError: Empty response received

[!IMPORTANT] K is made up of a lot of moving parts, and when debugging an issue it's important to be able to construct a working K / downstream project installation that contains the components you're interested in at fixed versions. In the line above, I'm using Nix flake input overrides to point the LLVM backend at the version with our bug fix applied. There are other ways to do this: for example, you could check out a "local monorepo" of K and apply changes directly in submodules, and for Python dependencies Poetry will let you point at local checkouts. If you can't figure out how to produce "project X with dependency Y overridden", then ask in #k!

We're getting the same error again, despite our fix being applied and the entire K toolchain having been rebuilt.[^14] It seems like we're back at square one on our debugging journey, even though we fixed the issue in the minimal reproduction of the original problem. My first thought at this point is that there must be some property of the original Kasmer issue that I failed to capture properly in the minimal reproduction. However, using the same LLDB attach trick from earlier reveals a stack trace that looks a bit different from before:

* thread #5, name = 'ghc_worker', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f9d3d60)
  * frame #0: 0x000000010a4b1608 interpreter.dylib`is_symbol_a_binder + 4
    frame #1: 0x000000010a609f40 interpreter.dylib`serialize_configuration_internal(file=<unavailable>, subject=<unavailable>, sort=<unavailable>, is_var=<unavailable>, state_ptr=<unavailable>) at ConfigurationSerializer.cpp:470:20 [opt]
    frame #2: 0x000000010a56739c interpreter.dylib`visit_children + 234060
    ...
    ...many more...
    ...
    frame #162: 0x000000010a568988 interpreter.dylib`visit_children + 239672
    frame #163: 0x000000010a60a464 interpreter.dylib`serialize_configuration_internal(file=<unavailable>, subject=<unavailable>, sort=<unavailable>, is_var=<unavailable>, state_ptr=<unavailable>) at ConfigurationSerializer.cpp:489:3 [opt]
    frame #164: 0x000000010a60d318 interpreter.dylib`serialize_configuration(subject=<unavailable>, sort=<unavailable>, data_out=0x000000700684dba0, size_out=0x000000700684db88, emit_size=true, use_intern=<unavailable>) at ConfigurationSerializer.cpp:622:3 [opt]
    frame #165: 0x00000001099b1fdc interpreter.dylib`kore_simplify + 72
    frame #166: 0x000000010203365c kore-rpc-booster`hszmbackendzmboosterzm0zi1zi32zmACHOGdrCIuNDn16yL6qkno_BoosterziLLVMziInternal_zdfCustomUserEventLlvmCallzuzdceventType_info + 3868

This trace looks suspiciously like a stack overflow (that is, we've made too many recursive calls and the process has run past the end of the stack space the operating system has given it; doing so produces EXC_BAD_ACCESS on macOS). However, it does suggest that execution is getting further than it did when we were crashing in __gmpz_clear.

@jberthold helps me identify (on Slack) that this has nothing to do with GHC's worker threads or Haskell compiler settings, so I poke a bit more at the call stack and see that each frame in the backtrace is trying to print a symbol like Lbl'UndsUndsUnds'WASM-COMMON-SYNTAX'Unds'Instrs'Unds'Instr'Unds'Instrs{}. This suggests that it's a legitimate stack overflow encountered while printing out a large K cons list.
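To make the failure mode concrete, here's a deliberately toy illustration (nothing to do with the backend's real data structures): a recursive traversal of a cons list uses one stack frame per element, so without guaranteed tail calls a sufficiently long list overflows the stack.

#include <cstdio>

// Toy cons list, standing in for a K list such as Instrs ::= Instr Instrs
struct Cons {
  int head;
  Cons *tail;
};

// One stack frame per element: fine for short lists, fatal for a configuration
// containing hundreds of thousands of list elements unless the recursive call
// is compiled as a genuine tail call.
void visit(const Cons *list) {
  if (list == nullptr)
    return;
  std::printf("%d\n", list->head);
  visit(list->tail); // in tail position, but only frame-free if TCO applies
}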

Research: Backend TCO Changes

I suspect at this point that there may be something else that's changed in the LLVM backend, so I look at the range of K versions that the original problematic PR is changing:

[Screenshot: the K version range changed by the problematic PR (v7.1.30 to v7.1.70)]

Then, you can use GitHub's compare tool to see what's different between these K versions: https://github.com/runtimeverification/k/compare/v7.1.30...v7.1.70. We're interested in the LLVM backend, and can see that it's been bumped by a few versions:

[Screenshot: the LLVM backend version bump shown in the K compare view]

Looking at the commits in that range, there's one smoking gun that jumps out as an obvious candidate: https://github.com/runtimeverification/llvm-backend/pull/1097. This PR makes some changes to how tail-call optimisation in the LLVM backend is applied, which seems very likely to be involved in a new stack overflow appearing when there wasn't one before.

Fixing the Issue (Part 2)

The fix for this issue turned out to be reasonably straightforward. Per Dwight's PR description:

In order to make the tail call optimization still work correctly without TailCallElimination manually marking tail calls as tail, we instead explicitly mark them as musttail in the IR generated by the code generator.

The PR diff identifies two such cases where we add the musttail attribute to function calls, neither of which corresponds to the visit_children call from the stack trace. The PR suggests that it removes the need to pass -tailcallopt to LLVM, so I do a quick grep through the LLVM source code to see what that option actually does internally. I realise that, as well as its IR-level effects on the generated LLVM, it sets a flag on the low-level code generator that we currently aren't setting in the backend:

  options.GuaranteedTailCallOpt = true;
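For reference, GuaranteedTailCallOpt lives on llvm::TargetOptions and is picked up when the TargetMachine is created. The sketch below is not the backend's actual code generator setup, just a rough picture of where such a flag plugs in (header paths and the createTargetMachine signature vary a little between LLVM versions):

#include "llvm/MC/TargetRegistry.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Target/TargetOptions.h"
#include "llvm/TargetParser/Host.h"

#include <optional>
#include <string>

llvm::TargetMachine *make_target_machine() {
  llvm::InitializeNativeTarget();

  std::string error;
  std::string triple = llvm::sys::getDefaultTargetTriple();
  auto const *target = llvm::TargetRegistry::lookupTarget(triple, error);
  if (!target)
    return nullptr;

  llvm::TargetOptions options;
  // Equivalent of passing -tailcallopt to llc: guarantee tail calls for
  // calling conventions that support them (such as tailcc), rather than
  // leaving them as a best-effort optimisation.
  options.GuaranteedTailCallOpt = true;

  return target->createTargetMachine(triple, "generic", "", options,
                                     std::nullopt);
}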

I enable this flag and set the default optimisation level for Kasmer to -O1[^15], then push my changes and retry the original reproduction:

$ GC_DONT_GC=1 nix develop .#kmxwasm-test-shell --extra-experimental-features 'nix-command flakes' --print-build-logs --override-input k-framework/llvm-backend 'github:runtimeverification/llvm-backend/no-flat' --command './package/smoke-test.sh'
... truncated ...
Node 25 took 5.379137992858887 sec.
Running the claim 120.73784112930298 sec.
Saving the kcfg 2.0197129249572754 sec.
Success
INFO:pyk.kore.rpc:Stopping KoreServer: :::56821, pid=74007
INFO:pyk.kore.rpc:KoreServer stopped: :::56821, pid=74007

The test passes!

Success and Final Steps

At this point, the issue in the original reproduction is fixed, and I can compile my changes into a coherent PR, ready for review: https://github.com/runtimeverification/llvm-backend/pull/1109.

[!IMPORTANT] There are a couple of important things to take care of even once a tricky problem is resolved. The first is to give your PR a detailed writeup, so that anyone looking at this code in the future can understand your reasoning and motivation for making the change. It's much harder to reverse engineer that from an empty PR message!

The second is to babysit your changes if they need to reach a downstream project. Here, I was waiting for my PR to reach Kasmer, and so I made sure to keep an eye on and approve each of the dependency update PRs (to K, the WASM semantics, the MX semantics, and eventually Kasmer) in turn as they arrived to make sure they would pass CI. If you don't know how to do this or where your code will end up after it's released, please ask someone!

Summary

This is a long read, but I'd be really grateful for any feedback, comments or thoughts. If there's anything at all where I've glossed over a detail, or you'd like some more insight into how I reached a conclusion or solved a problem, please comment. I want to leave this document as a useful resource for RVers, and so if there's anything I can possibly add to it I'd be really pleased.

[^1]: This is tempting fate, obviously.
[^2]: We can largely leave Nix behind soon in this guide, but if you'd like to know more about how our Nix infrastructure works then @goodlyrottenapple is the right person to talk to.
[^3]: Using direnv is a nice way to automate this.
[^4]: This will take a very long time (20+ minutes on some machines); it's a known issue with this project.
[^5]: I did have to patch the Kasmer project to get this to work properly; I would say that for any project using the symbolic backend in earnest, it's critical that the bug reporting feature is wired up to the command-line interface properly.
[^6]: The non-booster version doesn't load an LLVM backend shared library; it executes using pure Haskell.
[^7]: Note that if you end up using this technique in practice, you'll need to disable bug reporting. When Pyk populates a bug report, it runs kore-rpc-booster --version before spawning the actual server, so you'll end up attached to the version process. Getting LLDB and a bug report together is left as an exercise for the reader and will require some modification to Pyk.
[^8]: The LLVM backend assigns every rewrite rule (axiom, really) in a KORE definition a unique integer identifier (ordinal). I didn't actually need to do this to debug the issue, but you can use llvm-kompile-compute-loc to tell you which rule a given ordinal corresponds to.
[^9]: The details aren't critical, but I believe mpz_clear is a macro wrapper around the internal function __gmpz_clear. I didn't need to go further into the weeds than this, but they're in 1:1 correspondence as far as I can tell.
[^10]: @goodlyrottenapple made a heroic effort at this point to try and excise the static dependency from the booster, but it was too tightly baked into GHC to be viable.
[^11]: Feel free to follow up with me in more detail if you're interested in the internals of the macOS dynamic linker; this description is really under-selling how much research it took to convince myself of what the morally correct solution was here! It's just not directly relevant to the K discussion, so I'm omitting the goriest parts.
[^12]: The debugging here is standard C++ debugging; I'm again glossing over some details to stick to the "K stack" story. I'm happy to explain more!
[^13]: Remember that? Feels like a long time ago, right?
[^14]: At this point, by reading this guide, you get to save approximately an entire day of developer time.
[^15]: Per Dwight's PR: "Convert TailCallElimination and Mem2Reg from required to optional passes on -O0."; at -O0 we'll still get the huge stack trace because TCO won't be enabled everywhere.

Baltoli commented 4 months ago

@JuanCoRo asks:

In the Using a Local Build section, the point is to get away from Nix builds. Does that mean it's best not to use kup as well? That is, we shouldn't do

kup install k --version v$(cat kasmer-multiversx/deps/k_release)

but always resort to:

$ cd ../k
$ git checkout v$(cat ../kasmer-multiversx/deps/k_release)
$ git submodule update --init --recursive
$ mvn package -DskipTests
$ cd ../kasmer-multiversx
$ export PATH="/Users/brucecollie/code/k/k-distribution/target/release/k/bin:$PATH"

?

The point is to get away from Nix builds of K if you think you might want to tinker with K internals, or if the bug is likely to be in K itself. If the bug is actually more likely to be in your semantics / ecosystem project, then using kup is likely to be easier. I realise in hindsight that in this case I didn't end up really using the relevant features of a local K build, and in this case I might as well have used kup to perform the local build of the semantics. This is muscle memory of mine from a pre-kup era!

To help me understand better, could you give an example (even if hand-wavy) of K tinkering that would make building K with kup inappropriate compared with the method described in the issue? I'm trying to understand where kup would fall short for debugging/tinkering purposes.

Fundamentally, kup with local checkouts and overrides is just as powerful as a local build of K: you should always be able to assemble the composite version that you want using kup, direct Nix overrides, or a local checkout.

Where I find the local checkout works a bit better is when I'm changing multiple components and want to iterate quickly: kup and nix build have a tendency to recompile everything from scratch if you make a change, whereas mvn package will generally run a lot faster. For example, imagine I make a change to the LLVM backend, frontend and Haskell backend in parallel (as I did during this debugging process!): incremental compilation means that's a one-minute iteration rather than five minutes to rebuild with Nix.

The other case is using poetry to point your semantics at an ad-hoc local checkout of Pyk to add logging or make changes to the underlying Pyk code when you're working:

kframework = { path = "/your/local/dir/k/pyk", develop = true }

If you're in a deeply nested context like Kontrol or Kasmer, this is annoying to do via Nix but trivial using a local build.
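For completeness, that line goes in the semantics project's pyproject.toml under the Poetry dependency table, roughly as in the sketch below (the Python version constraint is illustrative); follow it with a poetry lock and poetry install so the local checkout is picked up:

[tool.poetry.dependencies]
python = "^3.10"
kframework = { path = "/your/local/dir/k/pyk", develop = true }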

dwightguth commented 4 months ago

If you had to pass -tailcallopt to the code generator to get a tailcc function to emit a tail call when calling another musttail tailcc function, that is a bug in llvm and should be minimally reproduced in a .ll file and reported upstream to llvm. The fix you describe is probably fine as a workaround solution if it indeed solves the problem though

dwightguth commented 4 months ago

Alternatively, if the call site isn't musttail, maybe that's the problem?

Baltoli commented 4 months ago

The fix you describe is probably fine as a workaround solution if it indeed solves the problem though.

Yeah, I was aware when writing up that this fix isn't totally in the spirit of your TCO PR, but in the interests of getting my fix merged upstream I opted for the concise band-aid fix.

Alternatively, if the call site isn't musttail, maybe that's the problem?

Almost certainly so: it would probably be pretty easy to track down the call site for this case and fix it in the style of your PR. You'd probably also want to run through the codebase to see if there are any other places that morally should be musttail, but aren't.