m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

DRTIO aux packets do not use the whole space #2401

Closed Spaqin closed 1 month ago

Spaqin commented 2 months ago

Due to an earlier miscalculation, the larger DRTIO aux packets (such as those carrying DDMA and subkernel data) use at most 512 bytes in total, although the gateware specifies a maximum packet size of 1024 bytes. This is the main issue.

It does not seem to be a straightforward increase, however.

Applying 1024 to SAT_PAYLOAD_MAX_SIZE, on master, results in:

ld.lld: error: cargo/riscv32ima-unknown-none-elf/debug/libruntime.a(runtime-f1ba3fd692fe1184.runtime.6ydbbw3q-cgu.0.rcgu.o):(function runtime::mgmt::thread::h14d0c15d1c70fba1: .text._ZN7runtime4mgmt6thread17h14d0c15d1c70fba1E+0x1cc): relocation R_RISCV_JAL out of range: -722220 is not in [-524288, 524287]; references fringe::arch::imp::swap::trampoline::h1c873b4d27f29d20
>>> referenced by riscv32.rs:210 (/home/spaqin/.cargo/git/checkouts/libfringe-2757742fe32879a8/3ecbe53/src/arch/riscv32.rs:210)
>>> defined in cargo/riscv32ima-unknown-none-elf/debug/libruntime.a(runtime-f1ba3fd692fe1184.runtime.6ydbbw3q-cgu.0.rcgu.o)

ld.lld: error: cargo/riscv32ima-unknown-none-elf/debug/libruntime.a(runtime-f1ba3fd692fe1184.runtime.6ydbbw3q-cgu.0.rcgu.o):(function runtime::sched::Io::sleep::hd3e37a05b142c552: .text._ZN7runtime5sched2Io5sleep17hd3e37a05b142c552E+0x8c): relocation R_RISCV_JAL out of range: -692862 is not in [-524288, 524287]; references fringe::arch::imp::swap::trampoline::h1c873b4d27f29d20
>>> referenced by riscv32.rs:210 (/home/spaqin/.cargo/git/checkouts/libfringe-2757742fe32879a8/3ecbe53/src/arch/riscv32.rs:210)
>>> defined in cargo/riscv32ima-unknown-none-elf/debug/libruntime.a(runtime-f1ba3fd692fe1184.runtime.6ydbbw3q-cgu.0.rcgu.o)

ld.lld: error: cargo/riscv32ima-unknown-none-elf/debug/libruntime.a(runtime-f1ba3fd692fe1184.runtime.6ydbbw3q-cgu.0.rcgu.o):(function runtime::sched::Io::suspend::h55f56f083c1d858b: .text._ZN7runtime5sched2Io7suspend17h55f56f083c1d858bE+0x60): relocation R_RISCV_JAL out of range: -691490 is not in [-524288, 524287]; references fringe::arch::imp::swap::trampoline::h1c873b4d27f29d20
>>> referenced by riscv32.rs:210 (/home/spaqin/.cargo/git/checkouts/libfringe-2757742fe32879a8/3ecbe53/src/arch/riscv32.rs:210)
>>> defined in cargo/riscv32ima-unknown-none-elf/debug/libruntime.a(runtime-f1ba3fd692fe1184.runtime.6ydbbw3q-cgu.0.rcgu.o)
make[1]: *** [Makefile:24: runtime.elf] Error 1

That is, some code gets shifted in such a way that the linker can no longer reach the call target with a JAL (relative jump) instruction.
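As a quick sanity check on the numbers in the log above: the range quoted by the linker, [-524288, 524287], is exactly the range of a 20-bit signed integer. The sketch below (the helper name `fits_signed` is made up for illustration) confirms that all three reported values fall outside it. It only checks the arithmetic in the error messages and makes no claim about how ld.lld computes them.

```rust
/// Returns true if `value` fits in a two's-complement signed integer
/// that is `bits` wide. Hypothetical helper, not part of ARTIQ or LLD.
fn fits_signed(value: i64, bits: u32) -> bool {
    let min = -(1i64 << (bits - 1));
    let max = (1i64 << (bits - 1)) - 1;
    (min..=max).contains(&value)
}

fn main() {
    // Values copied from the three ld.lld errors in this issue.
    let reported: [i64; 3] = [-722220, -692862, -691490];
    for value in reported {
        println!("{}: fits in 20 bits = {}", value, fits_signed(value, 20));
    }
}
```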

Satman firmware builds just fine.

Experimentally, reducing the max packet size to 1024-310 gets it to compile fine again.

Spaqin commented 2 months ago

Updating LLVM to try to resolve the issue is not possible, because llvm-lite currently supports only up to LLVM 14. That is a bit of a shame: it seems that branch relaxation (replacing branch instructions whose targets are out of range) was fixed in 2023-02. This suggests the problem could also have been fixed by changing the linker (the discussion points out that binutils has branch relaxation too), but it does not solve the mystery of why the code grows so much after such a change in the enum.

However, I tried the Rust nightly from 2021-09-01 (picked fairly randomly; I also tried one from 05-01, which did not work), and that seems to alleviate the issue for now. Of course, it also brings Rust language feature changes, so the code does not compile as-is and requires a few changes. The chosen nightly date is somewhat arbitrary, but I'm not going any newer because the number of required changes is still manageable and there are fewer potential breaking points.

Even on the newer nightly, though, the largest max packet size that does not break the LLVM linker is about 1.3K. If we wanted to increase it further in the future, we would have to look into this more closely, or implement the aux packets in a different way, maybe by boxing the payloads.
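To illustrate the boxing idea, here is a minimal host-side sketch. The enums `InlinePacket` and `BoxedPacket` are made up for this example and are not the actual ARTIQ packet definitions; only the constant name is borrowed from this issue. It shows that a Rust enum is at least as large as its largest variant, so an inline payload array makes every packet value big, while a boxed payload keeps the enum at pointer size. Whether this would avoid the link error is not established here, and boxing trades stack space for heap allocations.

```rust
use std::mem::size_of;

// Name borrowed from the issue; the value is the proposed new maximum.
const SAT_PAYLOAD_MAX_SIZE: usize = 1024;

// Hypothetical packet with an inline payload: every value of the enum,
// even `Ack`, occupies at least SAT_PAYLOAD_MAX_SIZE bytes.
#[allow(dead_code)]
enum InlinePacket {
    Ack,
    Data { length: u16, payload: [u8; SAT_PAYLOAD_MAX_SIZE] },
}

// Hypothetical packet with a boxed payload: the enum only stores a
// pointer and a length; the bytes themselves live on the heap.
#[allow(dead_code)]
enum BoxedPacket {
    Ack,
    Data { payload: Box<[u8]> },
}

fn main() {
    println!("inline: {} bytes", size_of::<InlinePacket>());
    println!("boxed:  {} bytes", size_of::<BoxedPacket>());
}
```

The firmware is a `no_std` target, so a real change would use `alloc::boxed::Box` rather than the `std` version used in this host-side sketch.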

Shall we (try to) migrate for now then?

thomasfire commented 2 months ago

> llvm-lite currently supporting only up to LLVM 14

Probably you can try newer versions, they are not that precise in their docs about actually supported versions

sbourdeauducq commented 2 months ago

> Shall we (try to) migrate for now then?

Yes.

> Probably you can try newer versions, they are not that precise in their docs about actually supported versions

LLVM is an overrated hotmess generally.

In version 15 they made a major change in the API where you need to specify the type of each pointer at each dereference, so it is not surprising that they are stuck at version 14.

Spaqin commented 2 months ago

Making these changes on Zynq was quite simple and required minimal changes; the ARM architecture may not be affected by the short jump distance, or the linker for ARM may apply branch relaxation...

In general, I got the code to compile with the Rust 2021-09-01 nightly, but changes around unwind(allowed) and llvm_asm were necessary.

However, there seems to be some memory corruption, getting e.g. IllegalInstruction:

```
Trap frame: TrapFrame { ra: 4001119c, t0: feedfeed, t1: deaddead, t2: 42, t3: 4007b6c8, t4: 401cd234, t5: 401d1a58, t6: 401c68f0, a0: 40079b7c, a1: 20, a2: 4009f228, a3: 2, a4: 20, a5: feedfeed, a6: 60, a7: 54 }
@ 0x000020
+0000: bb000000 44002211 ffffffff ffffffff
+0010: 665599aa 00000020 01e00330 3b010000
+0020: 01800030 12000000 00000020 01200230
+0030: 00000000 01000230 00000000 01800030
panic at runtime/main.rs:335:13: exception IllegalInstruction at PC 0x20, trap value 0xbb000000
backtrace for software version 8.0+unknown.beta;tst-master:
0x4003ef18
0x400128e4
0x4003e490
halting.
use `artiq_coremgmt config write -s panic_reset 1` to restart instead
```

(stack trace seems quite useless, only pointing to proto_artiq, if anything) or RPC claiming to receive something that shouldn't be sent (RPC code has not been touched, mind you):

```
panic at libproto_artiq/rpc_proto.rs:530:25: internal error: entered unreachable code
backtrace for software version 8.0+unknown.beta;tst-master:
0x400410b8: /home/spaqin/m-labs/artiq/artiq/firmware/libunwind_backtrace/lib.rs:42
0x400122dc: runtime.3e687471-cgu.0:?
0x400122a4: runtime.3e687471-cgu.0:?
0x40017c64: /home/spaqin/m-labs/artiq/artiq/firmware/libproto_artiq/rpc_proto.rs:530
0x40049080: /home/spaqin/m-labs/artiq/artiq/firmware/libproto_artiq/rpc_proto.rs:358
0x4002a96c: /home/spaqin/m-labs/artiq/artiq/firmware/runtime/session.rs:8229
0x400140d8: /home/spaqin/.cargo/git/checkouts/libfringe-2757742fe32879a8/3ecbe53/src/arch/riscv32.rs:91
0x400140c8: /home/spaqin/.cargo/git/checkouts/libfringe-2757742fe32879a8/3ecbe53/src/arch/riscv32.rs:56
0x4003b7c8: /home/spaqin/.cargo/git/checkouts/libfringe-2757742fe32879a8/3ecbe53/src/arch/riscv32.rs:210
0x4000cf98: /home/spaqin/m-labs/artiq/artiq/firmware/libboard_misoc/riscv32/boot.rs:22
0x400220e0: /home/spaqin/m-labs/artiq/artiq/firmware/runtime/main.rs:269
0x40022088: /home/spaqin/m-labs/artiq/artiq/firmware/liblogger_artiq/lib.rs:67
0x40040264: /home/spaqin/m-labs/artiq/artiq/firmware/runtime/main.rs:268
```

or OOM getting triggered, or other random weird panics, when a kernel is loaded.

The satellite firmware compiles but will not run subkernels; a LoadRequest ends with:

[    22.522823s]  INFO(satman::kernel): unexpected kernel CPU reply to load request: LoadRequest([])

That suggests that the communication between the kernel and comm CPUs fails, due to either memory corruption or some address mismatch.

As a side note, I noticed that sometimes when I recompile, I get an error about the alloc or core crate being overwritten with a different version. A clean compile is fine, but the next attempt may throw such an error.

```
error[E0460]: found possibly newer version of crate `core` which `byteorder` depends on
 --> libboard_misoc/lib.rs:4:1
  |
4 | extern crate byteorder;
  | ^^^^^^^^^^^^^^^^^^^^^^^
  |
  = note: perhaps that crate needs to be recompiled?
  = note: the following crate versions were found:
          crate `core`: /home/spaqin/m-labs/artiq/artiq_kasli/tst-sat-2.0/software/sysroot/lib/rustlib/riscv32ima-unknown-none-elf/lib/libcore-dfc8d87b7187c304.rmeta
          crate `core`: /home/spaqin/m-labs/artiq/artiq_kasli/tst-sat-2.0/software/sysroot/lib/rustlib/riscv32ima-unknown-none-elf/lib/libcore-dfc8d87b7187c304.rlib
          crate `byteorder`: /home/spaqin/m-labs/artiq/artiq_kasli/tst-sat-2.0/software/bootloader/cargo/riscv32ima-unknown-none-elf/debug/deps/libbyteorder-e51c1622f5206fb6.rmeta

error: could not compile `board_misoc` due to previous error

[...]
# another compilation
[...]

error[E0460]: found possibly newer version of crate `alloc` which `board_artiq` depends on
 --> satman/main.rs:8:1
  |
8 | extern crate board_artiq;
  | ^^^^^^^^^^^^^^^^^^^^^^^^^
  |
  = note: perhaps that crate needs to be recompiled?
  = note: the following crate versions were found:
          crate `alloc`: /home/spaqin/m-labs/artiq/artiq_kasli/tst-sat-2.0/software/sysroot/lib/rustlib/riscv32ima-unknown-none-elf/lib/liballoc-9ff73708ac3930d2.rmeta
          crate `alloc`: /home/spaqin/m-labs/artiq/artiq_kasli/tst-sat-2.0/software/sysroot/lib/rustlib/riscv32ima-unknown-none-elf/lib/liballoc-9ff73708ac3930d2.rlib
          crate `board_artiq`: /home/spaqin/m-labs/artiq/artiq_kasli/tst-sat-2.0/software/satman/cargo/riscv32ima-unknown-none-elf/debug/deps/libboard_artiq-a49deaca0939ecd6.rlib

error: could not compile `satman` due to previous error
```

I do wonder if it's related: could a "wrong" alloc or core be used, causing the corruption? I still don't know why there would be different versions.

Spaqin commented 2 months ago

Dropping xbuild and using cargo build -Z build-std=... helps: there is no more random version mismatching or memory corruption. On top of that, recompiling takes 10-20s rather than the 50s it takes with xbuild. I also found a few dependencies that have different versions for different packages (just with cargo tree -d); those could also be cut down, but that is not part of this issue.

With xbuild dropped, Rust nightlies up to 2021-03-04 compile fine just by switching the version in the manifest. That is great, but it does not solve the underlying issue of generating code that works with a bigger packet size in a simple manner.

However, Rust nightly of 2021-03-05 introduces some changes that break the communication between comm and kernel CPUs, causing running experiments to panic or fail with:

[     7.598605s] ERROR(runtime::session): session aborted: unexpected request LoadRequest([112, 117, 108, 115, 101, 115]) from kernel CPU
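As a small aid for reading the log line above, the bytes in the unexpected LoadRequest can be decoded as text. The sketch below (the helper name `payload_as_text` is made up for illustration) shows that they are the ASCII string "pulses". What that string refers to in the running experiment is not established in this thread; the decoding only shows that the payload looks like a short name rather than a kernel binary.

```rust
/// Interprets a byte slice as UTF-8 text, replacing invalid sequences.
/// Hypothetical helper for reading log output, not part of ARTIQ.
fn payload_as_text(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // Payload copied from the "unexpected request LoadRequest(...)" log line.
    let payload = [112u8, 117, 108, 115, 101, 115];
    println!("{}", payload_as_text(&payload)); // prints "pulses"
}
```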

I believe the most suspicious change is the switch from LLVM 11 to LLVM 12: https://github.com/rust-lang/rust/commit/409920873cf8a95739a55dc5fe5adb05e1b4758e

A reported bug that broke embedded targets seemed relevant, but despite the fixes for it, running an experiment still causes panics on later versions of Rust.

I believe the signs point not to memory corruption in general, but to corruption specifically on the channel between the comm and kernel CPUs. In my case, it crashes when a debug! would print out the contents of the incoming packet.

I can't find any particularly hacky code that might have relied on compiler behavior, but maybe I'm not looking hard enough. I will look deeper. It would be handy to be able to debug the kernel core better.