m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

DRTIO aux: use full size of aux packet #2404

Closed Spaqin closed 1 month ago

Spaqin commented 1 month ago

ARTIQ Pull Request

Description of Changes

I had previously assumed by mistake that the maximum DRTIO aux packet size was 512 bytes, when in reality the gateware limit is 1 kB. This matters for the large packets sent for subkernels and DDMA - using the full size significantly increases the available bandwidth.
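For context, the core of the fix is essentially a constant bump in the firmware; the sketch below is illustrative only, and the constant name and location are assumptions rather than the actual ARTIQ source.

```rust
// Illustrative sketch only - the real constant name/location in the ARTIQ
// firmware may differ. The change amounts to sizing the aux packet buffer
// to the full gateware limit instead of half of it.
// pub const MAX_PACKET: usize = 512;  // old, mistaken limit
pub const MAX_PACKET: usize = 1024;    // full gateware limit (1 kB)
```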

However, while it may seem like a one-line change (maybe two or three with the necessary thread stack increases), that simple change hit a limit of the linker, which on RISC-V cannot do branch relaxation (see the related issue for more information).

As replacing the linker turned out to be more difficult than I thought, I found instead that a newer Rust nightly produces code that still fits within the branching limits. That in turn required changing the code a bit:

- CPU dcache must be flushed now after sending data between kernel and comm CPU

It also requires a MiSoC update for the llvm_asm and xbuild -> build changes.

To make sure that the new Rust toolchain brought no further bugs, I ran the kc705 HITL tests successfully, and tested DDMA and subkernels with two Kasli 2.0 boards.

Related Issue

Closes #2401

Type of Changes

| Type                 |
| -------------------- |
| :bug: Bug fix        |
| :hammer: Refactoring |


Licensing

See copyright & licensing for more info. ARTIQ files that do not contain a license header are copyrighted by M-Labs Limited and are licensed under LGPLv3+.

sbourdeauducq commented 1 month ago

> CPU dcache must be flushed now after sending data between kernel and comm CPU

What is the performance impact of doing this?

Spaqin commented 1 month ago

> What is the performance impact of doing this?

It doesn't seem to be significant - I assume that's because the data had to be moved from the cache into RAM anyway, and for some reason that stopped happening with the new toolchain. Let's compare a few tests that rely on communication:

Last main-beta:

test_kernel_overhead (...) ... 0.09693022783991183 s
test_async_throughput (...)  ... Async throughput:   2.85MiB/s
| Test                 | Mean (MiB/s) |  std (MiB/s) |
| -------------------- | ------------ | ------------ |
| I32 Array (1MB) H2D  |         2.60 |         0.00 |
| I32 Array (1MB) D2H  |         2.54 |         0.03 |
| I32 Array (1KB) H2D  |         0.74 |         0.00 |
| I32 Array (1KB) D2H  |         0.85 |         0.00 |
| Bytes List (1MB) H2D |         2.58 |         0.00 |
| Bytes List (1MB) D2H |         2.51 |         0.03 |
| Bytes List (1KB) H2D |         0.72 |         0.00 |
| Bytes List (1KB) D2H |         0.83 |         0.00 |
| Bytes (1MB) H2D      |         2.61 |         0.00 |
| Bytes (1MB) D2H      |         2.55 |         0.03 |
| Bytes (1KB) H2D      |         0.73 |         0.00 |
| Bytes (1KB) D2H      |         0.84 |         0.00 |
| I32 List (1MB) H2D   |         2.59 |         0.00 |
| I32 List (1MB) D2H   |         2.54 |         0.03 |
| I32 List (1KB) H2D   |         0.72 |         0.00 |
| I32 List (1KB) D2H   |         0.84 |         0.00 |
test_dma_playback_time (...) ... dt=0.080111216, dt/count=4.0055608e-06
test_dma_record_time (...) ... dt=0.11614924800000001, dt/count=5.807462400000001e-06
test_rpc_timing (...) ... 0.0008762152000000001

and HITL tests I ran for this code:

test_kernel_overhead (...) ... 0.10849368147999484 s
test_async_throughput (...) ... Async throughput:   3.09MiB/s
| Test                 | Mean (MiB/s) |  std (MiB/s) |
| -------------------- | ------------ | ------------ |
| I32 Array (1MB) H2D  |         2.88 |         0.00 |
| I32 Array (1MB) D2H  |         2.94 |         0.16 |
| I32 Array (1KB) H2D  |         0.71 |         0.00 |
| I32 Array (1KB) D2H  |         0.80 |         0.00 |
| Bytes List (1MB) H2D |         2.83 |         0.01 |
| Bytes List (1MB) D2H |         2.84 |         0.09 |
| Bytes List (1KB) H2D |         0.71 |         0.00 |
| Bytes List (1KB) D2H |         0.81 |         0.00 |
| Bytes (1MB) H2D      |         2.88 |         0.00 |
| Bytes (1MB) D2H      |         2.94 |         0.16 |
| Bytes (1KB) H2D      |         0.72 |         0.03 |
| Bytes (1KB) D2H      |         0.84 |         0.05 |
| I32 List (1MB) H2D   |         2.85 |         0.00 |
| I32 List (1MB) D2H   |         2.92 |         0.01 |
| I32 List (1KB) H2D   |         0.71 |         0.03 |
| I32 List (1KB) D2H   |         0.82 |         0.04 |
test_dma_playback_time (...) ... dt=0.080110424, dt/count=4.0055212e-06
test_dma_record_time (...) ... dt=0.11445123200000001, dt/count=5.7225616000000005e-06
test_rpc_timing (...) ... 0.0009223624000000005

Even if I remove the flush from acknowledge() (without which the test suite may crash, though with a lower probability), the benchmark numbers do not change.
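For reference, a minimal sketch of the flush-after-send pattern being discussed - all names below (Message, SHARED, flush_dcache, mailbox_notify) are placeholders, not the actual ARTIQ firmware API:

```rust
// Placeholder sketch, not the actual ARTIQ code: one CPU writes a message
// into RAM shared with the other CPU, flushes its data cache so the write
// is actually visible there, then signals through the mailbox.

#[repr(C)]
struct Message {
    kind: u32,
    len: u32,
}

const SHARED: *mut Message = 0x4000_0000 as *mut Message; // made-up address

fn flush_dcache() { /* platform-specific cache maintenance would go here */ }
fn mailbox_notify(addr: usize) { let _ = addr; /* ring the other CPU */ }

fn send_to_comm(msg: Message) {
    unsafe {
        core::ptr::write_volatile(SHARED, msg); // place the message in shared RAM
    }
    flush_dcache();                      // ensure it leaves this CPU's dcache
    mailbox_notify(SHARED as usize);     // tell the other CPU where to look
}
```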

Spaqin commented 1 month ago

I had to update Cargo.lock for Nix to work with the Rust toolchain update. I can only warn anyone who has to update it further: it will be painful.

It turns out that invalidating the cache is not necessary in itself. I replaced that part with an empty block: `asm!("", options(preserves_flags, readonly, nostack));`. I can only assume that the new LLVM 12 compiler took some liberties with the ordering of writes and execution, unaware that the memory is shared with another CPU; this asm block acts as a fence.
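In condensed form, the replacement looks roughly like this (the wrapper function and its name are illustrative; only the empty asm! statement itself is from the actual change):

```rust
use core::arch::asm;

// Illustrative wrapper around the empty asm block. No instructions are
// emitted, but since the block is declared as potentially reading memory
// (readonly, rather than nomem), the compiler must commit pending stores
// before it and cannot freely reorder memory accesses across it - in
// effect, a compiler-level fence.
#[inline(always)]
fn barrier() {
    unsafe {
        asm!("", options(preserves_flags, readonly, nostack));
    }
}
```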

Additionally, I cut down on mailbox::receive() calls (on the comm CPU, the cache would get invalidated three times per message received, unnecessarily). Still, looking at the HITL tests, performance doesn't seem to be affected much compared to the previous version with cache invalidation; most numbers haven't changed.
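Roughly, the comm-side change amounts to polling once and reusing the result, as in the placeholder sketch below (module layout and signatures are assumptions, not the actual firmware code):

```rust
// Placeholder sketch, not the actual comm-CPU code: poll the mailbox once
// per loop iteration and reuse the result, instead of calling receive()
// (which invalidates the dcache) several times for the same message.

mod mailbox {
    pub fn receive() -> usize {
        // In the real firmware this reads the shared mailbox word and
        // invalidates the dcache so the message contents are re-fetched.
        0
    }
    pub fn acknowledge() {}
}

fn process(msg: usize) { let _ = msg; }

fn comm_loop_iteration() {
    let msg = mailbox::receive(); // single receive -> single cache invalidation
    if msg != 0 {
        process(msg);             // work with the already-fetched message
        mailbox::acknowledge();
    }
}
```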