Spaqin closed this pull request 1 month ago
CPU dcache must be flushed now after sending data between kernel and comm CPU
What is the performance impact of doing this?
Doesn't seem to be significant - I assume that's because the data had to be moved from the cache into RAM anyway, but for some reason that stopped happening with the new toolchain. Let's compare a few tests that rely on communication:
Last main-beta:
test_kernel_overhead (...) ... 0.09693022783991183 s
test_async_throughput (...) ... Async throughput: 2.85MiB/s
| Test | Mean (MiB/s) | std (MiB/s) |
| -------------------- | ------------ | ------------ |
| I32 Array (1MB) H2D | 2.60 | 0.00 |
| I32 Array (1MB) D2H | 2.54 | 0.03 |
| I32 Array (1KB) H2D | 0.74 | 0.00 |
| I32 Array (1KB) D2H | 0.85 | 0.00 |
| Bytes List (1MB) H2D | 2.58 | 0.00 |
| Bytes List (1MB) D2H | 2.51 | 0.03 |
| Bytes List (1KB) H2D | 0.72 | 0.00 |
| Bytes List (1KB) D2H | 0.83 | 0.00 |
| Bytes (1MB) H2D | 2.61 | 0.00 |
| Bytes (1MB) D2H | 2.55 | 0.03 |
| Bytes (1KB) H2D | 0.73 | 0.00 |
| Bytes (1KB) D2H | 0.84 | 0.00 |
| I32 List (1MB) H2D | 2.59 | 0.00 |
| I32 List (1MB) D2H | 2.54 | 0.03 |
| I32 List (1KB) H2D | 0.72 | 0.00 |
| I32 List (1KB) D2H | 0.84 | 0.00 |
test_dma_playback_time (...) ... dt=0.080111216, dt/count=4.0055608e-06
test_dma_record_time (...) ... dt=0.11614924800000001, dt/count=5.807462400000001e-06
test_rpc_timing (...) ... 0.0008762152000000001
and HITL tests I ran for this code:
test_kernel_overhead (...) ... 0.10849368147999484 s
test_async_throughput (...) ... Async throughput: 3.09MiB/s
| Test | Mean (MiB/s) | std (MiB/s) |
| -------------------- | ------------ | ------------ |
| I32 Array (1MB) H2D | 2.88 | 0.00 |
| I32 Array (1MB) D2H | 2.94 | 0.16 |
| I32 Array (1KB) H2D | 0.71 | 0.00 |
| I32 Array (1KB) D2H | 0.80 | 0.00 |
| Bytes List (1MB) H2D | 2.83 | 0.01 |
| Bytes List (1MB) D2H | 2.84 | 0.09 |
| Bytes List (1KB) H2D | 0.71 | 0.00 |
| Bytes List (1KB) D2H | 0.81 | 0.00 |
| Bytes (1MB) H2D | 2.88 | 0.00 |
| Bytes (1MB) D2H | 2.94 | 0.16 |
| Bytes (1KB) H2D | 0.72 | 0.03 |
| Bytes (1KB) D2H | 0.84 | 0.05 |
| I32 List (1MB) H2D | 2.85 | 0.00 |
| I32 List (1MB) D2H | 2.92 | 0.01 |
| I32 List (1KB) H2D | 0.71 | 0.03 |
| I32 List (1KB) D2H | 0.82 | 0.04 |
test_dma_playback_time (...) ... dt=0.080110424, dt/count=4.0055212e-06
test_dma_record_time (...) ... dt=0.11445123200000001, dt/count=5.7225616000000005e-06
test_rpc_timing (...) ... 0.0009223624000000005
Even if I remove the flush from `acknowledge()` (without it the test suite may crash, though with a lower chance), the benchmark numbers do not change.
Had to update Cargo.lock for the Nix build to work with the Rust toolchain update. I can only warn anyone who will have to update it further: it will be painful.
Turns out that invalidating the cache is not necessary in itself. I replaced that part with an empty block: `asm!("", options(preserves_flags, readonly, nostack));`. I can only assume that the new LLVM 12-based compiler took some liberties with the ordering of writes and execution, unaware that the memory is shared with another CPU; this empty asm block acts as a compiler fence.
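The empty `asm!` block works because it is opaque to the optimizer, so memory accesses cannot be reordered across it. A minimal, illustrative sketch of the same idea using the standard library's `compiler_fence` (which, like the empty asm block, emits no machine instructions); the names here are hypothetical, not ARTIQ code:

```rust
use std::sync::atomic::{compiler_fence, Ordering};

// Write a value through a raw pointer, fence, then read it back.
// compiler_fence is the zero-cost analogue of the empty asm! block:
// it emits no instructions, but forbids the compiler from reordering
// or eliding memory accesses across it - important when another CPU
// (or DMA engine) observes the same memory.
fn roundtrip(x: u32) -> u32 {
    let mut slot: u32 = 0;
    let p = &mut slot as *mut u32;
    unsafe { p.write_volatile(x) };
    compiler_fence(Ordering::SeqCst);
    unsafe { p.read_volatile() }
}

fn main() {
    assert_eq!(roundtrip(42), 42);
    println!("fenced write observed: {}", roundtrip(7));
}
```

Note that `compiler_fence` only constrains the compiler, not the hardware; on a system where caches are not coherent between the two CPUs, an explicit flush or invalidate may still be needed, as discussed above.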
Additionally, I cut down on `mailbox::receive()` calls (on the comm CPU, the cache would get invalidated three times per received message, unnecessarily). Still, looking at the HITL tests, performance does not seem much affected compared to the previous version with cache invalidation; most numbers have barely changed.
ARTIQ Pull Request
Description of Changes
By mistake, I had previously assumed that the maximum DRTIO aux packet size was 512 bytes, when in reality the gateware limit is 1 KB. This is important for the large packets sent for subkernels and DDMA, and increases bandwidth considerably.
However, while it may seem like a one-line change (maybe two or three with the necessary thread stack increases), that simple change hit a limit of the linker, which on RISC-V cannot do branch relaxation (see the related issue for more information).
As replacing the linker was more difficult than I expected, I found instead that a newer Rust nightly produces code that still fits within the branching limits. That in turn required changing the code a bit:
- `xbuild` was removed in favor of `cargo build -Z build-std=...`, as xbuild's core crates would conflict and cause memory issues. That also had the side effect of shortening recompilation time: a software-only rebuild used to take 1 minute on my computer and now takes less than 20 seconds.
- `llvm_asm!` was replaced with `asm!`, which uses a different syntax.
- The `#[unwind(allowed)]` feature was replaced with `extern "C-unwind"`.
- `NoneError` was removed, so the satman subkernel code was slightly refactored.

This requires a MiSoC update for the `llvm_asm` and xbuild -> `cargo build` changes.
To make sure that the new Rust toolchain did not introduce further bugs, I ran the kc705 HITL tests successfully, and tested DDMA and subkernels with two Kasli 2.0 boards.
Related Issue
Closes #2401