m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

DRTIO aux: use full size of aux packet #2404

Closed Spaqin closed 1 month ago

Spaqin commented 1 month ago

ARTIQ Pull Request

Description of Changes

I had previously assumed by mistake that the maximum DRTIO aux packet size was 512 bytes, when in reality the gateware limit is 1 kB. This matters for the large packets sent for subkernels and DDMA - using the full size significantly increases the available bandwidth.
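For context, the core of the fix is essentially a constant bump in the firmware; the sketch below is illustrative only, and the constant name and location are assumptions rather than the actual ARTIQ source.

```rust
// Illustrative sketch only - the real constant name/location in the ARTIQ
// firmware may differ. The change amounts to sizing the aux packet buffer
// to the full gateware limit instead of half of it.
// pub const MAX_PACKET: usize = 512;  // old, mistaken limit
pub const MAX_PACKET: usize = 1024;    // full gateware limit (1 kB)
```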

However, while it may seem like a one-line change (maybe two or three with the necessary thread stack increases), that simple change hit a limit of the linker, which on RISC-V cannot do branch relaxation (see the related issue for more information).

As replacing the linker turned out to be more difficult than I thought, I found instead that a newer Rust nightly produces code that still fits within the branching limits. That in turn required changing the code a bit:

- CPU dcache must be flushed now after sending data between kernel and comm CPU

It also requires a MiSoC update for the llvm_asm and xbuild -> build changes.

To make sure that the new Rust toolchain brought no further bugs, I ran the kc705 HITL tests successfully, and tested DDMA and subkernels with two Kasli 2.0 boards.

Related Issue

Closes #2401

Type of Changes

| Type                 |
| -------------------- |
| :bug: Bug fix        |
| :hammer: Refactoring |


Licensing

See copyright & licensing for more info. ARTIQ files that do not contain a license header are copyrighted by M-Labs Limited and are licensed under LGPLv3+.

sbourdeauducq commented 1 month ago

> CPU dcache must be flushed now after sending data between kernel and comm CPU

What is the performance impact of doing this?

Spaqin commented 1 month ago

> What is the performance impact of doing this?

It doesn't seem to be significant - I assume that's because the data had to be moved from the cache into RAM anyway, and for some reason that stopped happening with the new toolchain. Let's compare a few tests that rely on communication:

Last main-beta:

test_kernel_overhead (...) ... 0.09693022783991183 s
test_async_throughput (...)  ... Async throughput:   2.85MiB/s
| Test                 | Mean (MiB/s) |  std (MiB/s) |
| -------------------- | ------------ | ------------ |
| I32 Array (1MB) H2D  |         2.60 |         0.00 |
| I32 Array (1MB) D2H  |         2.54 |         0.03 |
| I32 Array (1KB) H2D  |         0.74 |         0.00 |
| I32 Array (1KB) D2H  |         0.85 |         0.00 |
| Bytes List (1MB) H2D |         2.58 |         0.00 |
| Bytes List (1MB) D2H |         2.51 |         0.03 |
| Bytes List (1KB) H2D |         0.72 |         0.00 |
| Bytes List (1KB) D2H |         0.83 |         0.00 |
| Bytes (1MB) H2D      |         2.61 |         0.00 |
| Bytes (1MB) D2H      |         2.55 |         0.03 |
| Bytes (1KB) H2D      |         0.73 |         0.00 |
| Bytes (1KB) D2H      |         0.84 |         0.00 |
| I32 List (1MB) H2D   |         2.59 |         0.00 |
| I32 List (1MB) D2H   |         2.54 |         0.03 |
| I32 List (1KB) H2D   |         0.72 |         0.00 |
| I32 List (1KB) D2H   |         0.84 |         0.00 |
test_dma_playback_time (...) ... dt=0.080111216, dt/count=4.0055608e-06
test_dma_record_time (...) ... dt=0.11614924800000001, dt/count=5.807462400000001e-06
test_rpc_timing (...) ... 0.0008762152000000001

and HITL tests I ran for this code:

test_kernel_overhead (...) ... 0.10849368147999484 s
test_async_throughput (...) ... Async throughput:   3.09MiB/s
| Test                 | Mean (MiB/s) |  std (MiB/s) |
| -------------------- | ------------ | ------------ |
| I32 Array (1MB) H2D  |         2.88 |         0.00 |
| I32 Array (1MB) D2H  |         2.94 |         0.16 |
| I32 Array (1KB) H2D  |         0.71 |         0.00 |
| I32 Array (1KB) D2H  |         0.80 |         0.00 |
| Bytes List (1MB) H2D |         2.83 |         0.01 |
| Bytes List (1MB) D2H |         2.84 |         0.09 |
| Bytes List (1KB) H2D |         0.71 |         0.00 |
| Bytes List (1KB) D2H |         0.81 |         0.00 |
| Bytes (1MB) H2D      |         2.88 |         0.00 |
| Bytes (1MB) D2H      |         2.94 |         0.16 |
| Bytes (1KB) H2D      |         0.72 |         0.03 |
| Bytes (1KB) D2H      |         0.84 |         0.05 |
| I32 List (1MB) H2D   |         2.85 |         0.00 |
| I32 List (1MB) D2H   |         2.92 |         0.01 |
| I32 List (1KB) H2D   |         0.71 |         0.03 |
| I32 List (1KB) D2H   |         0.82 |         0.04 |
test_dma_playback_time (...) ... dt=0.080110424, dt/count=4.0055212e-06
test_dma_record_time (...) ... dt=0.11445123200000001, dt/count=5.7225616000000005e-06
test_rpc_timing (...) ... 0.0009223624000000005

Even if I remove the flush from acknowledge() (without which the test suite may crash, though with a lower probability), the benchmark numbers do not change.
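For reference, a minimal sketch of the flush-after-send pattern being discussed - all names below (Message, SHARED, flush_dcache, mailbox_notify) are placeholders, not the actual ARTIQ firmware API:

```rust
// Placeholder sketch, not the actual ARTIQ code: one CPU writes a message
// into RAM shared with the other CPU, flushes its data cache so the write
// is actually visible there, then signals through the mailbox.

#[repr(C)]
struct Message {
    kind: u32,
    len: u32,
}

const SHARED: *mut Message = 0x4000_0000 as *mut Message; // made-up address

fn flush_dcache() { /* platform-specific cache maintenance would go here */ }
fn mailbox_notify(addr: usize) { let _ = addr; /* ring the other CPU */ }

fn send_to_comm(msg: Message) {
    unsafe {
        core::ptr::write_volatile(SHARED, msg); // place the message in shared RAM
    }
    flush_dcache();                      // ensure it leaves this CPU's dcache
    mailbox_notify(SHARED as usize);     // tell the other CPU where to look
}
```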

Spaqin commented 1 month ago

I had to update Cargo.lock for Nix to work with the Rust toolchain update. I can only warn anyone who has to update it further: it will be painful.

It turns out that invalidating the cache is not necessary in itself. I replaced that part with an empty block: `asm!("", options(preserves_flags, readonly, nostack));`. I can only assume that the new LLVM 12 compiler took some liberties with the ordering of writes and execution, unaware that the memory is shared with another CPU; this asm block acts as a fence.
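In condensed form, the replacement looks roughly like this (the wrapper function and its name are illustrative; only the empty asm! statement itself is from the actual change):

```rust
use core::arch::asm;

// Illustrative wrapper around the empty asm block. No instructions are
// emitted, but since the block is declared as potentially reading memory
// (readonly, rather than nomem), the compiler must commit pending stores
// before it and cannot freely reorder memory accesses across it - in
// effect, a compiler-level fence.
#[inline(always)]
fn barrier() {
    unsafe {
        asm!("", options(preserves_flags, readonly, nostack));
    }
}
```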

Additionally, I cut down on mailbox::receive() calls (on the comm CPU, the cache would get invalidated three times per message received, unnecessarily). Still, looking at the HITL tests, performance doesn't seem to be affected much compared to the previous version with cache invalidation; most numbers haven't changed.
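Roughly, the comm-side change amounts to polling once and reusing the result, as in the placeholder sketch below (module layout and signatures are assumptions, not the actual firmware code):

```rust
// Placeholder sketch, not the actual comm-CPU code: poll the mailbox once
// per loop iteration and reuse the result, instead of calling receive()
// (which invalidates the dcache) several times for the same message.

mod mailbox {
    pub fn receive() -> usize {
        // In the real firmware this reads the shared mailbox word and
        // invalidates the dcache so the message contents are re-fetched.
        0
    }
    pub fn acknowledge() {}
}

fn process(msg: usize) { let _ = msg; }

fn comm_loop_iteration() {
    let msg = mailbox::receive(); // single receive -> single cache invalidation
    if msg != 0 {
        process(msg);             // work with the already-fetched message
        mailbox::acknowledge();
    }
}
```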