m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

Increase in synchronous RPC times from ~3ms (ARTIQ v7) to ~40ms (ARTIQ v8) #2601

Closed by clayton-ho 1 week ago

clayton-ho commented 1 month ago

Bug Report

One-Line Summary

Updated our system from ARTIQ v7 to ARTIQ v8 (i.e. flashed the Kasli firmware/gateware and upgraded the host software) and found that synchronous RPC times increased from a very consistent ~3 ms on ARTIQ v7 to a highly variable ~45 ms on ARTIQ v8.

Issue Details

Steps to Reproduce

We noticed it while running experiment code, but it can be reproduced by profiling synchronous RPC times, e.g. with the following snippet.

```python
@kernel
def run(self):
    self.core.reset()

    for i in range(self.repetitions):
        self.core.break_realtime()

        # simulate pulse sequence
        self.urukul2_ch1.sw.on()
        delay_mu(self.time_delay_mu)
        self.urukul2_ch1.sw.off()

        # RPC call
        time_start_mu = self.core.get_rtio_counter_mu()
        self.rpc_call()
        time_stop_mu = self.core.get_rtio_counter_mu()
        self.update_results(time_stop_mu - time_start_mu)

@rpc
def rpc_call(self) -> TNone:
    return
```
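For reference, the delta recorded by `update_results` is in RTIO machine units, which `core.mu_to_seconds` converts to seconds on the host. A minimal standalone sketch of that conversion, assuming the typical 1 ns machine-unit granularity on Kasli (the helper name `mu_to_ms` is hypothetical, not ARTIQ API):

```python
# Sketch (plain Python, not ARTIQ code): converting an RTIO counter
# difference to milliseconds, mirroring what core.mu_to_seconds() does.
# Assumes a 1 ns reference period per machine unit, typical for Kasli.
def mu_to_ms(delta_mu: int, ref_period_ns: float = 1.0) -> float:
    """Convert an RTIO machine-unit delta to milliseconds."""
    return delta_mu * ref_period_ns / 1e6

# A ~40 ms stall shows up as a delta of roughly 40_000_000 mu:
print(mu_to_ms(40_000_000))  # 40.0
```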

Expected Behavior

Ideally, the delays would be fairly close to ~3ms, or at least <10ms, since a typical pulse sequence is ~15ms for us.

Actual (undesired) Behavior

Profiled the hardware time overheads on both our ARTIQ v7 and ARTIQ v8 systems:

[figure: measured RPC time overheads, ARTIQ v7 vs. ARTIQ v8]

Tried a few things to localize the problem:

Your System (omit irrelevant parts)

dnadlinger commented 1 month ago

It might be worth looking at a Wireshark capture to see if there is anything strange happening/to get another layer of timestamps to correlate. I wonder whether this is related to the network stack (e.g. Nagle/delayed ACK)…
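One quick way to use such a capture is to dump packet timestamps and scan the inter-packet gaps: stalls clustered near ~40 ms are suspicious because that matches the classic delayed-ACK timer on Linux. A minimal sketch with made-up timestamps (the values below are illustrative, not from the real capture):

```python
# Sketch: scanning inter-packet gaps for delayed-ACK-sized stalls.
# Timestamps are hypothetical, in seconds since start of capture.
timestamps_s = [0.000, 0.003, 0.043, 0.083, 0.086]

# Gap between each consecutive pair of packets, in milliseconds.
gaps_ms = [(b - a) * 1e3 for a, b in zip(timestamps_s, timestamps_s[1:])]

# Flag gaps near the ~40 ms delayed-ACK timer.
suspicious = [g for g in gaps_ms if 30.0 <= g <= 50.0]

print([round(g, 1) for g in gaps_ms])  # [3.0, 40.0, 40.0, 3.0]
print(len(suspicious))                 # 2
```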

clayton-ho commented 2 weeks ago

@dnadlinger sorry for the late response; our experiment has been running, so I haven't found the time to investigate further. Will try to look into it this week.

In the meantime, I had a friend try to replicate the issue on their v8 kasli-soc system (the previous figures were for a Kasli v2.0); their results are attached.

Hardware involved:

[figure: kasli-soc hardware details and RPC profiling results]
clayton-ho commented 2 weeks ago

Attached is a wireshark packet capture taken during an RPC profiling experiment (experiment file included), as well as the HDF result file (all zipped up).

rpc_profiling_2024_11_09.zip


thomasfire commented 1 week ago

Seems to be related to these changes: https://github.com/m-labs/artiq/compare/release-8...release-7

[figure: diff of the suspected changes between release-7 and release-8]

After applying these reverts, the RPC latency went back to normal.

I'll investigate deeper to find out exactly which change caused this.

thomasfire commented 1 week ago

Follow-up: commit c60de48a300a80c7ee9c3f09322510727f14d741 seems to cause this issue. After reapplying it on top of the above-mentioned reverts, the RPC latency increased by around 40 ms again (the latency increase came in with smoltcp v0.8 and the related changes). @mbirtwell @hartytp

mbirtwell commented 1 week ago

I think @dnadlinger called it. The Nagle algorithm was added to smoltcp between 0.6 and 0.8; I've disabled it on all sockets. I think this will only have affected RPC calls while there was an unacknowledged async RPC. You can see this effect in the data @clayton-ho posted: the first RPC you timed (before doing any async RPCs) was fast, and all the others were slow.
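For context, the standard per-socket knob for this on ordinary BSD-style sockets is `TCP_NODELAY`; the smoltcp change does the equivalent on the firmware side. A minimal host-side sketch of the option itself (an illustration, not the actual ARTIQ fix):

```python
import socket

# Illustration only: disabling Nagle's algorithm on an ordinary TCP socket.
# Nagle holds back small segments until the previous one is ACKed; combined
# with the peer's delayed-ACK timer this produces stalls like the ~40 ms
# ones seen above. The ARTIQ fix disables the equivalent in smoltcp.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.create_connection(srv.getsockname())
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Non-zero means Nagle is off and small writes go out immediately.
nodelay = cli.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay)  # True

cli.close()
srv.close()
```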

clayton-ho commented 1 week ago

Implemented this fix on our system and RPC times are back to ~1 ms!

thanks for all the help @dnadlinger @thomasfire @mbirtwell