It might be worth looking at a Wireshark capture to see if there is anything strange happening/to get another layer of timestamps to correlate. I wonder whether this is related to the network stack (e.g. Nagle/delayed ACK)…
@dnadlinger Sorry for the late response; our experiment has been running, so I haven't found time to investigate further. I'll try to look into it this week.
In the meantime, I had a friend try to replicate the issue on their v8 Kasli-SoC system (the previous figures were for Kasli v2.0); results are attached.
Hardware involved:
Attached is a wireshark packet capture taken during an RPC profiling experiment (experiment file included), as well as the HDF result file (all zipped up).
This seems to be related to these changes: https://github.com/m-labs/artiq/compare/release-8...release-7. After applying these reverts, the RPC latency went back to normal.
I'll dig deeper to find out exactly which change caused this.
Follow-up: commit c60de48a300a80c7ee9c3f09322510727f14d741 seems to be the cause. After reapplying it on top of the reverts mentioned above, the RPC latency increased again by around 40ms (i.e. the smoltcp v0.8 update and related changes are where the increase comes from). @mbirtwell @hartytp
I think @dnadlinger called it. The Nagle algorithm was added to smoltcp between 0.6 and 0.8; I've disabled it on all sockets. I think this will only have affected RPC calls made while there was an unacknowledged async RPC. You can see this effect in the data @clayton-ho posted: the first RPC you timed (before doing any async RPCs) was fast and all the others were slow.
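For anyone unfamiliar with the option: the real change is to the smoltcp sockets in the core device firmware, but disabling Nagle on an ordinary host-side TCP socket is the same idea. A hedged Python illustration (address and port are placeholders, not part of the actual fix):

```python
import socket

# Illustration only: the actual fix is applied to the smoltcp sockets in the
# core device firmware, not to host code. On a regular BSD socket, disabling
# Nagle's algorithm (so small segments are sent immediately rather than being
# coalesced while an ACK is outstanding) looks like this.
s = socket.create_connection(("192.168.1.75", 1381))  # placeholder address/port
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
s.close()
```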
Implemented this fix on our system and RPC times are back to ~1ms!
thanks for all the help @dnadlinger @thomasfire @mbirtwell
Bug Report
One-Line Summary
Updated our system (i.e. flashed the Kasli firmware/gateware and upgraded the host software) from ARTIQ v7 to ARTIQ v8 and found that synchronous RPC times increased from a very consistent ~3ms on ARTIQ v7 to a highly variable ~45ms on v8.
Issue Details
Steps to Reproduce
Noticed it while running experiment code, but it can be reproduced by profiling synchronous RPC times, e.g. using the following code snippet.
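A minimal sketch of such a measurement (the actual snippet used was attached to the issue; this version assumes a core device entry named `core` in the device DB):

```python
from artiq.experiment import *


class SyncRPCTiming(EnvExperiment):
    """Time a batch of synchronous (blocking) RPCs from the kernel."""

    def build(self):
        self.setattr_device("core")

    def nop(self):
        # Host-side RPC target; does nothing, so the measured time is
        # dominated by the round-trip overhead.
        pass

    @kernel
    def bench(self):
        for i in range(len(self.durations)):
            t0 = self.core.get_rtio_counter_mu()
            self.nop()  # synchronous RPC: the kernel blocks until the host replies
            t1 = self.core.get_rtio_counter_mu()
            self.durations[i] = self.core.mu_to_seconds(t1 - t0)

    def run(self):
        n = 100
        self.durations = [0.0] * n
        self.core.reset()
        self.bench()
        print("mean sync RPC time: {:.2f} ms".format(
            1e3 * sum(self.durations) / n))
```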
Expected Behavior
Ideally, the delays would be fairly close to ~3ms, or at least <10ms, since a typical pulse sequence is ~15ms for us.
Actual (undesired) Behavior
Profiled the hardware time overheads on both our ARTIQ v7 and ARTIQ v8 systems:
Tried a few things to localize the problem:
Your System (omit irrelevant parts)