m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

Kasli DMA sustained event rate #946

Open cjbe opened 6 years ago

cjbe commented 6 years ago

The sustained DMA event rate is surprisingly low on Kasli. Using the experiment below, I find that the shortest pulse-delay time without underflow for a TTL output is:

For comparison, with the current KC705 gateware this is 128mu, and sb0 believes it should be closer to 48mu (3 clock cycles per event, https://irclog.whitequark.org/m-labs/2018-03-05).

(N.B. the RTIO clock for the DRTIO gateware is 150 MHz, vs 125 MHz for Opticlock)
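The 48mu figure can be checked with back-of-the-envelope arithmetic (a sketch, assuming a 125 MHz RTIO clock as on KC705, so 1 mu = 1 ns and one cycle is 8 ns, and counting a pulse as two events, one per edge):

```python
# Sanity check of the "3 clock cycles per event" estimate.
CYCLE_NS = 8          # 125 MHz RTIO clock -> 8 ns per cycle
EVENTS_PER_PULSE = 2  # rising edge + falling edge
CYCLES_PER_EVENT = 3  # sb0's estimate

min_period_mu = EVENTS_PER_PULSE * CYCLES_PER_EVENT * CYCLE_NS
print(min_period_mu)  # -> 48
```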

Experiment:

import numpy as np

from artiq.experiment import *


class DMASaturate(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("core_dma")
        self.setattr_device("ttlo0")

        t_mu = self.get_argument("period", NumberValue(128))
        self.t_mu = np.int64(t_mu)

    @kernel
    def run(self):
        tp_mu = 8  # pulse width in machine units

        self.core.reset()

        # Record 10000 pulses, spaced t_mu apart, into a DMA trace.
        with self.core_dma.record("ttl_local"):
            for _ in range(10000):
                self.ttlo0.pulse_mu(tp_mu)
                delay_mu(self.t_mu - tp_mu)

        # Fetch the handle once, then replay the trace repeatedly.
        h = self.core_dma.get_handle("ttl_local")

        self.core.break_realtime()
        for i in range(10):
            self.core_dma.playback_handle(h)

whitequark commented 6 years ago

Remote TTL is faster than local TTL?

cjbe commented 6 years ago

Yes, remote is faster than local. I was surprised by this too, but I verified that when there is no underflow I get the correct sequence (number of pulses on a counter) out of both the master and the slave.

jordens commented 6 years ago
sbourdeauducq commented 6 years ago

That's due to the analyzer interfering (it writes the full DMA sequence back to memory, using IO bandwidth and causing bus arbitration delays, DRAM page cycles, etc.). With the analyzer disabled I get 207mu instead of ~1150mu. No need to modify the gateware; disabling it in the firmware is sufficient:

--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -223,8 +223,8 @@ fn startup_ethernet() {
     io.spawn(16384, session::thread);
     #[cfg(any(has_rtio_moninj, has_drtio))]
     io.spawn(4096, moninj::thread);
-    #[cfg(has_rtio_analyzer)]
-    io.spawn(4096, analyzer::thread);
+    //#[cfg(has_rtio_analyzer)]
+    //io.spawn(4096, analyzer::thread);

     let mut net_stats = ethmac::EthernetStatistics::new();
     loop {
sbourdeauducq commented 6 years ago

The KC705 is less affected because the wider DRAM words make linear transfers (which is what the DMA core and the analyzer are doing) more efficient. We could reach similar efficiency on Kasli by implementing optional long bursts in the DRAM controller, and supporting them in the DMA and analyzer cores.
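To illustrate why the DRAM word width matters for linear transfers, here is a sketch. The bus widths below are assumptions for illustration (a wide 64-bit SODIMM interface on KC705 versus a narrow 16-bit DDR3 interface on Kasli; check the actual board specifications), with the standard DDR3 burst length of 8:

```python
# Bytes moved per DRAM access at DDR3 burst length 8,
# for two assumed (hypothetical) data-bus widths.
BURST_LEN = 8  # DDR3 fixed burst length

for board, bus_bits in [("KC705", 64), ("Kasli", 16)]:
    bytes_per_access = bus_bits // 8 * BURST_LEN
    print(board, bytes_per_access)  # KC705 -> 64, Kasli -> 16
```

Under these assumed widths, each access on the narrower bus moves a quarter of the data, so a linear DMA or analyzer transfer costs proportionally more accesses (and more arbitration/page overhead) on Kasli.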

cjbe commented 6 years ago

@sbourdeauducq I don't see how this should make local and remote TTL transactions take different time - could you reproduce this aspect?

cjbe commented 6 years ago

Right - if I am reading the SDRAM core correctly, it currently does not buffer reads and writes or optimise access patterns. So on Kasli during a DMA sequence, in the worst case of DMA and analyser data in the same bank:

So this broadly tallies with the opticlock figure of 530 ns / 2 = 265 ns per event = 33 cycles, but does not explain the ~1.1 us per event.

Whereas reading/writing a whole row would take 2+6+125+2 = 135 cycles for 2 KB = 111x 18-byte RTIO events, or just over one cycle per event. Hence, without the RTIO analyser, ~5 cycles per RTIO event taking into account the CRI write = 40 ns. Or just a cycle or two extra for the RTIO analyser writeback, assuming it is cached similarly.

So, depending on the effort required, it seems well worth implementing long bursts.
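The arithmetic above can be reproduced in a few lines (125 MHz clock, so 8 ns per cycle; the 2+6+125+2 cycle breakdown and the 111-events-per-row figure are taken from the comment as given, not re-derived):

```python
CYCLE_NS = 8  # 125 MHz RTIO clock -> 8 ns per cycle

# Observed opticlock figure: 530 ns per pulse, two events per pulse.
per_event_ns = 530 / 2
print(per_event_ns, round(per_event_ns / CYCLE_NS))  # -> 265.0 33

# Whole-row burst: 135 cycles to move ~111 18-byte RTIO events.
row_cycles = 2 + 6 + 125 + 2
events_per_row = 111
print(row_cycles / events_per_row)  # just over 1 cycle per event

# With ~5 cycles/event including the CRI write:
print(5 * CYCLE_NS)  # -> 40 (ns per event)
```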

sbourdeauducq commented 6 years ago

Here are the results I got:

Here is what I propose:

hartytp commented 3 years ago

@pca006132 how is the DMA performance on Zynq? Does the ARM RAM controller give better performance?

pca006132 commented 3 years ago

> @pca006132 how is the DMA performance on Zynq? Does the ARM RAM controller give better performance?

There is some debug code and cache flushing in the current artiq-zynq master. With those removed (and the cache flush moved to another location), we can get to 65mu.

Note that this is possible because the handle is reused every time. Cache flushing is a pretty expensive operation, so the time it would take to get the handle is not negligible.

Note: this is not using ACP, as it is not finished yet; I expect slightly better performance with ACP. Edit: ACP would not be used for DMA due to low bandwidth.

hartytp commented 3 years ago

cool! That's a big step forwards. Is that with the analyzer enabled? I remember there being quite a long tail to the underflow distribution where we'd very occasionally find that sequences which would normally run with quite a bit of slack would underflow. If that's also reduced it would be wonderful...

pca006132 commented 3 years ago

> cool! That's a big step forwards. Is that with the analyzer enabled? I remember there being quite a long tail to the underflow distribution where we'd very occasionally find that sequences which would normally run with quite a bit of slack would underflow. If that's also reduced it would be wonderful...

Yes, the analyzer is enabled; I could get some analyzer output:

OutputMessage(channel=4, timestamp=17094553753, rtio_counter=17094549496, address=0, data=1)
OutputMessage(channel=4, timestamp=17094553761, rtio_counter=17094549528, address=0, data=0)
OutputMessage(channel=4, timestamp=17094553818, rtio_counter=17094549560, address=0, data=1)
OutputMessage(channel=4, timestamp=17094553826, rtio_counter=17094549592, address=0, data=0)
OutputMessage(channel=4, timestamp=17094553883, rtio_counter=17094549624, address=0, data=1)
OutputMessage(channel=4, timestamp=17094553891, rtio_counter=17094549656, address=0, data=0)
OutputMessage(channel=4, timestamp=17094553948, rtio_counter=17094549688, address=0, data=1)
OutputMessage(channel=4, timestamp=17094553956, rtio_counter=17094549720, address=0, data=0)
OutputMessage(channel=4, timestamp=17094554013, rtio_counter=17094549752, address=0, data=1)
OutputMessage(channel=4, timestamp=17094554021, rtio_counter=17094549784, address=0, data=0)
OutputMessage(channel=4, timestamp=17094554078, rtio_counter=17094549816, address=0, data=1)
OutputMessage(channel=4, timestamp=17094554086, rtio_counter=17094549848, address=0, data=0)
OutputMessage(channel=4, timestamp=17094554143, rtio_counter=17094549880, address=0, data=1)
OutputMessage(channel=4, timestamp=17094554151, rtio_counter=17094549912, address=0, data=0)

So it should be working correctly, I think.