m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0
436 stars 201 forks source link

DMA and wide RTIO issues #1521

Closed pathfinder49 closed 1 year ago

pathfinder49 commented 4 years ago

Bug Report

One-Line Summary

The wide interface for Fastino shows several unexpected behaviors that make it unusable when combined with DMA.

Issue Details

Edit: These issues appear to arise from combining the wide interface and DMA The 32 channel wide API for Fastino is broken.

  1. When attempting to write 1000 updates to all channels at 2.55 MS/s using DMA, code can execute with no errors. However, after a short duration, updates stop being written to the hardware. No RTIO error is rasied when this occurs.
  2. Reducing the length of the DMA to 100 points results in RTIOUnderflow
  3. Using 1000 updates at a reduced update rate of 1.3 MS/s gives the same behavior as in 1.
  4. Using 100 updates at reduced update rate of 1.3 MS/s allows continuous updates to be written to the DAC. However, the resulting waveform does not match the expected waveform. This waveform should produce a saw-tooth with 100 voltage steps. However, there are only 20 steps. The period of the saw-tooth is ~78 us (as intended).
  5. Reducing DMA length to 10 updates at 1.3 MS/s results in RTIODestinationUnreachable errors after a few hardware updates.
  6. Reducing the update rate further results in the same behavior as points 3-5. However, the frequency of the saw-tooth produced in 4. remains unchanged. Both the frequency and voltage step number do not respond to update rate changes.

Steps to Reproduce

  1. Build Kasli gateware with log2_width parameter in artiq.gateware.eem.Fastino set to 5.
  2. Setup device_db with matching log2_width
  3. Run the following test experiment:
    
    from artiq.experiment import *

class DC_DMA(EnvExperiment): def build(self): self.setattr_device("core") self.setattr_device("core_dma") self.setattr_device("fastino_0") print("build")

def run(self):
    print("run")
    self.vlist = [0.0] * 32
    self.do()

@kernel
def do(self):
    self.core.reset()
    print("0")
    f = self.fastino_0
    self.record(f)
    handle = self.core_dma.get_handle("sawtooth")
    print("1")
    self.core.break_realtime()
    f.init()
    delay(1*us)
    f.update(0xffffffff)
    delay(1*us)

    for i in range(10):
        f.set_leds(0xaa)
        delay(.1*s)
        f.set_leds(0x55)
        delay(.1*s)

    print("start DMA")
    self.core.break_realtime()
    while True:
        self.core_dma.playback_handle(handle)

    self.core.wait_until_mu(now_mu())

@kernel
def record(self, f):
    with self.core_dma.record("sawtooth"):
        k0 = 14*7//2*8*2
        # k0 = k0 // 2  # uncomment for 2.55 MS
        # k0 * (1<<4)  # substantially reduce update rate
        n = 100  # length of DMA
        for i in range(n):
            for ii in range(32):
                self.vlist[ii] = -10.0 + i * 20.0/n
            f.set_group(0, self.vlist)
            delay_mu(k0) 


### Expected Behavior

*  RTIO failures should *always* be reported!
*  The wide interface should allow 2.55 MS updates
*  The written waveform should be returned
*  There should be no dependence on DMA length (barring memory size/speed)

### Your System (omit irrelevant parts)

* Operating System: Windows
* Version of the gateware and runtime loaded in the core device:
![image](https://user-images.githubusercontent.com/14295481/93472505-a60c0f80-f8ec-11ea-9e0d-890132e034e0.png)

* Hardware involved: Kasli 1.1 (speed grade 3) and Fastino 1.1

<!--
For in-depth information on bug reporting, see:

http://www.chiark.greenend.org.uk/~sgtatham/bugs.html https://developer.mozilla.org/en-US/docs/Mozilla/QA/Bug_writing_guidelines
-->
jordens commented 4 years ago

Let's dial it down a bit. You are making a few assumptions that are not true and others need to be checked first.

The wide interface should allow 2.55 MS updates

That would be nice but I guess you are aware of the limitations and uncertainties of DMA, wide RTIO, and DRTIO. Does it work without DMA for a few (let's say 16 each channel) samples? If yes, then there is no evidence that the Fastino interface is in fact broken as you claim and you need to look at DMA/DRTIO. Also the core analyzer may be very helpful here (unless you turned it off).

There should be no dependence on DMA length (barring memory size/speed)

It's likely that that has got nothing to do with Fastino.

The written waveform should be returned

That depends on the other assumptions being true.

RTIO failures should always be reported! RTIODestinationUnreachable

You may well be two different issues that have little to do with Fastino: DMA and DRTIO for wide events. I'd recommend isolating the three.

Also double check your code and wnsure that it's consistent and clean. The git hash on your master that you appear to be using (816a6f2ec7) is early 2018. Other things indicate newer code. Might also be a build system issue.

pathfinder49 commented 4 years ago

The problem persists with out using DMA. In fact, without DMA, I can't seem to get any continuous playback. My code is below. Please let me know if I misunderstand how to use the API.

With the update interval below, the code underflows. Increasing the delay by a factor 2, results in no errors and no hardware updates. Edit: This works fine, there was a silly mistake with mutable copies.

The gateware was built based of our local artiq variant. I will rebuild based off the latest version here (with the nessecary modification to gateware.eem.Fastino). Edit: the behavior is unchanged with the new build

from artiq.experiment import *

class Saw(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("fastino_0")
        print("build")

    def run(self):
        print("run")
        self.n = 10
        self.vlist = [[0.]*32] * self.n
        for i in range(self.n):
            for ii in range(32):
                self.vlist[i][ii] = -10.0 +i * 20.0/self.n
        self.do()

    @kernel
    def do(self):
        self.core.reset()
        f = self.fastino_0
        self.core.break_realtime()
        f.init()
        delay(1*us)
        f.update(0xffffffff)
        delay(1*us)

        for i in range(10):
            f.set_leds(0xaa)
            delay(.1*s)
            f.set_leds(0x55)
            delay(.1*s)

        print("start playback")
        self.core.break_realtime()
        delay(0.5)
        while True:
            self.play()
        self.core.wait_until_mu(now_mu())

    @kernel
    def play(self):
        k0 = 14*7//2*8*2
        k0 = k0 *(1<<10)
        for i in range(self.n):
            self.fastino_0.set_group(0, self.vlist[i])
            delay_mu(k0) 
jordens commented 4 years ago

Yes. Without DMA the sustained rate is limited by the CPU speed. Just test bursts (the ~16 samples I mentioned or whatever fits into an RTIO SED buffer, maybe up to 512?) with plenty of slack (i.e. add 100 ms of slack, then push the samples, then loop) to verify the Fastino interface.

pathfinder49 commented 4 years ago

Removing the loop and using 10 updates gives the following error: https://pastebin.pl/view/79e186da

In this scenario the hardware doesn't update at all. Edit: this was run with gate-ware derived from the most recent commit.

dnadlinger commented 4 years ago

You've probably neglected to update ARTIQ on the master as well – the RPC protocol recently changed.

pathfinder49 commented 4 years ago

That was indeed the cause of the error. With the host updated, the error no-longer occurs. There are now no errors and still no correct hardware playback.

jordens commented 4 years ago
self.vlist = [[0.]*32] * self.n

doesn't do what you think it does.

pathfinder49 commented 4 years ago

Indeed, it does not :sweat_smile: Fixed by converting to a list comprehension. The wide interface operates fine at 1.3 MS/s with up-to 500 updates (no DMA).

jordens commented 4 years ago

And 2.55 MS/s as well, right? And also the wide interface is not special to fastino. I don't know whether wide rtio was really tested with dma. This is either an intrinsic limitation or bug in dma/drtio.

pathfinder49 commented 4 years ago

writing a single burst at 2.55 MS/s using the wide interface and DMA, does update the DAC. However, it does not play back the correct waveform. If I write a 100 sample ramp, the ramp appears to be down sampled to a ~20 sample ramp.

jordens commented 4 years ago

Please elaborate and be thorough and precise. Give yourself some time. I'd have to guess what you mean by "appears to" and "down sampled".

pathfinder49 commented 4 years ago

To be precise, the code below produces the output shown in the table.

The ramp exhibits 19 distinct voltage levels. These are similarly, but not equally spaced. This is not the 100 sample ramp that was specified in the code, nor do the updates occur at 2.5 MS/s.

100 sample, 2.5 MS/s linear full scale ramp same ramp zoomed in on a voltage step
NewFile0 NewFile1
from artiq.experiment import *

class Saw_DMA(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("core_dma")
        self.setattr_device("fastino_0")
        print("build")

    def run(self):
        print("run")
        self.vlist = [0.0] * 32
        self.do()

    @kernel
    def do(self):
        self.core.reset()
        print("0")
        f = self.fastino_0
        self.record(f)
        handle = self.core_dma.get_handle("sawtooth")
        print("1")
        self.core.break_realtime()
        f.init()
        delay(1*us)
        f.update(0xffffffff)

        print("start DMA")
        self.core.break_realtime()
        self.core_dma.playback_handle(handle)

        self.core.wait_until_mu(now_mu())

    @kernel
    def record(self, f):
        with self.core_dma.record("sawtooth"):
            k0 = 14*7//2*8*2
            k0 = k0 // 2  # uncomment for 2.55 MS
            n = 100  # length of DMA
            for i in range(n):
                for ii in range(32):
                    self.vlist[ii] = -10.0 + i * 20.0/n
                f.set_group(0, self.vlist)
                delay_mu(k0) 
pathfinder49 commented 4 years ago

Running the code above with n=10 results in the experiment not terminating.

jordens commented 4 years ago

I thought we agreed to test without dma.

pathfinder49 commented 4 years ago

writing a single burst at 2.55 MS/s using the wide interface and DMA

I clearly stated I was using DMA. Given the issue name, testing without DMA does seem a bit stange.

hartytp commented 4 years ago

@pathfinder49 there is clearly an issue here. The question is where. Is it in the fastino wide interface, or the rtio/dma infrastructure when using the wide interface.

A good way of narrowing down the problem is taking a variable out of the equation — in this case by eliminating DMA. So the question is: do you see similar behaviour if you don’t use DMA (to the extent you can reproduce this given the finite sustained event rate/fifo size).

pathfinder49 commented 4 years ago

There is clearly a DMA issue. DMA does not work correctly at 1.3 MS/s. Non-DMA writes work correctly at 1.3 MS/s. I can attempt 2.55 MS without DMA. However, this is a different issue to the DMA incompatibility.

cjbe commented 4 years ago

@pathfinder49 can you post the output traces for a linear ramp with and without DMA at 1.3 MS/s, so we can directly illustrate this?

pathfinder49 commented 4 years ago

Below are 100 update, 2.55 MS/s, linear full scale ramps with and without DMA

With DMA Without DMA
NewFile4 NewFile3
hartytp commented 4 years ago

What’s the conclusion then? DMA is currently broken with wide RTIO in general?

hartytp commented 4 years ago

@pathfinder49 ignoring the breakage, can you summarise what we can achieve in terms of event rates for wide/narrow interfaces with / without DMA?

jordens commented 4 years ago

Please verify that you can reproduce working max rate bursts with Fastino (again, without DMA and DRTIO). That's the first checkpoint I'm looking for and its a requirement for everything else to work. After that you can look at DMA and DRTIO in any order but individually first.

As I mentioned explicitly before, you should dramatically reduce your problem size. Please take the advice, otherwise this will be painful. As a small example: A DRTIO link can currently at best sustain 1 Gb/s. The Fastino data at 32 channels and 2.55 MS/s is already more than 1.3 Gb/s and that's still without any overhead or round trips. I don't know even what the max practical throughput is for DRTIO and what it is for wide RTIO events. And I don't know what it is for DMA.

pathfinder49 commented 4 years ago

No data in this issue uses DRTIO. Please explain what further verification of the max burst rate performace you want. The plot above shows it working correctly at 2.55 MS/s without DMA.

pathfinder49 commented 4 years ago

@pathfinder49 ignoring the breakage, can you summarise what we can achieve in terms of event rates for wide/narrow interfaces with / without DMA?

Wide interface

jordens commented 4 years ago

Your screenshot above shows DRTIO being used and @dnadlinger 's comment about the master version points to it as well. You haven't given much information about your setup (we have the issue templates requesting that) so I have to speculate. Yes, the plot shows that it apparently works for you and it probably takes Fastino out of the problem. But it took three requests.

For DMA with wide RTIO or overall DMA throughput maybe @sbourdeauducq can shed some light, or @cjbe

dnadlinger commented 4 years ago

Not to add to the confusion (I certainly don't have to add anything to the debugging effort), but I was referring to master as in the artiq_master process, not as in the DRTIO node.

hartytp commented 4 years ago

Summary of discussion on IRC today: the tests we've already done seem to exclude this being an issue that's directly related to Fastino. Instead, it seems to be Fastino uncovering a bug in ARTIQ DMA, wide RTIO, etc (or some combination thereof).

The Fastino "narrow" interface does seem to work as expected, but the sustained event rate is rather low due to https://github.com/m-labs/artiq/issues/946

hartytp commented 4 years ago

Summarizing a long thread...

@pathfinder49 with reference to https://github.com/m-labs/artiq/issues/1521#issuecomment-695094661

With sufficent slack, wide interface events can be scheduled in bursts at 2.55 MS/s. The maximum burst length in my configuration appears to be a few 100s of events.

So, this is consistent with the claim that you can fill up the RTIO output FIFOs and drain them without issue, but the rate at which the CPU can generate wide events is much lower than 2.55MSPs.

Am I right in thinking that in this case you're updating all 32 channels with each sample?

Out of curiosity, do you have any idea what the maximum sustained event rate is for updating all 32 channels (using the _mu commands)? I'd assume that the rate in terms of samples/channel/second is still much higher than the narrow interface.

The wide interface results in bad output if combined with DMA

AFAICT you tested this at both 2.55MSPs and 1.3MSPs and found that in both cases there was what looked like missing samples (this is what "bad" means, right?). c.f. https://github.com/m-labs/artiq/issues/1521#issuecomment-695077546

This was tested with 100sample DMA records.

This could possibly be related to throughput (we haven't tested it with very low sample rates)? It's hard to tell without seeing screen shots of the behavior at different sample rates (I didn't see anything posted at 1.3MHz).

The narrow interface allows continuous updating with a maximum event rate of 1.3 MHz without using DMA

For reference/comparison with the above, this is updating only one channel at a time, and includes the CPU overhead for generating a sample (which is, unsurprisingly, much less than with the wide interface). It's not a direct comparison with the wide RTIO interface data above since that was for bursts.

The narrow interface allows sustained event rates of 2.55 MHz when using DMA (barring DMA improvements in the meantime)

I think that what you mean by this is 2.55MS/(channel*s), right? i.e. you can saturate a single channel (maybe even ~1.5 channels), but not two channels simultaneously.

But, importantly there were no glitches observed with the narrow interface.

Bursts with the narrow interface will need to be shorter than with the wide interface due to the RTIO buffer size limitation

You didn't mention the number of channels in your post, but I assume that here you mean that the number of (samples*channels) will be lower for narrow than wide interface due to the FIFO size.


tl;dr the narrow interface works about as well as expected from the DMA performance. The wide interface seems to hit a bug somewhere in ARTIQ when used with DMA. The wide interface may still be useful (e.g. if you want a few hundred updates on multiple channels, that's currently possible with the current wide interface, but not with the narrow interface), but long term someone needs to look at this.

@jbqubit FYI...if this is an issue with wide RTIO and DMA (which is a possibility given the above) it may also affect Sayma. As far as I'm aware there has been no testing of wide RTIO and DMA (but correct me if I'm wrong @sbourdeauducq ).

hartytp commented 4 years ago

For reference: the Zynq DMA is about a factor of 5-10 better than Kasli right now, presumably this is similar to the performance we could get with the current hardware by implementing https://github.com/m-labs/artiq/issues/946#issuecomment-371761818.

See cf https://github.com/m-labs/artiq/issues/946 for details.

aktentasche commented 2 years ago

Any chance this will be tackled soon? Or a point where to start in the ARTIQ codebase?

The wide interface seems to hit a bug somewhere in ARTIQ when used with DMA.

is not enough info to start unfortunately.

jordens commented 2 years ago

Any chance this will be tackled soon? Or a point where to start in the ARTIQ codebase?

Other than locating it in the DMA/RTIO code area there has not been any further pinpointing that I know of.

The wide interface seems to hit a bug somewhere in ARTIQ when used with DMA.

is not enough info to start unfortunately.

It is where we start debugging it.

sbourdeauducq commented 1 year ago

Likely fixed by ea9fe9b4e1b43b4cbb4c79dea722e469540d2530