RTIO scalable event dispatcher (fka. SRTIO)

sbourdeauducq commented 7 years ago

The current RTIO system uses one dedicated FIFO per output channel. While this architecture is fine for the first ARTIQ systems that were rather small and simple and even for single-crate Metlino/Sayma systems, it shows limitations on more complex ones. In order of decreasing importance:

with DRTIO, the master needs to keep track, for each FIFO in each satellite, of a lower bound on the number of available entries plus the last timestamp written. The timestamp is stored in order to detect sequence errors rapidly (and allow precise exceptions without compromising performance). When many satellites are involved, especially with DRTIO switches, the storage requirements become prohibitive.
with many channels in one device, the large muxes and the error detection logic that can handle all the FIFOs make timing closure problematic.
with many channels in one device, the FIFOs waste FPGA space, as they are never all filled at the same time.

This proposal addresses those issues:

only one lower bound on the available entries needs to be stored per satellite device for flow control purposes (let's call this "number of tokens"). Sequence errors no longer exist (non-increasing timestamps into one channel are permitted to an extent) so rapid detection of them is no longer required.
the events can be demultiplexed to the different channels using pipeline stages that ease timing.
only a few FIFOs are required and they are shared between the channels.

Drawbacks are slightly increased complexity, and latency increased by a few coarse RTIO cycles by the output network (6 cycles, i.e. typically 48ns, for 8 FIFOs - increases with the square of the logarithm of the FIFO count).
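The quoted figure (6 cycles for 8 FIFOs, growing with the square of the logarithm) is consistent with the depth of Batcher's odd-even merge sort network, which for n = 2^k inputs has k(k+1)/2 comparator layers. A rough sketch of that estimate follows; it assumes one pipeline stage per comparator layer and the 8 ns coarse period implied by "48ns", neither of which is confirmed as the final implementation. The function names are made up for illustration.

```python
def odd_even_mergesort_depth(n_inputs):
    """Comparator layers of Batcher's odd-even merge sort for n = 2**k inputs."""
    if n_inputs < 2 or n_inputs & (n_inputs - 1):
        raise ValueError("n_inputs must be a power of two >= 2")
    k = n_inputs.bit_length() - 1          # k = log2(n_inputs)
    return k * (k + 1) // 2


def output_latency_ns(n_fifos, coarse_period_ns=8.0):
    # Assumption: one coarse RTIO cycle per comparator layer.
    return odd_even_mergesort_depth(n_fifos) * coarse_period_ns


for n in (8, 16, 32):
    print(n, odd_even_mergesort_depth(n), output_latency_ns(n))
```

Under these assumptions, going from 8 to 16 or 32 FIFOs costs 10 or 15 cycles instead of 6.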

The proposed SRTIO core contains a configurable number of FIFOs that hold the usual information about RTIO events (timestamp, address, data), the channel number, and a sequence number. The sequence number is increased for each event submitted.

When an event is submitted, it is written into the current FIFO if its timestamp is strictly increasing. Otherwise, the current FIFO number is incremented by one (and wraps around, if the current FIFO was the last) and the event is written there, unless that FIFO already contains an event with a greater timestamp. In that case, an asynchronous error is reported. If the destination FIFO is full, the submitter is blocked.
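A small behavioural model of the submission rule above may help when reasoning about rewinds. This is only a sketch in plain Python, not gateware: the class and exception names are invented here, "strictly increasing" is interpreted as relative to the previously submitted event, and blocking on a full FIFO is modelled by raising an exception that the caller can retry on.

```python
from collections import deque, namedtuple

Event = namedtuple("Event", "timestamp channel seqn data")


class SequenceError(Exception):
    """Next FIFO already holds an event with a greater timestamp."""


class FifoFull(Exception):
    """Destination FIFO is full; real gateware would block the submitter."""


class SubmitModel:
    def __init__(self, n_fifos, depth):
        self.fifos = [deque() for _ in range(n_fifos)]
        self.depth = depth
        self.current = 0       # index of the current FIFO
        self.last_ts = None    # timestamp of the previously submitted event
        self.seqn = 0          # incremented for each event submitted

    def submit(self, timestamp, channel, data=0):
        target = self.current
        if self.last_ts is not None and timestamp <= self.last_ts:
            # Timestamp not strictly increasing: move to the next FIFO.
            target = (self.current + 1) % len(self.fifos)
            if any(ev.timestamp > timestamp for ev in self.fifos[target]):
                raise SequenceError(timestamp)
        if len(self.fifos[target]) >= self.depth:
            raise FifoFull(target)
        # Commit state only once the write is known to succeed.
        self.fifos[target].append(Event(timestamp, channel, self.seqn, data))
        self.current = target
        self.last_ts = timestamp
        self.seqn += 1
        return target


m = SubmitModel(n_fifos=2, depth=4)
print([m.submit(t, 0) for t in (100, 200, 150, 160)])
```

The model never drains the FIFOs, so it only shows which FIFO each event lands in and when a rewind runs out of usable FIFOs.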

Alternatively, the gateware could look for any FIFO that can accommodate the event (is not full, and does not contain an event with a greater timestamp) and block the submitter until it succeeds. This allows for greater FIFO utilization than blocking on the current FIFO when it is full. However, the gateware has very little time to find a usable FIFO, to avoid undermining the performance of the submitter. This can cause timing problems if the number of FIFOs is large.

At the output of the FIFOs, the events are distributed to the channels and simultaneous events on the same channel are handled using a structure similar to an odd-even merge-sort network that sorts by channel and sequence number. When there are simultaneous events on the same channel, the event with the highest sequence number is kept and a flag is raised to indicate that a replacement occurred on that channel. If a replacement was made on a channel that has replacements disabled, the final event is dropped and a collision error is reported asynchronously.
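The replacement and collision behaviour of the output stage can be illustrated in software. The sketch below uses Python's `sorted` in place of the sorting network and ignores sequence-number wrap-around; the function name and the `replace_enabled` mapping are hypothetical and only stand in for whatever per-channel configuration the gateware ends up using.

```python
def resolve_simultaneous(events, replace_enabled):
    """Resolve events leaving the FIFOs in the same coarse cycle.

    events: iterable of (channel, seqn, data) tuples.
    replace_enabled: dict channel -> bool (missing channels default to True).
    Returns (outputs, replaced, collisions).
    """
    outputs = {}
    replaced = set()
    # Sorting by (channel, seqn) makes the highest seqn per channel come last.
    for channel, seqn, data in sorted(events, key=lambda e: (e[0], e[1])):
        if channel in outputs:
            replaced.add(channel)      # replacement flag for this channel
        outputs[channel] = data        # later (higher seqn) event wins
    collisions = set()
    for channel in replaced:
        if not replace_enabled.get(channel, True):
            del outputs[channel]       # final event is dropped
            collisions.add(channel)    # reported asynchronously
    return outputs, replaced - collisions, collisions


events = [(3, 7, "a"), (1, 5, "b"), (3, 9, "c"), (2, 6, "d"), (2, 8, "e")]
print(resolve_simultaneous(events, {2: False}))
```

In this example channel 3 keeps the event with sequence number 9, while channel 2 (replacements disabled) outputs nothing and is flagged as a collision.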

Underflow errors are detected as before by comparing the event timestamp with the current value of the counter, and dropping events that do not have enough time to make it through the system.

The sequence number should be sized to be able to represent the combined capacity of all FIFOs, plus 2 bits that allow the detection of wrap-arounds.
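One possible reading of this sizing rule, as a sketch: enough bits to index every entry across all FIFOs, plus the 2 extra bits. Whether "represent the combined capacity" means indexing `capacity` entries or holding the value `capacity` itself is not spelled out above; the function below assumes the former, and its name is made up.

```python
def seqn_width(n_fifos, fifo_depth):
    """Sequence number width in bits (assumed interpretation, see lead-in)."""
    capacity = n_fifos * fifo_depth
    index_bits = max(1, (capacity - 1).bit_length())  # bits to count 0..capacity-1
    return index_bits + 2                             # 2 bits for wrap-around detection


print(seqn_width(8, 128))    # 8 FIFOs of 128 entries
print(seqn_width(16, 1000))  # 16 FIFOs of 1000 entries
```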

The maximum number of simultaneous events (on different channels), and the maximum number of timeline "rewinds", are equal to the number of FIFOs.

The SRTIO logic should support both synchronous and asynchronous FIFOs, which are used respectively for local RTIO and DRTIO.

To implement flow control in DRTIO, the master queries the satellite for tokens. The satellite can use as a token count the space available in its FIFO that has the least such availability.
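As a sketch of this flow-control scheme: the satellite reports the free space of its fullest FIFO, and the master spends one token per event, querying again only when it runs out. The names below are hypothetical and the query is modelled as a plain callable rather than a DRTIO packet exchange.

```python
def satellite_token_count(fifo_levels, fifo_depth):
    """Free space in the FIFO that has the least space available."""
    return min(fifo_depth - level for level in fifo_levels)


class MasterTokens:
    """Master-side view: a lower bound on the entries available in a satellite."""

    def __init__(self, query_satellite):
        self.query_satellite = query_satellite  # callable returning a token count
        self.tokens = 0

    def consume(self):
        """Return True if an event may be sent now, False if the satellite is full."""
        if self.tokens == 0:
            self.tokens = self.query_satellite()
        if self.tokens == 0:
            return False
        self.tokens -= 1
        return True


print(satellite_token_count([3, 10, 0, 7], fifo_depth=16))
```

Because any event might land in the fullest FIFO, the minimum is the only count that is safe to hand out without knowing the timestamps in advance.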

jordens commented 7 years ago

One clarification on FIFO level and timestamp storage: If we give room for 4k RTIO channels for a large system (SAWG is 10+ each), 64 bits for the TS estimate and 16 bits for the level estimate, then that would use 12 RAMB36E1. That is not prohibitive. Such a compact storage for that data obviously depends on enumeration and DRTIO switch support throughout (#619). The enumeration and flattening of the address space would also solve the hierarchical channel addressing problem that will crop up again in any case.

But as Sébastien explains, efficient usage of the RTIO data FIFOs at all levels is driving this design.

And it should read "... number of pending or in-flight timeline 'rewinds'". The total number is of course unlimited.

hartytp commented 7 years ago

Thanks for posting this @sbourdeauducq and @jordens.

Context

The most complex experiments we'd like to construct over the next 5 years will require (all numbers are rough estimates):
- 5 Sayma
- 30 channels Urukul + Novogorny Servo
- 200 channels of Zotino or a TBD fast (~MHz analog bandwidth) uTCA DAC for ion shuttling
- 32 DIO (PMT/APD, RF in for micromotion detection, controlling misc hardware in real time)
For this kind of system, we'd envisage 1-2 uTCA crate(s) and a few Euroracks. Metlino will be the master.
Will a Metlino have enough SFPs to directly control this kind of system? AFAICT, it currently has 3 downstream SFPs, but more could be supported by using FMC->SFP adapter boards. Currently, we'd probably want to use at the very minimum 1 of the 2 FMCs on Metlino for low-latency DIO. More FMCs will be available on the (as yet undesigned/unfunded) Metlino RTM.
So Metlino may just have enough SFPs to control this kind of experiment directly, without DRTIO switching, but it looks a bit marginal.
DRTIO switching would also be nice as, for example, it lets us add a few Kasli here and there to do small jobs in our lab without having to stress about using up our limited number of SFPs.
We will also construct some simpler/slower experiments with something like:
- 1-2 Sayma
- 30 channels of Urukul + Novogorny Servo
- 1-2 Zotino for trapping potentials/shuttling (no uTCA DACs)
- 32 DIO
For these systems, we'd like to avoid uTCA chassis altogether as they are expensive (chassis + Metlino likely to be ~£15k), bulky and have long lead times. Instead, mount the Sayma in stand-alone enclosures like the one Greg is designing (has designed?)
Use Kasli as master. Needs DRTIO switching as Kasli only has 2 downstream SFPs.
It may also be possible to use Metlino + FMC to SFP mezzanines without a uTCA.4 rack.

hartytp commented 7 years ago

Current situation

As I understand it:

The DRTIO master currently stores the following information for each DRTIO channel on each slave: timestamp of last event written (64 bits), lower bound on amount of space left in FIFO (16 bits). So, 10 bytes per DRTIO channel. (These numbers are from memory so may be slightly out, I'll check the exact numbers later and edit).
This information is stored for a fixed 1024 channels for each DRTIO slave, regardless of how many channels are actually used.
- A SPI/TTL PHY uses 1 DRTIO channel
- A Sayma RF output uses 10 channels for the SAWG (surely this is a bit excessive? Do we really need to adjust all the configuration settings in parallel?), as well as something like 1 for the RF switch and 1 for the attenuator. The ADCs + PGIA etc will also use some channels
Storing the timestamp + FIFO room on a per channel basis on the master allows the master to check for sequence errors and underruns as it writes events to the DRTIO slaves. This allows us to catch them in Python exceptions (what @sbourdeauducq refers to as "precise error handling"). If this info is stored on the slave then error handling would be done using the non-real-time DRTIO-aux channel. Note that currently, we get the error notification as soon as the event is loaded into the FIFO. Using a non-real-time method, we would have to wait a while -- possibly until after the experiment had finished.
Storing 1024*10 bytes for each DRTIO slave in RAM on the master limits the number of DRTIO slaves we can have. This limit is smaller than the number of DRTIO slaves we'd need for the experiments described above. Hence the need for a "scalable" solution.
As a separate issue, we need to decide how to map a DRTIO address ("flat") into a routing path through the switch tree to an actual device ("hierarchical"). This is a basic aspect of implementing DRTIO switching, and is not done by the work already funded by ARL (this is more limited as it's just for the AMC to RTM bridge).
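To make the storage figures in this comment concrete, here is the arithmetic as a short sketch. The field widths and the 1024-channel allocation are the from-memory numbers quoted above, so the results inherit that uncertainty; the constant and function names are only for illustration.

```python
TIMESTAMP_BITS = 64         # last timestamp written, per channel
LEVEL_BITS = 16             # lower bound on free FIFO space, per channel
CHANNELS_PER_SLAVE = 1024   # fixed allocation, used or not

BYTES_PER_CHANNEL = (TIMESTAMP_BITS + LEVEL_BITS) // 8


def master_storage_bytes(n_slaves, channels_per_slave=CHANNELS_PER_SLAVE):
    """State the master keeps for n_slaves DRTIO slaves."""
    return n_slaves * channels_per_slave * BYTES_PER_CHANNEL


def flat_storage_bytes(n_channels):
    """State needed if only the channels actually used are allocated."""
    return n_channels * BYTES_PER_CHANNEL


print(BYTES_PER_CHANNEL, master_storage_bytes(1), flat_storage_bytes(4096))
```

With these numbers, a flat table of 4k channels needs as much state as the fixed allocation for four slaves.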

hartytp commented 7 years ago

Scalable solutions

AFAICT in the above comments, there are two proposed scalable solutions:

Only allocate space on the master for the DRTIO channels that are actually used.
- This is not scalable to an arbitrary number of channels due to limited BRAM, but is scalable for the medium term (5+ years)
- 4k RTIO channels should be more than enough for the experiments we need to do in the next 5 years
- 4k RTIO channels requires 12 36k BRAMS. 100T has 120 36kBRAMs. So this should fit fine on either Kasli or Metlino. If we had to, we could probably reduce the limit to 2k RTIO channels without problems
SRTIO
- Move from 1 FIFO per RTIO channel to a small number of FIFOs per DRTIO slave device. FIFOs are shared between all RTIO channels on the device.
- No longer detect sequence/timing errors at the master end. Instead, they are detected when events are copied from the FIFOs to the RTIO channels. So, we lose "precise error handling".
- Extra (fairly small) latency penalty
- Makes timing easier and lowers FIFO resources needed by DRTIO slave FPGAs
- Places stricter requirements on the ordering of DRTIO events for a slave. In the current DRTIO implementation, events must be ordered within a DRTIO channel, but not between DRTIO channels. In the proposed SRTIO implementation, the sharing of FIFOs places a constraint on the ordering of events between separate RTIO channels. This has implications for latency compensation, and some of the ways we program our sequences (e.g. using a decorator to bracket a gate in a spin-echo sequence by jumping the timeline around).

Is that all about right so far, or have I misunderstood things?

sbourdeauducq commented 7 years ago

We don't lose precise exceptions with SRTIO. Underflow errors are exactly the same as before. Sequence errors are usually not an error anymore, a decreasing timestamp in a channel is only an error when we run out of usable FIFOs to "rewind" the timeline (and that one error can be precise). Collision errors must be asynchronous, but they already are in ARTIQ-3 to accommodate DRTIO requirements.

sbourdeauducq commented 7 years ago

Places stricter requirements on the ordering of DRTIO events for a slave.

How many in-flight timeline rewinds do you expect? If there are more SRTIO FIFOs than rewinds then you're fine.

hartytp commented 7 years ago

@sbourdeauducq Thanks for the clarifications about precise exceptions.

How many in-flight timeline rewinds do you expect? If there are more SRTIO FIFOs than rewinds then you're fine.

@cjbe is the right person to ask about this.

From my understanding, there are a couple of situations where we rewind the timeline:

Wrapping one event in another. For example, if a sequence of operations involving one or more RTIO channel(s) on a device (e.g. 1 laser) is wrapped in a sequence (e.g. spin-echo) involving other RTIO channel(s). E.g. if we start by adding a Ramsey sequence, with the second pi/2-pulse in the future, before adding all the gates that go within the Ramsey sequence. In this case, we immediately lose 1 FIFO. There are a couple of other situations where this can occur, eating into the FIFOs.
Latency compensation: for readout and state prep, we often have AOM pulses of a few hundred ns using a few lasers, which can have up to a µs of latency. In this case, interleaving pulses with different lasers also requires rewinds.

8 FIFOs might be cutting it a bit close, 16 or 32 would be better.

hartytp commented 7 years ago

AFAICT, Proposal 1 ("Only allocate space on the master for the DRTIO channels that are actually used") seems to be a better fit to our use cases than SRTIO because:

while it is a non-trivial amount of work, it still sounds simpler than SRTIO. Is that correct?
Proposal 1 (4k RTIO channels distributed between all DRTIO slaves) works for all of our anticipated use cases for the next 5 years or so. I'm not sure it's worth trying to make something that scales beyond that, because I'm not even sure our needs are well defined beyond that.
SRTIO introduces extra latency to DRTIO. In general, latency and speed are our biggest concern with ARTIQ. IIRC, Xilinx transceivers limit DRTIO latency to ~300ns, so while the extra latency isn't huge, it's not totally negligible. In general, I don't want to do anything that makes ARTIQ slower unless we absolutely have to.
We need to think a bit more about how many FIFOs we'd need to avoid rewinding issues if we used SRTIO. If we make this number too large, the advantage of SRTIO is lost anyway...

sbourdeauducq commented 7 years ago

while it is a non-trivial amount of work, it still sounds simpler than SRTIO. Is that correct?

Maybe a bit, but it is not trivial either to allocate addresses and distribute the memory among the multiple DRTIO master cores in a way that results in good performance (we may want to use multiple DRTIO links at the same time later, e.g. with a more powerful DMA core), meets timing in slow Xilinx silicon, and is not too ugly. SRTIO doesn't require memory and thus avoids this issue entirely.

hartytp commented 7 years ago

@sbourdeauducq Thanks for the clarification.

I need to have more of a think about this and get back to you.

In the meantime, since this is a pretty major infrastructural change to ARTIQ, I'd be interested to hear opinions from other users @dhslichter @jboulder etc...

dhslichter commented 7 years ago

I need some time to think hard about this, and I would also recommend including others like @dleibrandt @amhankin @r-srinivas @dtcallcock in the thought process.

Several initial thoughts:

simple is better. To me, it seems that re-architecting the DRTIO into this SRTIO is potentially fraught with pitfalls, while there could be some considerably lower-hanging fruit that will get us where we need to go in the next ~few years as @hartytp has suggested. It is up to @sbourdeauducq and @jordens to clarify exactly how these two options might stack up against each other in terms of complexity to implement. For example, how about changing the 1024 timestamps per slave to 512? Are there any slaves that will really have more than 512 channels? Can we trim the number of RTIO channels for SAWG?
to what degree does this decision impact hardware design? It seems that the choice between the two of @hartytp's proposed solutions is relatively hardware-independent for the next ~5 years or so, based on available resources for option 1 in Kasli/Metlino. If the choice of SRTIO or modified DRTIO storage on the core device can be done simply with gateware modifications, the risk is somewhat mitigated because there can be a fallback.

On the SRTIO idea in particular:

would it defeat the purpose if we reduce the FIFO depths, but not the number of FIFOs, and then allow them to run in this "shared FIFO pool" manner? This would allow us to use the FIFO space much more efficiently (and thus spare FPGA resources), and would allow us to push the collision detection to the slaves as well, but would help avoid some of the potential issues of FIFO blocking on unusual pulse sequences such as what @hartytp and @cjbe have discussed.
I am in general very wary of things which can handicap full generality for pulse sequence generation.
I am OK with asynchronous reporting of collisions from slaves as long as it occurs at some regular intervals (i.e. doesn't necessarily wait for an entire kernel to complete running before the error is reported -- maybe once per ms or so?).

sbourdeauducq commented 7 years ago

simple is better. To me, it seems that re-architecting the DRTIO into this SRTIO is potentially fraught with pitfalls

See my comment above, DRTIO with switches/many devices and without SRTIO isn't very nice either.

to what degree does this decision impact hardware design?

It does not, this is all gateware.

would it defeat the purpose if we reduce the FIFO depths, but not the number of FIFOs, and then allow them to run in this "shared FIFO pool" manner?

The core would not move to the next FIFO if the current one is full (only if you send a decreasing timestamp). The space in the SRTIO FIFOs cannot be combined arbitrarily.

I am in general very wary of things which can handicap full generality for pulse sequence generation.

The current architecture does not allow you to go back in time on the same channel, whereas SRTIO does.

I am OK with asynchronous reporting of collisions from slaves as long as it occurs at some regular intervals (i.e. doesn't necessarily wait for an entire kernel to complete running before the error is reported -- maybe once per ms or so?).

Asynchronous errors already exist in ARTIQ-3 and are reported rapidly, via the core device log; latency is variable but ~ms at most.

dleibrandt commented 7 years ago

The core would not move to the next FIFO if the current one is full (only if you send a decreasing timestamp). The space in the SRTIO FIFOs cannot be combined arbitrarily.

This seems somewhat restrictive to me. I find that in practical use, a few of the RTIO channels handle the majority of the total events, and that it is necessary to increase the FIFO depths of those channels to be quite high. With the current non-scalable RTIO, it is easy to allocate long FIFOs to the channels that need them. But it sounds like with SRTIO, I would have to increase all of the FIFO depths to achieve the same effect, which may not be possible given the resource limitations. Would it be possible to relax this restriction?

sbourdeauducq commented 7 years ago

But it sounds like with SRTIO, I would have to increase all of the FIFO depths to achieve the same effect, which may not be possible given the resource limitations. Would it be possible to relax this restriction?

I think so. See the "Alternatively, the gateware could look for any FIFO..." paragraph above. It may cost a few more cycles to submit events though, or at least enlarge the underflow window by a few cycles (with no throughput penalty).

hartytp commented 7 years ago

Thanks @dhslichter and @dleibrandt for the feedback.

@sbourdeauducq The feeling here is that we're leaning towards the "Only allocate space on the master for the DRTIO channels that are actually used" rather than the "SRTIO" proposal because:

Previously discussed concerns about timeline rewind behaviour under the proposed SRTIO implementation (ping @cjbe if you want more details about this). Although, the fact that SRTIO lets one rewind the timeline on a single channel is nice.
Similar concerns to @dleibrandt about FIFO usage under SRTIO
SRTIO as described above would increase the DRTIO latency -- particularly if we end up needing more SRTIO FIFOs, or implementing the empty FIFO search when the FIFO gets full

Maybe a bit, but it is not trivial either to allocate addresses and distribute the memory among the multiple DRTIO master cores in a way that results in good performance (we may want to use multiple DRTIO links at the same time later, e.g. with a more powerful DMA core), meets timing in slow Xilinx silicon, and is not too ugly. SRTIO doesn't require memory and thus avoids this issue entirely.

How bad are these issues?

Are you running into a lot of timing issues at the moment that SRTIO would help with?

I haven't thought too much about DDMA etc. If you think that SRTIO is the only clean way to implement it then that could be a strong argument, but I'd still like to explore other options given my previous comments.

dhslichter commented 7 years ago

My sense is that SRTIO as proposed, or some variant thereof, is probably where things will need to go in the long term, but that the near-term cost and delays and debugging from implementing it appear to me (and to the others who have commented above) to be problematic relative to the "hack" fix of DRTIO with "only allocate space on the master for the DRTIO channels that are actually used", as described by @hartytp above. So we'd like to aim for the latter, while continuing the design discussions for SRTIO so that we will be ready to implement something more scalable like that down the road a few years.

jordens commented 7 years ago

@dleibrandt I'd estimate around 8-16 SRTIO FIFOs. Making all of them large enough to handle the max backlog does not sound too restrictive if one considers that it saves a lot of unused memory from "current RTIO". Also is enlarging the FIFOs just a workaround for a small sustained event rate? If yes, then avoiding (this) SRTIO design because it makes a workaround for another bottleneck harder to implement is a priority inversion. If that workaround is needed and ends up hard to do with SRTIO, one should work hard on increasing the sustained event rate (any or a combination of the options that are floating around) at the same time as SRTIO lands.

In general I expect problems with the current RTIO design in the near term, even just on Sayma without DRTIO. Ultimately nearly all RTIO channels will be connected to multiple large MUXes (monitoring, injection, analyzer, DMA, the CPU bus). I am already seeing timing and routing issues when building some configurations of phaser. That will only get worse when the number of 10 RTIO channels for each Sayma DAC channel goes to >25 and each of them also receives mon+inj+proper analyzer support (@hartytp as soon as somebody wants to be able to change just (any) two settings in parallel, there need to be that many RTIO channels. This is an issue that I have brought up already and it ended up in the SAWG design like this. You may discuss this with @jbqubit. But also notice how the exact same problem of having many RTIO channels to do parallel things occurs even at the leaf. SRTIO as a generic pattern would also help here).

I am all in favor of implementing that address space enumeration and flattening that we proposed above as a first step. It is at least somewhat orthogonal to SRTIO and can be tackled first. But @sbourdeauducq and I don't completely agree on the estimated complexity of that (let's call it "flat DRTIO") and the SRTIO design above.

dhslichter commented 7 years ago

@jordens ack. Sounds like we should keep thinking about how to implement SRTIO in a good way and hammer out a good initial design spec in the near term that will handle the various issues discussed above, even as hacks are applied to the current DRTIO to keep it going.

dleibrandt commented 7 years ago

Also is enlarging the FIFOs just a workaround for a small sustained event rate? If yes, then avoiding (this) SRTIO design because it makes a workaround for another bottleneck harder to implement is a priority inversion. If that workaround is needed and ends up hard to do with SRTIO, one should work hard on increasing the sustained event rate (any or a combination of the options that are floating around) at the same time as SRTIO lands.

Partially, yes, I had enlarged the FIFOs used for outputting the sideband cooling pulses to avoid underflow errors. Presumably, this will no longer be necessary in ARTIQ 3 when using DMA, and I agree that further improvements should be made to further increase the sustained event rate rather than letting this limitation drive similar design decisions.

I had also enlarged the FIFO connected to my PMT, so that I can do long detection pulses (with thousands of input events) for micromotion time correlation type measurements. This use case will not be addressed with a higher sustained (output) event rate, although there are probably other ways to address this.

In any case, if we can make all of the SRTIO FIFOs something like 1e4 events deep, then this is a nonissue.

hartytp commented 7 years ago

@sbourdeauducq Maybe a bit, but it is not trivial either to allocate addresses and distribute the memory among the multiple DRTIO master cores in a way that results in good performance (we may want to use multiple DRTIO links at the same time later, e.g. with a more powerful DMA core), meets timing in slow Xilinx silicon, and is not too ugly. SRTIO doesn't require memory and thus avoids this issue entirely.

@jordens I am all in favor of implementing that address space enumeration and flattening that we proposed above as a first step. It is at least somewhat orthogonal to SRTIO and can be tackled first. But @sbourdeauducq and I don't completely agree on the estimated complexity of that (let's call it "flat DRTIO") and the SRTIO design above.

In general I expect problems with the current RTIO design in the near term, even just on Sayma without DRTIO.

Assuming:

"flat DRTIO" (FDRTIO?) is not significantly simpler/cheaper/quicker to implement than SRTIO
FDRTIO is orthogonal to SRTIO in the sense that implementing FDRTIO will not significantly reduce the cost/speed/risk of implementing SRTIO later
We will need (some form of) SRTIO in the near future to overcome timing issues etc.

If those assumptions are roughly valid, I'd prefer to go straight for SRTIO.

hartytp commented 7 years ago

For SRTIO: let's assume we go for 16 FIFOs, each with a depth of ~1e4 events or more.

That seems to resolve the concerns about FIFO depth.

@sbourdeauducq @jordens can you confirm what the expected latency will be for: current DRTIO (limited by transceivers?) and SRTIO?

@cjbe Does having 16 FIFOs per DRTIO slave resolve your timeline unwinding concerns? If not, can you post some more details about your exact concerns so we can look for a solution?

Does anyone else see any other potential issues with SRTIO as outlined by @sbourdeauducq ?

sbourdeauducq commented 7 years ago

if we can make all of the SRTIO FIFOs something like 1e4 events deep

This doesn't sound very good, with SRTIO all FIFO entries need to be able to hold the data of the PHY with the widest data, which is quite large (100's of bits) with the SAWG. With 16 SRTIO FIFOs we are well into the megabytes of BRAM with that approach.

We can do:

more flexible distribution of the events into the SRTIO FIFOs (https://github.com/m-labs/artiq/issues/778#issuecomment-314735070), and/or
devise a new scheme where events with wide data occupy several consecutive FIFO slots. Besides some more practical issues, there is a fundamental limit on the event rate as it would take several clock cycles to transfer a wide data event from the FIFO to the PHY, which may or may not be a problem with the SAWG (@jordens ?)

sbourdeauducq commented 7 years ago

To get an idea of what amount of BRAM is reasonable: the FPGA that will support the SAWG is a KU040, with 21.1 megabits of BRAM in total.
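A back-of-the-envelope check of the BRAM budget, as a sketch. The 256-bit entry width is purely an assumed stand-in for "100's of bits" (the real figure depends on the widest PHY plus timestamp, address, channel and sequence number); the FIFO count and depths are the values floated in this thread, and a depth of 1e3 is included for comparison.

```python
KU040_BRAM_BITS = 21.1e6    # total BRAM quoted above for the KU040
ASSUMED_ENTRY_BITS = 256    # hypothetical width of one FIFO entry


def fifo_storage_bits(n_fifos, depth, entry_bits=ASSUMED_ENTRY_BITS):
    """Raw storage needed by n_fifos FIFOs of the given depth."""
    return n_fifos * depth * entry_bits


for depth in (10_000, 1_000):
    bits = fifo_storage_bits(16, depth)
    print(depth, bits / 1e6, "Mbit", bits / KU040_BRAM_BITS)
```

Under this assumed width, 16 FIFOs of 1e4 entries would need roughly twice the whole device's BRAM, while 1e3 entries would use about a fifth of it.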

hartytp commented 7 years ago

True, 1e3 events per FIFO is probably more realistic. @dleibrandt @cjbe Would that be enough for you?

If not, we should consider @sbourdeauducq's "more flexible distribution of events into the SRTIO FIFOs" proposal. I'd like to avoid this if possible, as it sounds like it will increase the cost/complexity of SRTIO. Also, depending on how it's implemented, it could increase the latency further, which we're very keen to avoid if possible.

jordens commented 7 years ago

RTIO data width: This is not all that relevant to SAWG. There it's probably fine to (and that is what splitting a wide event over multiple smaller ones does effectively) increase the minimum time between two events on a channel by a factor of four or so. The event width is most relevant to raw DACs/ADCs that need ~2048 bit per event (at 125 MHz RTIO coarse, 16 GS/s, 16 bit). They would want to be fed bursts at full speed (independent of whether that can be sustained long term by the DRTIO links or DRAM). But splitting RTIO events becomes irritatingly inefficient when we hit a size of about 256 bit. Around that size we'd want/need to look into compressing the timestamp as well.
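The ~2048-bit figure follows directly from the quoted rates; a quick sketch of the arithmetic, plus how many FIFO slots such an event would occupy if it were split. The 256-bit slot width is the size mentioned above as the point where splitting gets inefficient and is used here only as an example; the function names are made up.

```python
def bits_per_event(sample_rate_hz, rtio_coarse_hz, bits_per_sample):
    """Data one raw DAC/ADC channel needs per coarse RTIO cycle."""
    samples_per_cycle = sample_rate_hz // rtio_coarse_hz
    return samples_per_cycle * bits_per_sample


def slots_needed(event_bits, slot_bits):
    """FIFO slots occupied if a wide event is split (ceiling division)."""
    return -(-event_bits // slot_bits)


width = bits_per_event(16_000_000_000, 125_000_000, 16)
print(width, slots_needed(width, 256))
```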

hartytp commented 7 years ago

Consensus of internal discussions about this:

With 16 FIFOs we do not anticipate any timeline unwinding issues
1e3 event depth for all FIFOs should be sufficient for us
Latency is a serious concern for us, but the numbers @sbourdeauducq mentioned earlier look just about acceptable (but lower would be better).

@sbourdeauducq:

Please could you confirm the expected latency for: current DRTIO implementation; proposed SRTIO implementation given above parameters; SRTIO with switching.
Re your proposal for "more flexible distribution of events into the SRTIO FIFOs": could you elaborate on your comment that this could be possible without increasing latency at the cost of "enlarge[ing] the underflow window by a few cycles (with no throughput penalty)."? Can you also give me an indication of how expensive this feature would be to implement?
Are there any other factors we should think about/things that need specifying for this?

sbourdeauducq commented 7 years ago

I didn't say it wouldn't impact latency, I said it wouldn't impact throughput. The basic idea is to pipeline the FIFO selection system, and increase the underflow margin so that the event has time to make it through that pipeline. The pipeline increases latency by the number of stages it has, but you can submit a new event before the previous one has made it through. The pipeline system would need some heuristics, since the selection of a FIFO for one event may impact the decision for the very next event (e.g. if the FIFO becomes full as a result of the writing of the first event). All those short dependencies would have to be eliminated.

hartytp commented 7 years ago

> I didn't say it wouldn't impact latency, I said it wouldn't impact throughput. The basic idea is to pipeline the FIFO selection system, and increase the underflow margin so that the event has time to make it through that pipeline. The pipeline increases latency by the number of stages it has, but you can submit a new event before the previous one has made it through. The pipeline system would need some heuristics, since the selection of a FIFO for one event may impact the decision for the very next event (e.g. if the FIFO becomes full as a result of the writing of the first event). All those short dependencies would have to be eliminated.

Okay, thanks for clarifying that. Let's not do this, at least for now. With a 1k FIFO depth, this shouldn't be necessary.

sbourdeauducq commented 7 years ago

The major source of additional latency in SRTIO is the output/sorter stage after the FIFOs (a few RTIO clock cycles as I mentioned in the first post). DRTIO latencies are not significantly affected by SRTIO, and are dominated by the transceivers.

hartytp commented 7 years ago

great.

jordens commented 7 years ago

Funded by Oxford, for local RTIO and DRTIO, DRTIO trees and Kasli Repeater and Master modes, outputs only. @sbourdeauducq Let's condense (or copy) the final design into a wiki page (or code documentation).

sbourdeauducq commented 7 years ago

Let me just write that code first, I need to do some experimentation e.g. with the FIFO selector, then it can be documented.

jbqubit commented 7 years ago

Splendid! This will help Sayma RTIO and DSP modulation.

sbourdeauducq commented 7 years ago

I have an idea to spread the events among the FIFOs without requiring complex, slow, large or high-latency logic. Do what @jordens proposes (switch to the next FIFO if the timestamp is not strictly increasing), but also switch to the next FIFO if the current one had been full.

It makes sequence errors more obscure though, as the error condition isn't just "the timeline has been rewound too many times in too little time" anymore.

The "spread" feature could easily be made optional and enabled/disabled at gateware compile time.
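
A toy software model of this selection rule might look as follows. It is a sketch under simplifying assumptions, not the gateware: all names are hypothetical, FIFOs are never drained, and the check that would raise a sequence error when wrapping onto a FIFO with later timestamps is left out.

```python
class LaneSelector:
    """Toy model of the FIFO selection rule described above.

    Hypothetical names throughout; this is not the gateware implementation.
    """

    def __init__(self, n_fifos, depth, spread=True):
        self.fifos = [[] for _ in range(n_fifos)]
        self.depth = depth
        self.spread = spread          # the optional "spread" feature
        self.current = 0
        self.last_timestamp = None    # last timestamp written to the current FIFO
        self.was_full = False         # current FIFO filled up on the last write

    def _advance(self):
        self.current = (self.current + 1) % len(self.fifos)

    def write(self, timestamp):
        """Pick a FIFO for one event and return its index."""
        rewound = (self.last_timestamp is not None
                   and timestamp <= self.last_timestamp)
        if rewound or (self.spread and self.was_full):
            self._advance()
        fifo = self.fifos[self.current]
        if len(fifo) >= self.depth:
            raise RuntimeError("selected FIFO is full; real hardware would stall")
        fifo.append(timestamp)
        self.last_timestamp = timestamp
        self.was_full = len(fifo) >= self.depth
        return self.current


selector = LaneSelector(n_fifos=4, depth=2)
lanes = [selector.write(t) for t in (10, 20, 30, 15, 16)]
print(lanes)  # [0, 0, 1, 2, 2]
```

In this run the third event moves on because FIFO 0 had filled up, and the fourth moves on because its timestamp is not strictly increasing. With spread=False the third event would instead hit the full FIFO 0.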

hartytp commented 7 years ago

@sbourdeauducq I'd be fine with that.

sbourdeauducq commented 7 years ago

Key components are there (in the rtio-sed branch) with simulations/unit tests; needs gluing, adding a few relatively simple things like FIFOs and timestamp counter, and testing on the board.

dhslichter commented 7 years ago

I think this sounds fine, provided that the number of FIFOs is perhaps 2-3x a "typical" number of timeline rewinds. I know this is a hard number to pin down, but maybe for now a reasonable first pass would be roughly 8 TTLs per FIFO? Will things need to be substantially more aggressive to achieve the resource efficiencies we desire with SRTIO?

sbourdeauducq commented 7 years ago

8 TTLs per FIFO?

dhslichter commented 7 years ago

I should perhaps say 8 RTIO channels (not necessarily TTLs) per FIFO. Basically this would be an 8x reduction in the number of FIFOs relative to the current design. They would potentially need to be made deeper, but not 8x as deep.

This is just a straw man, not something I have calculated out in great detail. But I think the point is that we want to have enough FIFOs that using the arbitration scheme discussed above for stuffing events into FIFOs, we are very unlikely to end up blocking while we wait for FIFOs to clear. Since it's hard to say how many timeline rewinds will be in a given experiment in general, I think that it's perhaps easier to guess about what fraction of RTIO channels will experience timeline rewinds, and apportion the number of FIFOs in relation to the number of RTIO channels accordingly.

The general problem of figuring out how the system will behave under arbitrary experiments is very hard, and if you have an experiment which rewinds the timeline several times you can quickly run out of FIFOs. That's why I suggested this ratio as a (hopefully) somewhat conservative number of FIFOs to use. We have discussed things like 16 or 32 FIFOs above, but that's not really meaningful unless you put it in the context of how many RTIO channels those FIFOs are serving. Assuming they are for RTIO channel counts of the sort currently implemented, the 8 channels/FIFO ratio seems generally in the right ballpark.

In the Oxford use case, from my understanding, it is possible to have nested timeline rewinds, such that each subsequent rewind goes back farther than the previous one -- this is a natural construction for many quantum information experiments, but ends up being very FIFO-intensive because each rewind will need to go to a new FIFO.
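
A rough way to see why nested rewinds are FIFO-intensive is to count how often a submitted sequence fails to increase. The sketch below assumes that every non-increasing timestamp moves to a fresh FIFO and that no FIFO is drained and reused in the meantime; the sequence generator is a made-up example, not a real experiment.

```python
def fifos_needed(timestamps):
    """Count FIFOs used if every non-increasing timestamp moves to a new FIFO.

    Simplifying assumption: FIFOs are never drained and reused while the
    sequence is being submitted.
    """
    used = 1
    for previous, current in zip(timestamps, timestamps[1:]):
        if current <= previous:
            used += 1
    return used


def nested_rewinds(levels, events_per_level=3, step=10):
    """Hypothetical sequence in which each rewind goes back farther than the last."""
    timestamps = []
    for level in range(levels):
        start = (levels - level) * 100  # each level starts earlier than the previous one
        timestamps += [start + i * step for i in range(events_per_level)]
    return timestamps


sequence = nested_rewinds(levels=4)
print(sequence)
print(fifos_needed(sequence))  # 4: one FIFO per nesting level
```

Under these assumptions the FIFO count grows linearly with the nesting depth, regardless of how few events each level contains.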

sbourdeauducq commented 7 years ago

Local SED now works on the board, except that DMA systematically produces an underflow for some reason. And I want to look into reducing the output latency a bit.

sbourdeauducq commented 7 years ago

The SED latency at the FIFO output is:

1 cycle for comparing the timestamp to the counter.
6 cycles for the output network - increases as log²(lane count).
1 cycle for looking up for each channel whether replacement is permitted or not, generating collision errors, and removing those events that have them.
1 cycle for distributing the data of the lanes to the channels and demultiplexing and driving strobe signals.

This makes a total of 9 cycles.

With traditional RTIO, there is only 1 cycle of latency at the FIFO output for comparing timestamps.

The latency at the FIFO input is reduced by 1 cycle, as the buffer/guard time mechanism for replacements is no longer necessary. So in total, SED increases the latency by 7 RTIO clock cycles.
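
The cycle budget above can be reproduced with a few lines of arithmetic. The depth formula k(k+1)/2 with k = log2(lane count) is an assumption: it is the depth of a Batcher-style sorting network and matches both the 6 cycles quoted for 8 lanes and the log² scaling, but the thread does not state the exact formula. Function names are hypothetical.

```python
def output_network_depth(lanes):
    """Depth of a Batcher-style sorting network for a power-of-two lane count.

    Assumption: the network is pipelined with one stage per RTIO cycle.
    """
    k = lanes.bit_length() - 1
    if lanes < 1 or (1 << k) != lanes:
        raise ValueError("lane count must be a power of two")
    return k * (k + 1) // 2


def sed_output_latency(lanes):
    """Cycles from FIFO output to channel, following the list above."""
    compare = 1      # timestamp vs. counter comparison
    replace = 1      # replacement lookup and collision detection
    distribute = 1   # lane-to-channel distribution and strobes
    return compare + output_network_depth(lanes) + replace + distribute


def added_latency(lanes, legacy_output=1, input_saving=1):
    """Extra cycles relative to traditional RTIO."""
    return sed_output_latency(lanes) - legacy_output - input_saving


print(output_network_depth(8), sed_output_latency(8), added_latency(8))  # 6 9 7
print(added_latency(16))  # 11 under the assumed depth formula
```

At a 125 MHz coarse clock (8 ns per cycle), the 7 extra cycles for 8 lanes correspond to 56 ns.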

dhslichter commented 7 years ago

@sbourdeauducq this is nice. However, to continue the discussion from #40, if there is automatic latency compensation, you will get very different FIFO usage depending on the order in which pulses in a with parallel block are given (as the latency compensation might render them in order or out of order, after being applied). It seems that it would make sense, to the extent possible, to bring the with interleave up to speed to help with this. I am afraid that very simple, casual choices (the ordering in which events in a with parallel are listed) have the potential to create RTIO errors due to FIFOs filling and blocking. I know that your example in #40 shows that for just a few pulses, the FIFOs are able to handle things, but remember that if the number of FIFOs is significantly less than the number of RTIO channels, and if the experiment is complex (many RTIO channels on one device all firing at roughly the same time, especially if you are doing a DMA of something like sideband cooling for multiple ion zones in parallel, say), you may hit these sorts of limitations early.

My point with all of this is that the simple latency compensation in #40 with negative delays does not necessarily play nicely with SED, and I would advocate that the best way to handle things would be to modify the way the latency compensation is handled (e.g. to have it done in software before compile time, to the extent possible, with features like with interleave) so that we avoid emitting out-of-order timestamps from the kernel when at all possible.
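
The ordering concern can be illustrated with a small sketch. The channel names and compensation values below are invented, and sorting in software stands in for whatever an interleave transformation would actually do; the point is only that the listing order decides how many non-increasing timestamps reach the FIFOs.

```python
def emitted_timestamps(pulses, compensation):
    """Timestamps in program order after per-channel latency compensation.

    `pulses` is a list of (channel, nominal_time); `compensation` maps each
    channel to the delay subtracted from its events. Both are hypothetical.
    """
    return [t - compensation[ch] for ch, t in pulses]


def count_rewinds(timestamps):
    """Number of places where the timestamp fails to increase strictly."""
    return sum(1 for a, b in zip(timestamps, timestamps[1:]) if b <= a)


compensation = {"ttl0": 0, "ttl1": 40, "dds0": 120}

# Three pulses listed in one parallel block, all nominally at t = 1000.
program_order = [("ttl0", 1000), ("ttl1", 1000), ("dds0", 1000)]

as_listed = emitted_timestamps(program_order, compensation)
sorted_first = sorted(as_listed)  # what a software interleave step could emit

print(as_listed, count_rewinds(as_listed))        # [1000, 960, 880] 2
print(sorted_first, count_rewinds(sorted_first))  # [880, 960, 1000] 0
```

Listing the same three pulses in the opposite order would also give zero rewinds, which is exactly the kind of casual choice that changes FIFO usage.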
