Closed sbourdeauducq closed 7 years ago
One clarification on FIFO level and timestamp storage: If we give room for 4k RTIO channels for a large system (SAWG is 10+ each), 64 bits for the TS estimate and 16 bits for the level estimate, then that would use 12 RAMB36E1. That is not prohibitive. Such a compact storage for that data obviously depends on enumeration and DRTIO switch support throughout (#619). The enumeration and flattening of the address space would also solve the hierarchical channel addressing problem that will crop up again in any case.
But as Sébastien explains, efficient usage of the RTIO data FIFOs at all levels is driving this design.
And should be "... number of pending or in-flight timeline 'rewinds'". The total number is of course unlimited.
Thanks for posting this @sbourdeauducq and @jordens.
Context
Current situation
As I understand it:
Scalable solutions
AFAICT in the above comments, there are two proposed scalable solutions:
Only allocate space on the master for the DRTIO channels that are actually used.
SRTIO
Is that all about right so far, or have I misunderstood things?
We don't lose precise exceptions with SRTIO. Underflow errors are exactly the same as before. Sequence errors are usually not an error anymore; a decreasing timestamp in a channel is only an error when we run out of usable FIFOs to "rewind" the timeline (and that one error can be precise). Collision errors must be asynchronous, but they already are in ARTIQ-3 to accommodate DRTIO requirements.
Places stricter requirements on the ordering of DRTIO events for a slave.
How many in-flight timeline rewinds do you expect? If there are more SRTIO FIFOs than rewinds then you're fine.
@sbourdeauducq Thanks for the clarifications about precise exceptions.
How many in-flight timeline rewinds do you expect? If there are more SRTIO FIFOs than rewinds then you're fine.
@cjbe is the right person to ask about this.
From my understanding, there are a couple of situations where we rewind the timeline:
8 FIFOs might be cutting it a bit close, 16 or 32 would be better.
AFAICT, Proposal 1 ("Only allocate space on the master for the DRTIO channels that are actually used") seems to be a better fit to our use cases than SRTIO because:
while it is a non-trivial amount of work, it still sounds simpler than SRTIO. Is that correct?
Maybe a bit, but it is not trivial either to allocate addresses and distribute the memory among the multiple DRTIO master cores in a way that results in good performance (we may want to use multiple DRTIO links at the same time later, e.g. with a more powerful DMA core), meets timing in slow Xilinx silicon, and is not too ugly. SRTIO doesn't require memory and thus avoids this issue entirely.
@sbourdeauducq Thanks for the clarification.
I need to have more of a think about this and get back to you.
In the meantime, since this is a pretty major infrastructural change to ARTIQ, I'd be interested to hear opinions from other users @dhslichter @jboulder etc...
I need some time to think hard about this, and I would also recommend including others like @dleibrandt @amhankin @r-srinivas @dtcallcock in the thought process.
Several initial thoughts:
On the SRTIO idea in particular:
simple is better. To me, it seems that re-architecting the DRTIO into this SRTIO is potentially fraught with pitfalls
See my comment above, DRTIO with switches/many devices and without SRTIO isn't very nice either.
to what degree does this decision impact hardware design?
It does not, this is all gateware.
would it defeat the purpose if we reduce the FIFO depths, but not the number of FIFOs, and then allow them to run in this "shared FIFO pool" manner?
The core would not move to the next FIFO if the current one is full (only if you send a decreasing timestamp). The space in the SRTIO FIFOs cannot be combined arbitrarily.
I am in general very wary of things which can handicap full generality for pulse sequence generation.
The current architecture does not allow you to go back in time on the same channel, whereas SRTIO does.
I am OK with asynchronous reporting of collisions from slaves as long as it occurs at some regular intervals (i.e. doesn't necessarily wait for an entire kernel to complete running before the error is reported -- maybe once per ms or so?).
Asynchronous errors already exist in ARTIQ-3 and are reported rapidly, via the core device log; latency is variable but ~ms at most.
The core would not move to the next FIFO if the current one is full (only if you send a decreasing timestamp). The space in the SRTIO FIFOs cannot be combined arbitrarily.
This seems somewhat restrictive to me. I find that in practical use, a few of the RTIO channels handle the majority of the total events, and that it is necessary to increase the FIFO depths of those channels to be quite high. With the current non-scalable RTIO, it is easy to allocate long FIFOs to the channels that need them. But it sounds like with SRTIO, I would have to increase all of the FIFO depths to achieve the same effect, which may not be possible given the resource limitations. Would it be possible to relax this restriction?
But it sounds like with SRTIO, I would have to increase all of the FIFO depths to achieve the same effect, which may not be possible given the resource limitations. Would it be possible to relax this restriction?
I think so. See the "Alternatively, the gateware could look for any FIFO..." paragraph above. It may cost a few more cycles to submit events though, or at least enlarge the underflow window by a few cycles (with no throughput penalty).
Thanks @dhslichter and @dleibrandt for the feedback.
@sbourdeauducq The feeling here is that we're leaning towards the "Only allocate space on the master for the DRTIO channels that are actually used" rather than the "SRTIO" proposal because:
Maybe a bit, but it is not trivial either to allocate addresses and distribute the memory among the multiple DRTIO master cores in a way that results in good performance (we may want to use multiple DRTIO links at the same time later, e.g. with a more powerful DMA core), meets timing in slow Xilinx silicon, and is not too ugly. SRTIO doesn't require memory and thus avoids this issue entirely.
How bad are these issues?
Are you running into a lot of timing issues at the moment that SRTIO would help with?
I haven't thought too much about DDMA etc. If you think that SRTIO is the only clean way to implement it then that could be a strong argument, but I'd still like to explore other options given my previous comments.
My sense is that SRTIO as proposed, or some variant thereof, is probably where things will need to go in the long term, but that the near-term cost and delays and debugging from implementing it appear to me (and to the others who have commented above) to be problematic relative to the "hack" fix of DRTIO with "only allocate space on the master for the DRTIO channels that are actually used", as described by @hartytp above. So we'd like to aim for the latter, while continuing the design discussions for SRTIO so that we will be ready to implement something more scalable like that down the road a few years.
@dleibrandt I'd estimate around 8-16 SRTIO FIFOs. Making all of them large enough to handle the max backlog does not sound too restrictive if one considers that it saves a lot of unused memory from "current RTIO". Also, is enlarging the FIFOs just a workaround for a small sustained event rate? If yes, then avoiding (this) SRTIO design because it makes a workaround for another bottleneck harder to implement is a priority inversion. If that workaround is needed and ends up hard to do with SRTIO, one should work hard on increasing the sustained event rate (any or a combination of the options that are floating around) at the same time as SRTIO lands.
In general I expect problems with the current RTIO design in the near term, even just on Sayma without DRTIO. Ultimately nearly all RTIO channels will be connected to multiple large MUXes (monitoring, injection, analyzer, DMA, the CPU bus). I am already seeing timing and routing issues when building some configurations of phaser. That will only get worse when the number of 10 RTIO channels for each Sayma DAC channel goes to >25 and each of them also receives mon+inj+proper analyzer support (@hartytp as soon as somebody wants to be able to change just (any) two settings in parallel, there need to be that many RTIO channels. This is an issue that I have brought up already and it ended up in the SAWG design like this. You may discuss this with @jbqubit. But also notice how the exact same problem of having many RTIO channels to do parallel things occurs even at the leaf. SRTIO as a generic pattern would also help here).
I am all in favor of implementing that address space enumeration and flattening that we proposed above as a first step. It is at least somewhat orthogonal to SRTIO and can be tackled first. But @sbourdeauducq and I don't completely agree on the estimated complexity of that (let's call it "flat DRTIO") and the SRTIO design above.
@jordens ack. Sounds like we should keep thinking about how to implement SRTIO in a good way and hammer out a good initial design spec in the near term that will handle the various issues discussed above, even as hacks are applied to the current DRTIO to keep it going.
Also, is enlarging the FIFOs just a workaround for a small sustained event rate? If yes, then avoiding (this) SRTIO design because it makes a workaround for another bottleneck harder to implement is a priority inversion. If that workaround is needed and ends up hard to do with SRTIO, one should work hard on increasing the sustained event rate (any or a combination of the options that are floating around) at the same time as SRTIO lands.
Partially, yes, I had enlarged the FIFOs used for outputting the sideband cooling pulses to avoid underflow errors. Presumably, this will no longer be necessary in ARTIQ 3 when using DMA, and I agree that further improvements should be made to further increase the sustained event rate rather than letting this limitation drive similar design decisions.
I had also enlarged the FIFO connected to my PMT, so that I can do long detection pulses (with thousands of input events) for micromotion time correlation type measurements. This use case will not be addressed with a higher sustained (output) event rate, although there are probably other ways to address this.
In any case, if we can make all of the SRTIO FIFOs something like 1e4 events deep, then this is a non-issue.
@sbourdeauducq Maybe a bit, but it is not trivial either to allocate addresses and distribute the memory among the multiple DRTIO master cores in a way that results in good performance (we may want to use multiple DRTIO links at the same time later, e.g. with a more powerful DMA core), meets timing in slow Xilinx silicon, and is not too ugly. SRTIO doesn't require memory and thus avoids this issue entirely.
@jordens I am all in favor of implementing that address space enumeration and flattening that we proposed above as a first step. It is at least somewhat orthogonal to SRTIO and can be tackled first. But @sbourdeauducq and I don't completely agree on the estimated complexity of that (let's call it "flat DRTIO") and the SRTIO design above.
In general I expect problems with the current RTIO design in the near term, even just on Sayma without DRTIO.
Assuming:
If those assumptions are roughly valid, I'd prefer to go straight for SRTIO.
For SRTIO: let's assume we go for 16 FIFOs, each with a depth of ~1e4 events or more.
That seems to resolve the concerns about FIFO depth.
@sbourdeauducq @jordens can you confirm what the expected latency will be for: current DRTIO (limited by transceivers?) and SRTIO?
@cjbe Does having 16 FIFOs per DRTIO slave resolve your timeline unwinding concerns? If not, can you post some more details about your exact concerns so we can look for a solution?
Does anyone else see any other potential issues with SRTIO as outlined by @sbourdeauducq ?
if we can make all of the SRTIO FIFOs something like 1e4 events deep
This doesn't sound very good, with SRTIO all FIFO entries need to be able to hold the data of the PHY with the widest data, which is quite large (100's of bits) with the SAWG. With 16 SRTIO FIFOs we are well into the megabytes of BRAM with that approach.
We can do:
To get an idea of what amount of BRAM is reasonable: the FPGA that will support the SAWG is a KU040, with 21.1 megabits of BRAM in total.
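For a rough sense of scale, the memory budget can be sketched as below. The 512-bit event width is only a placeholder for "100's of bits" and is not a figure from this thread; the 16 FIFOs, the 1e3 and 1e4 depths, and the 21.1 megabit total are the numbers quoted in the discussion.

```python
# Back-of-the-envelope BRAM budget for a pool of SRTIO FIFOs.
KU040_BRAM_BITS = 21.1e6  # total KU040 BRAM, as quoted in the thread


def fifo_pool_bits(fifo_count, depth, event_width_bits):
    """Raw storage, in bits, for fifo_count FIFOs of the given depth."""
    return fifo_count * depth * event_width_bits


# ASSUMPTION: 512 bits per FIFO entry, standing in for "100's of bits".
EVENT_WIDTH_BITS = 512

for depth in (10_000, 1_000):
    bits = fifo_pool_bits(16, depth, EVENT_WIDTH_BITS)
    share = bits / KU040_BRAM_BITS
    print(f"depth {depth}: {bits / 8e6:.2f} MB, {share:.0%} of KU040 BRAM")
```

Under that assumed width, a depth of 1e4 overshoots the whole device several times over, while a depth of 1e3 fits but still takes a sizeable share. Real usage would also depend on how entries map onto BRAM primitives, which this sketch ignores.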
True, 1e3 events per FIFO is probably more realistic. @dleibrandt @cjbe Would that be enough for you?
If not, we should consider @sbourdeauducq's "more flexible distribution of events into the SRTIO FIFOs" proposal. I'd like to avoid this if possible, as it sounds like it will increase the cost/complexity of SRTIO. Also, depending on how it's implemented, it could increase the latency further, which we're very keen to avoid if possible.
RTIO data width: This is not all that relevant to SAWG. There it's probably fine to (and that is what splitting a wide event over multiple smaller ones does effectively) increase the minimum time between two events on a channel by a factor of four or so. The event width is most relevant to raw DACs/ADCs that need ~2048 bit per event (at 125 MHz RTIO coarse, 16 GS/s, 16 bit). They would want to be fed bursts at full speed (independent of whether that can be sustained long term by the DRTIO links or DRAM). But splitting RTIO events becomes irritatingly inefficient when we hit a size of about 256 bit. Around that size we'd want/need to look into compressing the timestamp as well.
Consensus of internal discussions about this:
@sbourdeauducq:
I didn't say it wouldn't impact latency, I said it wouldn't impact throughput. The basic idea is to pipeline the FIFO selection system, and increase the underflow margin so that the event has time to make it through that pipeline. The pipeline increases latency by the number of stages it has, but you can submit a new event before the previous one has made it through. The pipeline system would need some heuristics, since the selection of a FIFO for one event may impact the decision for the very next event (e.g. if the FIFO becomes full as a result of the writing of the first event). All those short dependencies would have to be eliminated.
Okay, thanks for clarifying that. Let's not do this, at least for now. With a 1k FIFO depth, this shouldn't be necessary.
The major source of additional latency in SRTIO is the output/sorter stage after the FIFOs (a few RTIO clock cycles as I mentioned in the first post). DRTIO latencies are not significantly affected by SRTIO, and are dominated by the transceivers.
great.
Funded by Oxford, for local RTIO and DRTIO, DRTIO trees and Kasli Repeater and Master modes, outputs only. @sbourdeauducq Let's condense (or copy) the final design into a wiki page (or code documentation).
Let me just write that code first, I need to do some experimentation e.g. with the FIFO selector, then it can be documented.
Splendid! This will help Sayma RTIO and DSP modulation.
I have an idea to spread the events among the FIFOs without requiring complex, slow, large or high-latency logic. Do what @jordens proposes (switch to the next FIFO if the timestamp is not strictly increasing), but also switch to the next FIFO if the current one had been full.
It makes sequence errors more obscure though, as the error condition isn't just "the timeline has been rewound too many times in too little time" anymore.
The "spread" feature could easily be made optional and enabled/disabled at gateware compile time.
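A minimal behavioural sketch of that selection rule, as I read it. The function name, its arguments and the `spread` switch (standing in for the compile-time option) are all hypothetical, and "had been full" is modelled as a flag registered from the previous write rather than a live check.

```python
def select_fifo(current, fifo_count, timestamp, last_timestamp,
                current_was_full, spread=True):
    """Pick the FIFO index for the next event.

    ASSUMPTION: last_timestamp is the timestamp of the previously
    submitted event, and current_was_full is a registered flag saying
    the current FIFO was full the last time it was looked at.
    """
    rewind = timestamp <= last_timestamp  # not strictly increasing
    if rewind or (spread and current_was_full):
        return (current + 1) % fifo_count  # move on, wrapping around
    return current


print(select_fifo(0, 8, timestamp=100, last_timestamp=90,
                  current_was_full=False))
print(select_fifo(0, 8, timestamp=80, last_timestamp=90,
                  current_was_full=False))
print(select_fifo(7, 8, timestamp=100, last_timestamp=90,
                  current_was_full=True))
```

With `spread=False` only a rewind moves to the next FIFO, which keeps the simpler sequence error condition.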
@sbourdeauducq I'd be fine with that.
Key components are there (in the rtio-sed branch) with simulations/unit tests; needs gluing, adding a few relatively simple things like FIFOs and timestamp counter, and testing on the board.
I think this sounds fine, provided that the number of FIFOs is perhaps 2-3x a "typical" number of timeline rewinds. I know this is a hard number to pin down, but maybe for now a reasonable first pass would be roughly 8 TTLs per FIFO? Will things need to be substantially more aggressive to achieve the resource efficiencies we desire with SRTIO?
8 TTLs per FIFO?
I should perhaps say 8 RTIO channels (not necessarily TTLs) per FIFO. Basically this would be an 8x reduction in the number of FIFOs relative to the current design. They would potentially need to be made deeper, but not 8x as deep.
This is just a straw man, not something I have calculated out in great detail. But I think the point is that we want to have enough FIFOs that using the arbitration scheme discussed above for stuffing events into FIFOs, we are very unlikely to end up blocking while we wait for FIFOs to clear. Since it's hard to say how many timeline rewinds will be in a given experiment in general, I think that it's perhaps easier to guess about what fraction of RTIO channels will experience timeline rewinds, and apportion the number of FIFOs in relation to the number of RTIO channels accordingly.
The general problem of figuring out how the system will behave under arbitrary experiments is very hard, and if you have an experiment which rewinds the timeline several times you can quickly run out of FIFOs. That's why I suggested this ratio as a (hopefully) somewhat conservative number of FIFOs to use. We have discussed things like 16 or 32 FIFOs above, but that's not really meaningful unless you put it in the context of how many RTIO channels those FIFOs are serving. Assuming they are for RTIO channel counts of the sort currently implemented, the 8 channels/FIFO ratio seems generally in the right ballpark.
In the Oxford use case, from my understanding, it is possible to have nested timeline rewinds, such that each subsequent rewind goes back farther than the previous one -- this is a natural construction for many quantum information experiments, but ends up being very FIFO-intensive because each rewind will need to go to a new FIFO.
Local SED now works on the board, except that DMA systematically produces an underflow for some reason. And I want to look into reducing the output latency a bit.
The SED latency at the FIFO output is:
This makes a total of 9 cycles.
With traditional RTIO, there is only 1 cycle of latency at the FIFO output for comparing timestamps.
The latency at the FIFO input is reduced by 1 cycle, as the buffer/guard time mechanism for replacements is no longer necessary. So in total, SED increases the latency by 7 RTIO clock cycles.
@sbourdeauducq this is nice. However, to continue the discussion from #40, if there is automatic latency compensation, you will get very different FIFO usage depending on the order in which pulses in a `with parallel` block are given (as the latency compensation might render them in order or out of order, after being applied). It seems that it would make sense, to the extent possible, to bring the `with interleave` up to speed to help with this. I am afraid that very simple, casual choices (the ordering in which events in a `with parallel` are listed) have the potential to create RTIO errors due to FIFOs filling and blocking. I know that your example in #40 shows that for just a few pulses, the FIFOs are able to handle things, but remember that if the number of FIFOs is significantly less than the number of RTIO channels, and if the experiment is complex (many RTIO channels on one device all firing at roughly the same time, especially if you are doing a DMA of something like sideband cooling for multiple ion zones in parallel, say), you may hit these sorts of limitations early.
My point with all of this is that the simple latency compensation in #40 with negative delays does not necessarily play nicely with SED, and I would advocate that the best way to handle things would be to modify the way the latency compensation is handled (e.g. to have it done in software before compile time, to the extent possible, with features like `with interleave`) so that we avoid emitting out-of-order timestamps from the kernel when at all possible.
The current RTIO system uses one dedicated FIFO per output channel. While this architecture is fine for the first ARTIQ systems that were rather small and simple and even for single-crate Metlino/Sayma systems, it shows limitations on more complex ones. By decreasing importance:
This proposal addresses those issues:
Drawbacks are slightly increased complexity, and latency increased by a few coarse RTIO cycles by the output network (6 cycles, i.e. typically 48ns, for 8 FIFOs - increases with the square of the logarithm of the FIFO count).
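The "square of the logarithm" scaling can be checked against the standard depth of Batcher's odd-even merge sort, which for n = 2^k inputs is k(k+1)/2 comparator stages. The sketch below assumes one coarse RTIO cycle per stage and an 8 ns coarse period (125 MHz); both are assumptions, chosen because they reproduce the 6 cycles and 48 ns quoted for 8 FIFOs.

```python
def merge_sort_network_depth(fifo_count):
    """Comparator stages in Batcher's odd-even merge sort network.

    Only defined here for power-of-two input counts.
    """
    k = fifo_count.bit_length() - 1
    if fifo_count < 1 or (1 << k) != fifo_count:
        raise ValueError("fifo_count must be a power of two")
    return k * (k + 1) // 2


# ASSUMPTION: one stage per coarse cycle, 8 ns per cycle (125 MHz).
COARSE_PERIOD_NS = 8

for n in (8, 16, 32):
    depth = merge_sort_network_depth(n)
    print(f"{n} FIFOs: {depth} cycles, {depth * COARSE_PERIOD_NS} ns")
```

Under those assumptions, going from 8 to 32 FIFOs takes the network from 6 to 15 cycles.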
The proposed SRTIO core contains a configurable number of FIFOs that hold the usual information about RTIO events (timestamp, address, data), the channel number, and a sequence number. The sequence number is increased for each event submitted.
When an event is submitted, it is written into the current FIFO if its timestamp is strictly increasing. Otherwise, the current FIFO number is incremented by one (and wraps around, if the current FIFO was the last) and the event is written there, unless that FIFO already contains an event with a greater timestamp. In that case, an asynchronous error is reported. If the destination FIFO is full, the submitter is blocked.
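A small behavioural model of this submission rule may make it easier to follow. It is only a sketch: the class and method names are made up, "strictly increasing" is taken to mean relative to the previously submitted event, the asynchronous error is simplified to an exception, and blocking is represented by returning `None`.

```python
from collections import deque


class SequenceError(Exception):
    """Stands in for the asynchronous error (simplification)."""


class FifoSelector:
    def __init__(self, fifo_count, fifo_depth):
        self.fifos = [deque() for _ in range(fifo_count)]
        self.fifo_depth = fifo_depth
        self.current = 0            # index of the current FIFO
        self.last_timestamp = None  # timestamp of the last accepted event
        self.seqn = 0               # incremented for each accepted event

    def submit(self, timestamp, channel, data):
        """Return the FIFO index used, or None if the submitter must block."""
        target = self.current
        rewind = (self.last_timestamp is not None
                  and timestamp <= self.last_timestamp)
        if rewind:
            # Not strictly increasing: move to the next FIFO, wrapping around.
            target = (self.current + 1) % len(self.fifos)
            fifo = self.fifos[target]
            # Timestamps within a FIFO never decrease in this model,
            # so the newest entry holds the largest one.
            if fifo and fifo[-1][0] > timestamp:
                raise SequenceError("next FIFO holds a greater timestamp")
        if len(self.fifos[target]) >= self.fifo_depth:
            return None  # destination FIFO full: block, state unchanged
        self.fifos[target].append((timestamp, channel, data, self.seqn))
        self.seqn += 1
        self.current = target
        self.last_timestamp = timestamp
        return target


sel = FifoSelector(fifo_count=2, fifo_depth=4)
used = [sel.submit(t, 0, None) for t in (10, 20, 15, 30)]
print(used)  # the rewind from 20 to 15 moves to the second FIFO
```

With only two FIFOs, a further rewind to a timestamp below 20 would land on the first FIFO again, which still holds the event at 20, and would be reported as an error.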
Alternatively, the gateware could look for any FIFO that can accommodate the event (is not full, and does not contain an event with a greater timestamp) and block the submitter until it succeeds. This allows for greater FIFO utilization than blocking on the current FIFO when it is full. However, the gateware has very little time to find a usable FIFO, to avoid undermining the performance of the submitter. This can cause timing problems if the number of FIFOs is large.
At the output of the FIFOs, the events are distributed to the channels and simultaneous events on the same channel are handled using a structure similar to an odd-even merge-sort network that sorts by channel and sequence number. When there are simultaneous events on the same channel, the event with the highest sequence number is kept and a flag is raised to indicate that a replacement occurred on that channel. If a replacement was made on a channel that has replacements disabled, the final event is dropped and a collision error is reported asynchronously.
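The replacement and collision behaviour can be illustrated with a software model. This is not the sorting network itself, just a plain sort that gives the same outcome for one timestamp; the function name and the `replace_enabled` mapping are hypothetical, and sequence number wrap-around is ignored.

```python
def resolve_simultaneous(events, replace_enabled):
    """Resolve events that share one timestamp.

    events: list of (channel, seqn, data) tuples, one per FIFO output.
    replace_enabled: dict mapping channel -> bool (hypothetical config).
    Returns (outputs, replaced, collisions).
    """
    outputs = {}
    replaced = set()
    # Sort by channel, then sequence number, so that the last event seen
    # for a channel is the one with the highest sequence number.
    for channel, seqn, data in sorted(events, key=lambda e: (e[0], e[1])):
        if channel in outputs:
            replaced.add(channel)  # replacement flag for this channel
        outputs[channel] = (seqn, data)
    collisions = set()
    for channel in replaced:
        if not replace_enabled.get(channel, True):
            del outputs[channel]     # final event is dropped
            collisions.add(channel)  # reported asynchronously in gateware
    return outputs, replaced, collisions


events = [(3, 7, "x"), (1, 5, "a"), (3, 9, "y"), (2, 6, "b"), (2, 8, "c")]
outputs, replaced, collisions = resolve_simultaneous(
    events, replace_enabled={1: True, 2: False, 3: True})
print(outputs, replaced, collisions)
```

Here channel 3 keeps the event with sequence number 9, while channel 2 has replacements disabled, so its event is dropped and a collision is recorded.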
Underflow errors are detected as before by comparing the event timestamp with the current value of the counter, and dropping events that do not have enough time to make it through the system.
The sequence number should be sized to be able to represent the combined capacity of all FIFOs, plus 2 bits that allow the detection of wrap-arounds.
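As a worked example of that sizing rule, under one possible reading: enough bits to number every event the FIFOs can hold at once, plus the 2 extra bits. Whether the base width should cover `capacity` or `capacity - 1` is not settled by the text, so treat the choice below as an assumption.

```python
def seqn_width(fifo_count, fifo_depth):
    """Sequence number width in bits.

    ASSUMPTION: the base width numbers `capacity` distinct in-flight
    events (values 0 .. capacity - 1); 2 bits are added for
    wrap-around detection.
    """
    capacity = fifo_count * fifo_depth
    return (capacity - 1).bit_length() + 2


print(seqn_width(8, 128))    # 1024 events
print(seqn_width(16, 1000))  # 16000 events
```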
The maximum number of simultaneous events (on different channels), and the maximum number of timeline "rewinds", are equal to the number of FIFOs.
The SRTIO logic should support both synchronous and asynchronous FIFOs, which are used respectively for local RTIO and DRTIO.
To implement flow control in DRTIO, the master queries the satellite for tokens. The satellite can use as a token count the space available in its FIFO that has the least such availability.
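The token count described here amounts to taking the minimum free space across the satellite's FIFOs. A one-function sketch, with hypothetical names and a single common depth assumed for all FIFOs:

```python
def token_count(fifo_levels, fifo_depth):
    """Tokens a satellite could grant: free space in its fullest FIFO.

    fifo_levels: current number of events in each FIFO.
    ASSUMPTION: all FIFOs share the same depth.
    """
    return min(fifo_depth - level for level in fifo_levels)


print(token_count([3, 120, 64], fifo_depth=128))
```

This is conservative: the master is granted only as many events as would fit even if every one of them landed in the fullest FIFO.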