streaming samples from memory to DAC output

dhslichter commented 8 years ago

For some applications, the spline interpolator/CORDIC architecture is not sufficient to generate DAC output waveforms of suitable complexity or speed. For example, pulses with shaped edges of short duration (edge times <10 ns), or pulses containing more than two frequency components, cannot be constructed by the presently planned gateware.

It would extend the power of the design considerably if it were possible to precompute DAC output waveforms to be stored in RAM on the Sayma card, which could then be played back at specified times. Such capability would be of use to solid-state quantum information systems, as well as for near-field microwave gates in trapped ions. It would also increase applicability of the Sayma hardware to the wireless test/radar/SDR world, as this would allow realistic data transmission tests etc.

Ideally, I would envision a hybrid mode of operation where the data pipeline to the DACs could be fed either from the interpolator/CORDIC cores or from a direct sample stream from RAM, with a switch that allows changing between these two sources on the fly at specified times. Basically if you have a FIFO for the JESD204B output, you would have a switch that chooses the data source for feeding this FIFO, either from a memory source or from the interpolator/CORDIC core. The memory source could have a FIFO before this switch to allow for latencies etc in the DMA. The interpolator/CORDIC engines run all the time, and the samples they generate are either passed through into the JESD FIFO (if they are switched in) or simply dropped (if the memory is switched in).

If this is too complex, the switch between data sources need not occur with precise timing (some "small" nondeterministic jitter would be acceptable), provided there is a mechanism to play a waveform from memory and then hold its last value after the waveform stops, until the specified time at which the next waveform is played.
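
The source switch and the hold-last-value fallback can be illustrated with a small behavioral model. This is only a sketch in Python, not gateware; the function name `select_samples`, its arguments, and the use of `None` to mark the end of a stored waveform are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Optional


def select_samples(cordic: Iterable[int],
                   memory: Iterable[Optional[int]],
                   use_memory: Iterable[bool]) -> Iterator[int]:
    """Behavioral model of a data-source switch feeding the JESD FIFO.

    cordic:     samples from the interpolator/CORDIC core (always running)
    memory:     samples read from RAM; None means the waveform has ended
    use_memory: switch position per sample (True selects the memory source)
    """
    held = 0  # last value played from memory, held after the waveform stops
    for c, m, sel in zip(cordic, memory, use_memory):
        if sel:
            if m is not None:
                held = m
            yield held  # the CORDIC sample c is simply dropped
        else:
            yield c


# Hypothetical example: switch to memory for three samples, then back.
out = list(select_samples(
    cordic=[1, 2, 3, 4, 5],
    memory=[10, 20, None, None, 30],
    use_memory=[False, True, True, True, False],
))
```

In this example the memory waveform ends after one selected sample, so its last value is held until the switch returns to the interpolator/CORDIC source.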

Another feature of great utility would be to enable chaining or looping of pre-recorded waveforms from memory into the DAC data pipelines. This is the typical behavior of commercial hardware AWGs, e.g. from Tektronix.

To be worthwhile, the streaming should be able to occur at up to 1 GSPS (16 Gbps) on at least 2 of the 8 output channels (more would be better if possible, e.g. 4 channels, which would require 64 Gbps, thus saturating the rate for the current memory controller on the KC705). One could potentially back off the sample rate slightly (e.g. to 800 MSPS) to give headroom if need be. Another compromise would be to stream samples of reduced bit depth to more channels (4 channels @ 1 GSPS, but with 12-bit samples and the 4 LSBs zeroed for each channel), if this eases memory bandwidth requirements (although it may not, depending on how samples are stored in memory).
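
The bandwidth figures in this paragraph follow from channels x sample rate x stored bits per sample. A minimal Python sketch of that arithmetic is given here; the helper name `stream_rate_gbps` is hypothetical, and the 12-bit case assumes samples are packed tightly in memory, which may not be how they would actually be stored.

```python
def stream_rate_gbps(channels: int, sample_rate_gsps: float,
                     stored_bits: int) -> float:
    """Memory read bandwidth (Gbps) needed to feed `channels` DAC channels.

    stored_bits is the number of bits each sample occupies in memory,
    which can differ from the DAC resolution.
    """
    return channels * sample_rate_gsps * stored_bits


one_channel = stream_rate_gbps(1, 1.0, 16)        # 1 GSPS, 16-bit samples
two_channels = stream_rate_gbps(2, 1.0, 16)
four_channels = stream_rate_gbps(4, 1.0, 16)
four_at_800 = stream_rate_gbps(4, 0.8, 16)        # backed off to 800 MSPS

# 12-bit samples only save bandwidth if they are packed in memory;
# if each one is padded to a 16-bit word, the rate is unchanged.
four_packed_12 = stream_rate_gbps(4, 1.0, 12)
four_padded_12 = stream_rate_gbps(4, 1.0, 16)
```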

Waveforms could be precomputed on a computer and downloaded to RAM ahead of time. Alternatively, waveforms could be precomputed at lower-than-realtime speed by the interpolator/CORDIC cores on the Sayma card, stored in memory, and played back at full speed on demand. For example, for a 1 GSPS waveform with sample points computed at 200 MHz in the gateware, playback would occur at 5x the speed of generation, and each time step in the interpolator/CORDIC would correspond to 1 ns instead of 5 ns. This method relies on downtime for the interpolator/CORDIC cores, which may not be available, but it could ease some of the communications bandwidth requirements for transferring waveforms from the Metlino or PC, and give more autonomous/distributed operation. Waveforms could also be precomputed by an accessory hard CPU on the FMC connector of the Sayma card.

jordens commented 8 years ago

It would be great if you could first determine the requirements and not start with prescribing an implementation. Your long text makes way too many assumptions and implications about how we would implement this, about what's easy and what's impossible. It's really hard for me to comment on this in any sensible way as we'd have to unwind everything, determine the actual need, and then start again. We also think that the implementation that we laid out a long time ago (in response to a rough set of requirements) is simpler, more generic, and more flexible than what you appear to be implying here.

dhslichter commented 8 years ago

Here are the requirements and motivation:

need to create shaped pulses with time-dependent complex amplitudes for microwave qubit manipulation.
for superconducting qubits and microwave gate ion traps, total pulse durations for relevant operations (e.g. pi/2 pulse) can be as brief as 10-20 ns.
to reduce leakage to neighboring qubit levels, the amplitude/phase (or I/Q) envelope of these pulses needs to be shaped with ~1 ns resolution. Rise/fall times can be as short as ~3-4 ns.
for short (10-20 ns) pulses, one typically need only specify an I/Q envelope (i.e. zero IF). For longer pulses (50-250 ns), pulses may be frequency-shifted (nonzero IF) in addition to having an I/Q envelope. For some multiqubit gates, or for multiplexed readout of superconducting qubits, simultaneous shaped pulses at two or more frequencies may be required on the same microwave line.
multiple different pulse shapes/durations and frequencies are needed to realize a full desired set of gate operations. It must be possible to emit different pulse shapes and/or frequencies with minimal (~10-20 ns) time lag between the end of one pulse and the start of the next pulse.
for randomized benchmarking applications, it must be possible to emit predefined sequences of hundreds or thousands of such pulses in succession while maintaining short pulse times (~10-20 ns) and short delay times between pulses (~10-20 ns).
the time at which the DAC emits the first sample of any pulse must be deterministic. It is acceptable to have a repeatable (from reboot to reboot), deterministic time offset between a "pulse start" time specified by the user in the ARTIQ kernel and the time at which the first sample of the pulse is emitted by the DAC.
carrier frequencies are typically between ~1-10 GHz for these pulses.
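
To make the pulse requirements concrete, here is a rough Python sketch that precomputes one shaped pulse as 16-bit I/Q samples at 1 ns resolution. The raised-cosine edge shape, the function name `shaped_pulse`, and all parameter names are assumptions chosen for illustration; the requirements above do not prescribe a particular envelope.

```python
import math


def shaped_pulse(duration_ns: int, edge_ns: int, if_mhz: float = 0.0,
                 amplitude: float = 1.0, sample_rate_gsps: float = 1.0):
    """Return (I, Q) lists of signed 16-bit samples for one pulse.

    The envelope has raised-cosine rise and fall edges of edge_ns each and
    a flat top in between. if_mhz rotates the envelope at an intermediate
    frequency; 0 gives a pure zero-IF envelope.
    """
    n = int(round(duration_ns * sample_rate_gsps))
    n_edge = int(round(edge_ns * sample_rate_gsps))
    full_scale = 2**15 - 1
    i_samples, q_samples = [], []
    for k in range(n):
        d = min(k, n - 1 - k)  # distance in samples to the nearer pulse end
        if d < n_edge:
            env = 0.5 * (1.0 - math.cos(math.pi * (d + 1) / (n_edge + 1)))
        else:
            env = 1.0
        # phase advance per sample for the intermediate frequency
        phase = 2.0 * math.pi * if_mhz * 1e6 * k / (sample_rate_gsps * 1e9)
        scale = full_scale * amplitude * env
        i_samples.append(int(round(scale * math.cos(phase))))
        q_samples.append(int(round(scale * math.sin(phase))))
    return i_samples, q_samples


# A 20 ns zero-IF pulse with 4 ns edges, and a 100 ns pulse at 50 MHz IF.
i_short, q_short = shaped_pulse(20, 4)
i_long, q_long = shaped_pulse(100, 4, if_mhz=50.0)
```

At 1 GSPS a 20 ns pulse is only 20 samples per quadrature, which is why short gates are a natural fit for sample playback rather than spline parametrization.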

I appreciate the capability of the current Sayma design, which is substantial, but would it be capable of meeting these requirements?

jordens commented 8 years ago

Thanks. That's a good specification. We have been keeping in mind such a sample-based RTIO channel and its rough design while we are developing (D)DMA, DRTIO, and the Sayma waveform parametrization. In that sense, yes: the current Sayma design supports the development of such a feature. We are inviting interested parties to verify/modify/amend the requirements and fund this development.

In short, we'd just implement an additional RTIO output channel that takes N (let's say 8) samples per RTIO clock cycle (let's say 8 ns) and adds them onto the DAC samples, just like the DC spline is added to the oscillators, either before or after the GHz DUC (digital up-converter).
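
As a rough behavioral illustration of such a channel (not the actual gateware), the following Python sketch adds one RTIO cycle's worth of streamed samples onto the samples already produced for the DAC. The function name `add_streamed_samples` is hypothetical, and saturating the sum at the 16-bit limits is an assumption; the thread does not say how overflow would be handled.

```python
from typing import List, Sequence

SAMPLES_PER_CYCLE = 8  # N = 8 samples per 8 ns RTIO cycle, i.e. 1 GSPS


def add_streamed_samples(dac_samples: Sequence[int],
                         streamed: Sequence[int],
                         bits: int = 16) -> List[int]:
    """Add streamed samples onto the DAC samples for one RTIO cycle.

    The sum is clipped to the signed range of `bits` (an assumed
    overflow policy, chosen here only to keep the model well defined).
    """
    if len(dac_samples) != len(streamed):
        raise ValueError("both inputs must hold one cycle of samples")
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [max(lo, min(hi, a + b)) for a, b in zip(dac_samples, streamed)]


# One cycle: oscillator output plus a streamed envelope segment.
oscillator = [1000, 2000, 3000, 4000, 30000, -30000, 0, 0]
envelope = [10, 20, 30, 40, 5000, -5000, 0, 7]
cycle_out = add_streamed_samples(oscillator, envelope)
```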

The current Sayma design includes a wide DDR (@gkasprow, how wide?) to maximize the sustained data rate.
Repetition/looping of DMA segments is something that should be kept in mind for the DMA implementation.
The double-indirection that you refer to is a bit trickier but sounds doable: a DMA segment (the RB list) of references to other DMA segments (the individual pulses).
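
The double indirection and the looping can be sketched in Python as a list of references that is expanded into samples. This is only a conceptual model: the names `expand_sequence`, `pulse_segments`, and `rb_list`, and the pulse labels, are hypothetical and say nothing about the eventual DMA data format.

```python
from typing import Dict, Iterator, List, Sequence


def expand_sequence(pulse_segments: Dict[str, List[int]],
                    rb_list: Sequence[str],
                    repeat: int = 1) -> Iterator[int]:
    """Yield samples by following references to stored pulse segments.

    pulse_segments: the individual pulses, each stored once
    rb_list:        a segment of references to those pulses
    repeat:         number of times the whole reference list is looped
    """
    for _ in range(repeat):
        for ref in rb_list:
            yield from pulse_segments[ref]


# Two stored pulses (toy sample values) and a short sequence built from them.
pulses = {"X90": [1, 2, 1], "Y90": [3, 4, 3]}
sequence = ["X90", "Y90", "X90"]
samples = list(expand_sequence(pulses, sequence, repeat=2))
```

Each pulse is stored once, so a sequence of thousands of gates costs only its reference list plus the handful of distinct pulse shapes.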

gkasprow commented 8 years ago

We use 64-bit DDR3 running at 800 MHz.

dhslichter commented 8 years ago

@gkasprow by 800 MHz, do you mean the SDR clock or the DDR clock? I assume that you mean the SDR clock, so the total maximum theoretical data rate would be 64 bits * 1.6 Gbps = 102.4 Gbps?

gkasprow commented 8 years ago

800 MHz clock rate, so the data rate is 1.6 Gbit/s * 64 = 102.4 Gbit/s.

gkasprow commented 8 years ago

yes

dhslichter commented 8 years ago

In this case, one should be able to stream 16-bit samples at 1 GSPS to at least 4, and perhaps 6, DAC channels. This is sufficient for most purposes I would imagine; the remaining 2 (or 4) channels can be just the usual spline/CORDIC generators. I like the idea of having the sample streamed from memory simply added to the samples generated by the interpolators, since it covers all the use cases I can think of in a simple and flexible way.
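
The channel count can be checked against the theoretical peak of the 64-bit DDR3 interface quoted earlier in the thread. The short Python calculation below uses only those figures; the variable names are illustrative, and the result is an upper bound that ignores all overhead.

```python
bus_width_bits = 64
clock_mhz = 800
transfers_per_clock = 2  # DDR: data on both clock edges

# Theoretical peak data rate of the memory interface, in Gbps.
peak_gbps = bus_width_bits * clock_mhz * transfers_per_clock / 1000

# One DAC channel of 16-bit samples at 1 GSPS needs 16 Gbps.
per_channel_gbps = 16 * 1.0

# Largest whole number of channels that fits under the peak rate.
max_channels = int(peak_gbps // per_channel_gbps)

# Fraction of the peak rate consumed by 4 and by 6 streamed channels.
utilization = {n: n * per_channel_gbps / peak_gbps for n in (4, 6)}
```

Four channels use well under the peak rate, while six leave very little margin, which matches the "at least 4, and perhaps 6" estimate.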

The repetition and looping (as well as the lookup table) seem to me to be fairly key components of a successful implementation, so I am glad you think they would be doable (if tricky).

jordens commented 8 years ago

Potentially. But don't be fooled by calculating too tightly:

There will be some overhead in the DMA data format.
Jumping around in DRAM (for e.g. the RB table) can come at a hefty cost in throughput and latency.
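
A toy Python model shows why these two costs matter for short segments. Everything numeric here apart from the 102.4 Gbps peak is a placeholder: the 50 ns jump penalty and the overhead fraction are assumed values for illustration, not measured figures for this memory controller, and `effective_gbps` is a hypothetical helper.

```python
def effective_gbps(peak_gbps: float, segment_bytes: int,
                   jump_penalty_ns: float,
                   format_overhead: float = 0.0) -> float:
    """Payload throughput when every segment read pays a fixed jump cost.

    format_overhead is the fraction of extra bits the DMA format adds
    on top of the payload (0.1 means 10% more bits are transferred).
    """
    payload_bits = 8 * segment_bytes
    # 1 Gbps is 1 bit per ns, so bits divided by Gbps gives ns.
    transfer_ns = payload_bits * (1.0 + format_overhead) / peak_gbps
    return payload_bits / (transfer_ns + jump_penalty_ns)


PEAK_GBPS = 102.4        # theoretical peak of 64-bit DDR3 at 800 MHz
JUMP_PENALTY_NS = 50.0   # placeholder value, not a measured latency

# One 20-sample pulse (40 bytes) per jump versus a 4000-byte string of pulses.
short_pulse = effective_gbps(PEAK_GBPS, 40, JUMP_PENALTY_NS)
long_string = effective_gbps(PEAK_GBPS, 4000, JUMP_PENALTY_NS)
```

Under these assumed numbers, jumping once per short pulse cannot sustain even one 16 Gbps channel, while long contiguous segments stay close to the peak rate.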

gkasprow commented 8 years ago

I have one question. Would it be more suitable for this purpose to have 2 banks of 32-bit SDRAM or a single bank of 64-bit memory? Two banks give additional flexibility: one can use the memory controllers for different purposes. The drawbacks are a more difficult layout and more FPGA pins used. It also uses many more logic resources.

dhslichter commented 8 years ago

I think 4 channels is a sensible number, and still provides a good enough price point per channel. I realize that 6 channels would be pushing it (thus my modifier "perhaps"), given how close it is to the absolute maximum theoretical bandwidth with no consideration of the other issues you mention.

It seems to me that if we have two banks of 32 bit SDRAM, each one will only be able to service 2 (or perhaps 3) channels, so you would end up having to use both for 4 channels, and then you are back to where you were with the 64 bit SDRAM where the streaming of DAC samples is competing with other tasks for use of the memory. Based on that, I say stick with the single 64 bit bank, especially if this simplifies layout and decreases FPGA resource usage. Others may feel otherwise!

A few comments:

It seems that we will probably want a decent-size FIFO on the FPGA to buffer the sample data being read from SDRAM, to smooth over the hiccups in memory reading.
To combat the throughput/latency costs of jumping around in DRAM, I would suggest that tasks such as randomized benchmarking could be performed by saving longer strings of pulses in SDRAM so that the jumping is less costly as a fraction of the total time. Most people run in this manner anyway rather than jumping around on the fly to construct RB sequences.
Another thought: the RB sequences would likely be composed of gates that are so short (~20 samples each) that one could put the individual pulses into block RAM and then have the data sent to the DAC pipeline by jumping around in there. This would mitigate the SDRAM read/latency issues, and additionally would free the memory controller for other tasks once the pulses have been downloaded to block RAM. I realize this is not as simple to implement. It would involve some kind of understanding of how much total "original" pulse data there is, out of which the sequence will be constructed by jumping/looping. If this amount is sufficiently small, then one would preload into block RAM and jump from there; otherwise it would be streamed from SDRAM. Thus large payloads (where one would expect, barring pathological cases, that the individual pulse sequences are relatively long and thus the jumping overhead would be fractionally lower) could come from SDRAM, but small payloads with lots of jumps could live in block RAM and not deal with the latencies and/or hog the memory controller.
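
The comments above amount to two small sizing decisions, sketched here in Python. Both helpers (`fifo_depth_samples`, `choose_source`) and all numbers in the example are hypothetical; in particular the block RAM capacity and the stall time are placeholders, not properties of the Sayma FPGA.

```python
import math
from typing import Sequence


def fifo_depth_samples(max_stall_ns: float,
                       sample_rate_gsps: float = 1.0) -> int:
    """Samples a FIFO must hold to ride out one SDRAM read stall."""
    return math.ceil(max_stall_ns * sample_rate_gsps)


def choose_source(pulse_lengths: Sequence[int],
                  bram_capacity_samples: int) -> str:
    """Pick where the "original" pulse data should live.

    Small payloads with many jumps are preloaded into block RAM;
    anything larger is streamed from SDRAM.
    """
    total = sum(pulse_lengths)
    return "bram" if total <= bram_capacity_samples else "sdram"


# 24 distinct gates of ~20 samples each, against an assumed 4096-sample budget.
rb_gates = choose_source([20] * 24, bram_capacity_samples=4096)
long_waveform = choose_source([100_000], bram_capacity_samples=4096)
depth = fifo_depth_samples(max_stall_ns=200.0)  # assumed worst-case stall
```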

sbourdeauducq commented 8 years ago

The extra logic resources used by an additional bank shouldn't be very high (~kLUT at most) but it does make the layout more complex and uses IO. Is it possible at all to have more than 64 bits (e.g. 128), or will we run out of HP IO pins and/or exceed the maximum fanout for the command/address bus?

gkasprow commented 8 years ago

I will answer this in a few days once I finish the schematics and connect all mezzanines.

dhslichter commented 8 years ago

If 128 bits is too wide, could one do a 64-bit SDRAM (which could optionally be dedicated for pulse streaming if desired) and a 32-bit SDRAM?

jbqubit commented 8 years ago

@jordens @sbourdeauducq Is this specified tightly enough for m-labs to generate a provisional specification and a cost estimate?

jordens commented 8 years ago

Yes. Is there interest and funding? We'd like to not work on this right now and delay writing the specification and the quote by a week or so.

dhslichter commented 8 years ago

I think the main issue for the present time has been addressed, namely how wide the SDRAM bus should be. If we have agreed on a 64-bit SDRAM plus a 32-bit SDRAM, this should suffice for allowing the kind of streaming discussed above while keeping a separate memory bus available for the soft processor etc. I agree with @jordens that it would be good to delay writing the spec at this stage unless there is funding ready to go with a deadline.
