jordens commented 3 years ago

With Stabilizer we have a powerful CPU sitting between ADCs, DACs and DDS channels. Since this is extremely generic it allows dozens of different use cases to be implemented. Implementing all or even many of them is tricky as there are lots of interdependencies, constraints and corner cases. The interactions can also lead to bottlenecks since the use cases compete for the same resources (CPU time, latency). We need to come up with a clear set of high-value use cases and their requirements in terms of features to be implemented. Then we can decide whether features can be made available at compile-time or run-time and how that affects usability, testing and deployment. I'll try to start with the use cases here and then, once the features become clearer, fork them into their own issues.

Use cases

IIR control: D-part, I^2 part, notches, LP, HP
WMS: Lock-in, Modulation
MTS: Same as WMS from Stabilizer perspective
FMS/PDH: Demodulation with Pounder, and then it's either DC (nothing new) or superhet-to-DC (then MTS/WMS)
Phase lock: Lock-in R/phi, phase unwrap
Complex MIMO lock: Some beat input, demodulate superhet with Pounder, feedback amplitude error on AOM amplitude and phase error on AOM FM/PM and slow frequency error on laser piezo/current/temperature DAC output.
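For the lock-in and phase lock use cases, the core arithmetic can be sketched as follows. This is a minimal floating-point illustration, not Stabilizer firmware code: the function names `lockin` and `unwrap`, the plain averaging used as the low-pass filter, and the sign convention for the quadrature are all assumptions made for the example.

```python
import math

def lockin(samples, f_ref, f_s):
    """Demodulate `samples` at f_ref (sample rate f_s) and average.

    Returns (R, phi) for an input A*cos(2*pi*f_ref*t + phi).
    Averaging over the whole record stands in for the low-pass filter.
    """
    i_sum = 0.0
    q_sum = 0.0
    for n, x in enumerate(samples):
        arg = 2.0 * math.pi * f_ref * n / f_s
        i_sum += x * math.cos(arg)
        q_sum -= x * math.sin(arg)
    # The factor 2 recovers the amplitude of a real sinusoid.
    i_avg = 2.0 * i_sum / len(samples)
    q_avg = 2.0 * q_sum / len(samples)
    return math.hypot(i_avg, q_avg), math.atan2(q_avg, i_avg)

def unwrap(phases):
    """Remove 2*pi jumps from a sequence of phases in (-pi, pi]."""
    out = []
    offset = 0.0
    prev = None
    for p in phases:
        if prev is not None:
            if p - prev > math.pi:
                offset -= 2.0 * math.pi
            elif p - prev < -math.pi:
                offset += 2.0 * math.pi
        prev = p
        out.append(p + offset)
    return out
```

The record length should cover an integer number of reference periods, otherwise the averaging leaves a residual at twice the reference frequency.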

I/O Architecture

The acquisition and transmission of ADC/DAC/DDS samples should be pushed into the peripherals as much as possible
The ADC should run at max rate (allowed by peripherals and hardware), the processing interrupt can run at a submultiple.
ADC samples should end up in a CPU-accessible buffer (timer+DMA+SPI) at least as long as the relative ADC sample/processing rate ratio.
DAC samples should be taken from a memory buffer (timer+DMA+SPI) at least as long as the relative DAC sample/processing rate ratio.
DAC sample emission should be triggered by the processing routine or by a fixed timer.
DDS should transfer a memory block of register settings (one entire profile consisting of frequency, phase, and amplitude on all four channels) to the DDS chip via timer+DMA+QSPI at max throughput. Triggered by processing routine or timer.
Two-level modulation of the DDS would also be useful but is somewhat restricted by our design (P0,P1 on non-timer pins)
Digital I/O should also be handled by timer+DMA. This applies to the DDS SYNC_CLK PA0 (ETR and "reverse" timestamping of the free CPU clock) and the digital front panel inputs (flags, timestamping).
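The relation between sample rate, processing rate and buffer length in the list above can be illustrated with a small model. The rates used in the usage note and the helper names `batch_size` and `run_batches` are hypothetical; the real buffers would be filled and drained by timer+DMA+SPI rather than by Python lists.

```python
def batch_size(sample_rate_hz, processing_rate_hz):
    """Minimum buffer length: samples arriving per processing call."""
    ratio, remainder = divmod(sample_rate_hz, processing_rate_hz)
    if ratio < 1 or remainder != 0:
        raise ValueError("processing rate must be an integer submultiple of the sample rate")
    return ratio

def run_batches(adc_samples, n, routine):
    """Call `routine` once per complete batch of n ADC samples.

    `routine` returns the DAC samples for that batch. An incomplete
    trailing batch is left unprocessed, as it would still be in flight.
    """
    dac_samples = []
    for start in range(0, len(adc_samples) - n + 1, n):
        dac_samples.extend(routine(adc_samples[start:start + n]))
    return dac_samples
```

For example, an assumed 1 MHz ADC rate with a 125 kHz processing interrupt gives `batch_size(1_000_000, 125_000) == 8`, so each buffer needs at least 8 entries.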

Processing blocks

With the ADC, DAC and DDS data available in memory buffers for the CPU to consume/produce without delay and overhead, the processing should consist of routines called at a configurable rate. The partitioning of the processing into routines may reflect the graph partitioning (a low-latency single-biquad at high rate between one ADC-DAC pair and a slower demodulating multi-biquad at lower rate between another ADC-DAC pair). How each routine handles the ADC samples (one to multiple) and generates the DAC/DDS data (one to multiple) can be configured by linking up the processing blocks and configuring them. The data path flexibility may require compile-time configuration in some cases. In general there will be processing blocks that can be inserted and wired up in many different ways in the datapaths.

There may be some kind of configuration language involved here to describe the processing graph(s) and the settings.
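To make the idea of a processing graph description concrete, here is one possible shape for it. Everything in this sketch is hypothetical: the block names (`gain`, `sum`, `offset`), the node names, and the dictionary layout are invented for illustration and do not correspond to an existing Stabilizer configuration format.

```python
# Hypothetical block library: each block maps its inputs to one output value.
BLOCKS = {
    "gain": lambda params, inputs: params["k"] * inputs[0],
    "sum": lambda params, inputs: sum(inputs),
    "offset": lambda params, inputs: inputs[0] + params["value"],
}

# Hypothetical graph: node name -> (block type, parameters, input node names).
# Nodes are listed in evaluation order; "adc0" and "adc1" are external inputs.
GRAPH = {
    "err0": ("offset", {"value": -0.5}, ["adc0"]),
    "p0": ("gain", {"k": 2.0}, ["err0"]),
    "ff": ("gain", {"k": 0.1}, ["adc1"]),
    "dac0": ("sum", {}, ["p0", "ff"]),
}

def evaluate(graph, inputs):
    """Evaluate every node once for one set of input samples."""
    values = dict(inputs)
    for name, (kind, params, sources) in graph.items():
        values[name] = BLOCKS[kind](params, [values[s] for s in sources])
    return values
```

Calling `evaluate(GRAPH, {"adc0": 1.0, "adc1": 2.0})` computes all intermediate signals and the `dac0` output. A compile-time variant would resolve the same description into fixed function calls instead of a run-time loop.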

TODO: boil all these down into a common DSP language

External modulation reference signal input for lock-in (really needed?, PLL or "edge timestamp wide-band"?) #60
Modulator with analytic waveform from ext-ref PLL or internal ref #60
Arbitrary (sample-based) waveform #60
DAC Modulator #60
FM/PM modulation through Pounder #60
Lock in detection (demodulation and filtering, phase choice, harmonic) #60 #80
FIR (internal, fixed-frequency demodulation) #60 #80
Dynamic demodulation with analytic waveform from PLL #80
Multiple biquad IIRs (typically in series, for anti-windup) #71
A big signal crossbar (IIR ins/outs, offsets, ADCs, DACs, modulations ...) to address muxing and routing, like redpid
Timestamping/counting digital inputs for #80, https://github.com/sinara-hw/Pounder/issues/76
A flag crossbar (hold #70, rail, scan/relock/reset, digital I/O) like redpid
Scanning #86
Transfer function measurement: log sine modulator sweep, data streaming, and host processing #296
Relock (Algorithms from nist-digital-servo or redpid appear robust)
Data streaming (TBD: configure a UDP target via MQTT and open a fire hose of UDP data, constraints?) #150
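As an illustration of the "multiple biquad IIRs in series, for anti-windup" item, the following sketch clamps the output of each stage and feeds the clamped value back into the filter state. It is a generic direct form 1 biquad in floating point, assumed for the example; the class name `Biquad`, the `cascade` helper and the coefficient layout are not taken from the firmware.

```python
class Biquad:
    """Direct form 1 biquad with output limits (a0 normalized to 1)."""

    def __init__(self, b, a, y_min, y_max):
        self.b = b  # (b0, b1, b2)
        self.a = a  # (a1, a2)
        self.y_min = y_min
        self.y_max = y_max
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def update(self, x):
        b0, b1, b2 = self.b
        a1, a2 = self.a
        y = b0 * x + b1 * self.x1 + b2 * self.x2 - a1 * self.y1 - a2 * self.y2
        # Clamp before storing: the state never exceeds the limits,
        # so an integrator cannot wind up beyond the output range.
        y = min(max(y, self.y_min), self.y_max)
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y

def cascade(stages, x):
    """Run one sample through several biquads in series."""
    for stage in stages:
        x = stage.update(x)
    return x
```

With `b = (0.1, 0, 0)` and `a = (-1, 0)` a stage is a pure integrator. After saturating at the upper limit it leaves the rail on the first sample of opposite sign, which is the behaviour anti-windup is meant to provide.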

Settings/Telemetry

tracking #149
Monitoring outputs (ADC, error, scan time base, R from lockin, signals and flags on the crossbar)
CLI network API as reference implementation
GUI

ryan-summers commented 3 years ago

I/O Architecture Design

A quick mock-up of the design: stabilizer

Background

There are two DMA peripherals available to us, each with 8 DMA streams, which allows for a total of 16 DMA streams. Each DMA stream has a configurable trigger, source/destination address, and transfer size.

Each DMA stream can operate in circular mode. This means that when a transfer is complete, the DMA configuration is automatically reloaded and can be triggered again without software intervention.

Each DMA stream can operate in double-buffered circular mode. This means that when a transfer is complete, the DMA configuration is automatically reloaded with alternating buffers A/B (where the new buffer is always the opposite buffer of the previous) and can then be triggered again without software intervention.
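The alternating A/B behaviour can be modelled in a few lines. This is a software model only, assuming one sample is written per trigger; the class `DoubleBufferedStream` and its method names are hypothetical and do not represent any hardware register interface.

```python
class DoubleBufferedStream:
    """Software model of a double-buffered circular DMA stream."""

    def __init__(self, n):
        self.buffers = [[0] * n, [0] * n]
        self.active = 0  # index of the buffer currently being filled
        self.index = 0
        self.transfer_complete = False

    def dma_write(self, sample):
        """One triggered transfer: store a sample, swap buffers when full."""
        self.buffers[self.active][self.index] = sample
        self.index += 1
        if self.index == len(self.buffers[self.active]):
            self.index = 0
            self.active ^= 1  # reload with the opposite buffer
            self.transfer_complete = True

    def take_completed(self):
        """Clear the flag and return a copy of the buffer that just filled."""
        self.transfer_complete = False
        return list(self.buffers[self.active ^ 1])
```

While software works on the completed buffer, further `dma_write` calls land in the other one, so no re-arming is needed between batches.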

Below, streams are numbered 0-15: streams 0-7 belong to DMA1 and streams 8-15 belong to DMA2.

ADC Inputs

In order to facilitate ADC sampling automatically, the following architecture is proposed:

DMA streams 0 and 1 each trigger off a timer to initiate ADC sampling. These set the CSTART bit in the SPI peripherals for the ADCs as currently implemented.
DMA streams 2 and 3 are each configured to read N samples from the SPI RX FIFO. When the transfer completes, it generates an interrupt for user software processing.
When user software processing begins, it immediately begins a new transfer into an alternate buffer for the next N ADC samples. After initiating the ADC sampling again, it processes the current N samples.

This method requires an update of the DMA dest + length registers (and an enable), which is non-zero DMA interaction. While it is possible to configure DMA to run continuously in a double-buffered mode, this wouldn't inform us of input overflow events.

This implementation would rely on the 8- or 16-byte internal RX FIFO of the SPI peripheral. There is a small latency between the completion of the first N ADC samples and initiation of the transfer for the next N. The internal SPI FIFO should be sufficient to buffer this latency out.

With the above proposed configuration, we could use the SPI RX FIFO threshold interrupt as "ADC input overrun" detection.
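Whether the RX FIFO can absorb the re-arm latency is a matter of simple arithmetic, sketched here. The sample width, sample period and latency values in the usage note are assumptions chosen for illustration, not measurements, and `fifo_headroom` is a hypothetical helper name.

```python
def fifo_headroom(fifo_bytes, sample_bytes, sample_period_ns, rearm_latency_ns):
    """Free FIFO slots left after the DMA re-arm latency.

    A negative result means samples would be lost (input overrun).
    Integer nanoseconds avoid floating-point rounding in the ceiling.
    """
    capacity = fifo_bytes // sample_bytes
    # Worst case: samples arriving while no transfer is armed, rounded up.
    arriving = -(-rearm_latency_ns // sample_period_ns)
    return capacity - arriving
```

Assuming 2-byte samples at a 1000 ns period, a 16-byte FIFO tolerates a 2500 ns re-arm latency with `fifo_headroom(16, 2, 1000, 2500) == 5` slots to spare, while an 8-byte FIFO with a 4500 ns latency gives `-1`, i.e. an overrun.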

DAC Outputs

Similar to the ADC inputs, the DAC SPI outputs operate with a timer configured to write the CSTART bit of the DAC SPI at a regular interval. This uses DMA streams 4 and 5, one per DAC.
Similar to ADC inputs, DMA streams 6 and 7 are used to write DAC samples to the SPI peripheral N samples at a time.
When user software has samples to send, it initiates a new DMA transfer to write all of the samples.

In this configuration, the SPI TX FIFO threshold can be used as an interrupt to detect output underrun events.

This method requires an update of the DMA source + length registers (and an enable), which is non-zero DMA interaction. While it is possible to configure DMA to run continuously in a double-buffered mode, this wouldn't inform us of output underflow events.

Discussion

With the proposed architecture, DMA register updates (address + length) are required at the start of user data processing as well as the end (when staging output samples). While this is a non-zero amount of overhead, it does provide us with hardware detection of under- and over-flow events.

If we would like truly zero CPU overhead for input/output buffering, that can be accomplished by sacrificing hardware detection of under- and over-flow events and using circular, double-buffered DMA streams (which is a possible DMA stream hardware configuration).

ryan-summers commented 3 years ago

As an expansion to the above, I believe we can have the input operate in double-buffered circular mode. In this configuration, user software can detect an over-run condition on the input by checking the DMA transfer complete flag after execution. If user software takes too long and the DMA transfer completes first, it means that the data has likely been corrupted. This can be used to supplement the SPI hardware overrun detection.
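The software overrun check described above can be modelled as a timing comparison. The sketch assumes that processing starts exactly when a transfer completes and that the flag is cleared at that moment; `count_overruns` and the microsecond figures are hypothetical.

```python
def count_overruns(batch_period_us, processing_times_us):
    """Count processing calls that finish after the next transfer completed.

    The transfer-complete flag is cleared when processing starts. The DMA
    sets it again one batch period later. If it is already set when the
    routine finishes, the buffer being processed may have been overwritten.
    """
    overruns = 0
    for t in processing_times_us:
        transfer_complete_at_exit = t >= batch_period_us
        if transfer_complete_at_exit:
            overruns += 1
    return overruns
```

For an 8 us batch period and processing times of 3, 5, 9, 8 and 2 us, the check flags the two calls that took 8 us or longer.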

I do not currently think this is a feasible alternative for the DAC outputs because it's not clear how we could detect if uninitialized data was unintentionally written to the DAC SPI TX FIFO. The main issue is that we would always be theoretically racing to put data into the DMA output buffer right before it swapped over to start transferring it. There's sufficient jitter here that we might miss a sample every so often. I think it makes more sense here to rely on standard, single event transfers for the output.

dnadlinger commented 3 years ago

Just to throw it out there, regarding the "modulator" bullet point: Another use case that's quite important to us (i.e. already deployed) is to modulate the PI setpoint (i.e. input offset) from a PLL locked to a 50 Hz reference at a digital input, with a number of 50 Hz harmonics of configurable complex amplitude.
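The setpoint modulation described here amounts to summing harmonics of the PLL phase with complex weights. The sketch below assumes the convention Re(a_k * exp(i*k*phase)) for harmonic k and a PLL that delivers the fundamental phase in radians; the function name and this convention are assumptions, not the deployed implementation.

```python
import cmath

def setpoint_modulation(phase, amplitudes):
    """Setpoint offset for a given PLL phase of the 50 Hz fundamental.

    `amplitudes[k - 1]` is the complex amplitude of harmonic k. Its
    magnitude sets the size and its argument the phase shift.
    """
    return sum(
        (a * cmath.exp(1j * k * phase)).real
        for k, a in enumerate(amplitudes, start=1)
    )
```

With `amplitudes = [1 + 0j, 0.5j]` the fundamental is a unit cosine and the second harmonic has amplitude 0.5 and a 90 degree phase lead relative to its cosine.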

jordens commented 3 years ago

@dnadlinger Thanks. As you know there are dozens of different use cases. We now need to find people willing to work on this together. After throwing it out there, the next step would be for you to analyze your use case, break it down into components, see how the existing components match, and contribute the additional/improved concepts to a joint roadmap.

jordens commented 3 years ago

A couple points on your analysis above @ryan-summers :

Latency/Queue level: Yes, the buffers between processing routine and peripherals/DMA are queues. But low latency is important and therefore empty queue levels are required at certain points and are desirable. They are not useful as a global indicator of underflow. When the routine starts the input queues must contain a certain number of samples (preferably not more and definitely not less) and the output queues must have a certain amount of space (ratio between routine period and sample rates), as close to empty as possible. When the routine is done, the output queues must have been fed a certain number of samples and the input queues read a certain amount. We need to verify that the notion and detection of overflow/underflow does not increase latency.
An unverified and rough idea: consider running the ADC SPI in slave mode and drive CS (and CLK bursts) from timer channels. That may remove the need for those weird CSTART gymnastics and free up DMA channels/bus traffic. The samples would just end up in the SPI RX buffer. Maybe similarly for DAC SPI. But maybe also the pins are not on timer output channels.
Other than these, I think we have a good common understanding of the I/O architecture.

dtcallcock commented 3 years ago

I just spotted this product: https://liquidinstruments.com/instruments/

Might give some high-level ideas.

ryan-summers commented 3 years ago

Support for the updated IO acquisition is implemented in https://github.com/quartiq/stabilizer/pull/165 - with the changes, it appears that there is substantially more time available for DSP-related activity for each individual sample.

It will likely take some time to stabilize the changes in our dependencies before merging this in, but that will provide us a lot of flexibility in terms of building applications on top of the general I/O interface.

quartiq / stabilizer

[meta] use case analysis #147
