Closed jordens closed 8 years ago
As mentioned in #12, slow parallel digital buses to daughtercards to be sent as high-speed differential over these connectors using standalone SerDes chip pairs e.g. http://www.ti.com/product/DS92LV3241. In this instance might follow the deserializer with a buffer to increase drive power, potentially with an output enable controlled by deserializer LOCK pin. All signals through connector to be high-speed differential.
Here are my notes from April, with some modifications (check for errors):
16x (8 each) differential pair – JESD204B data lines 2x (1 each) differential pair – SYNCOUT line for JESD204B synchronization 3x single-ended lines – SPI CLK/MOSI/MISO 2x single-ended lines – SPI chip select 1x differential pair – DAC clock to FPGA (the SYSREF signal for deterministic latency will be generated on the RTM, but the DAC clock also needs to be fed back into the FPGA to ensure that it clocks data out at the correct rate) this may not be necessary given the current Sinara clocking scheme, TBA
8x (4 each) differential pair – JESD204B data lines 2x (1 each) differential pair – SYNCb line for JESD204B synchronization 3x single-ended lines – SPI CLK/MOSI/MISO (can be shared with DAC SPI in a pinch) 2x single-ended lines – SPI chip select 1x differential pair – ADC clock to FPGA (to give input clock for JESD204B at FPGA) again may not be necessary with current Sinara clocking plan 1x differential pair – SYSREF to FPGA (allows absolute time synchronization)
So we are looking at a total of:
Assume 4 analog daughtercards, each with 16 digital IO lines, for a total of 64 total digital IO (taking up 32 differential pairs’ worth for direct transmission).
This gives a total payload of 68 differential pairs’ worth for a fully loaded RTM. If you use the spec’ed connector from the standard that have 40 differential pairs each (pp.12-13 of https://www.slac.stanford.edu/grp/lcls/controls/docs/hw/uTCA/Standards/PICMG_MTCA_4_R1_with%20errata_SLAC.pdf), you get 74 differential pairs available (6 pairs are reserved for other purposes in the spec). The connector is here http://www.te.com/usa-en/product-6469002-1.html, and is rated for signaling up to 12 Gbps, which is suitable for our purposes. It's designed for differential signals.
So now you have 6 differential pairs left over for additional signaling to the RTM. Another option would be to use a SERDES chip on the RTM, allowing you to send commands that are not as time-critical using a single high speed differential pair (IOSERDES is still OK for this, no need for a gigabit transceiver), which would then be fanned out as a slow parallel bus on the RTM card using the appropriate receiver. It might be simpler from a gateware standpoint to just use a chipset Ser/Des pair, such as detailed here: http://www.ti.com/product/DS92LV3241, which turns 32 LVCMOS channels at 40-85 MHz bus speed into 4 differential pairs with embedded clock (with scrambling and DC balancing for EMI reduction). This chipset does all clock recovery/alignment/calibration internally, so the FPGA could just output as normal LVCMOS RTIO outputs, provide a suitable clock (ideally a submultiple of the RTIO clock) as reference for the SERDES chips, and the pulses would be output transparently at the daughtercards on the RTM with deterministic latency.
This would let us send 64 digital IO lines to the daughtercards using 8 differential pairs only, which would free us up to either offer more digital IO on the daughtercards and/or to have more lines free for other control/communications stuff, or future applications. The serialization/deserialization latency on this chipset is deterministic, which is necessary for our application. Also, sending balanced differential signals over the RTM connector is probably going to be better for crosstalk than large-swing single-ended LVCMOS.
We can debate whether to send the single-ended signals for the DACs/ADCs (e.g. SPI) over a SERDES as well, or to use LVDS transceivers to make them differential for reduced crosstalk, or to just send them through as LVCMOS.
Why would you need fast IO for mezzanines? Wouldn't be SPI extenders enough?
I’d route via RTM connector general purpose SPI + mux address lines for chip selects to serve all possible SPI chips directly
Then add I2C with muxes routed to all mezzanines
And only for general purpose IOs that may need MHz signaling rate, use SERDES chips.
On each mezzanines we would need to define SPI and I2C lines + some general purpose IO.
SPI or I2C over serdes could be an overkill.
We can also use DS92LV3221 / DS92LV3222 which require only 2 diff lines per chip. We can connect them directly to the FPGA and make serialization in logics. Otherwise we will loose plenty of IO pins
The analog daughtercards/mezzanines need to have "fast" digital IO, ~50 MHz bandwidth. We will want to program digital attenuators over SPI on the fly for fast range switching, and we also need to be able to turn on/off the RF switches at precise times. I think we should have all lines going to the mezzanines capable of this kind of bandwidth (with the exception of I2C lines) for maximum flexibility. Speeds only need to be as fast as typical switching times/clock speeds for the devices. RF switches go in about ~10-20 ns. HMC542BLP4E attenuator has a maximum serial clock speed of 30 MHz.
I would not distinguish between SPI and GPIO on the mezzanine connectors; this can be reconfigured in the AMC FPGA bitstream to match the needs of the particular mezzanine. For example, some mezzanines might want to use a serial-programmed digital attenuator but others a parallel-programmed one (because their other specs are more appropriate for a given application, say). If we just make all lines (except I2C, see below) bidirectional with min 50 MHz bandwidth, it should cover all use cases I can envision.
Because there will be a number of potential SPI devices on the RTM (DACs, ADCs, PLLs, plus chips on the mezzanines), I think it would be wise to route multiple separate SPI buses from the AMC so that programming can happen in parallel. In particular, digital upconversion in the DACs and digital attenuator values on the mezzanines will likely need to be changed within a short period of time in a typical experiment. It would be a shame to build this fancy hardware only to have it choke on needing to wait for an over-shared SPI bus. I think a dedicated SPI bus for each of the two DACs makes sense at a minimum, with a shared SPI bus between each ~pair of mezzanine cards (instead of one for all four). Then one more SPI bus can be shared between ADCs, PLLs, and other devices, where there will not be a need for rapid timing-critical reprogramming.
For I2C, which is slow, I think we can get away with a single I2C bus that goes through the RTM connector and is split out using an I2C bus switch (http://www.ti.com/product/PCA9548A) to talk to the different mezzanine cards. The driver for this I2C bus switch is already written as a part of ARTIQ. We would pick two pins on the mezzanine connectors to dedicate to SDA and SCL.
@dhslichter The clocks you mention are necessary, though we can look into generating the "ADC/DAC clocks" (in fact the reference clocks for the transceivers) on the AMC if we are really out of RTM pins. In addition to those, we may want another SYSREF for the DACs (it is not clear whether we always want the same SYSREF for DACs and ADCs, so better keep options open) plus at least one backup clock line.
OK, so we route the I2C tree with addressable muxes. Can we simplify the things and make the mezzanine IO pins with defined direction? The 32-bit SERDES chips both 2 and 4 lane have defined direction. Even if you combine the two, the direction is still fixed. One would need to add the 3-state buffers and another register for direction control. And another SERDES chip for readout of the IO. ANd on the FPGA side, still another SERDES with inputs. So this complicates a lot. If we define 8 input pins and 8 putput pins per mezzanine, this would simplify a lot. From FPGA side we can use IOSERDES for transmission and IOSERDES with external clock for reception, so CDR won't be needed. If we use pair of SERDES chips on both siides, we'll loose plenty of FPGA pins and in this way either we won't have additional SDRAM or the FMC. Since SERDES mechanism is transparent from user point of view and with fixed IO direction, we can implement SPI on top of it. I2C must be done separately but we have dedicated pins on RTM for that.
I suppose that those SERDES chips will introduce latency non-determinism in SPI transfers. Are they really necessary?
why they would add non-determinism? They are very simple devices, without buffering, FIFOs and so on.
How does the receiver perform comma alignment? Does it program the divider at the output of the CDR that produces the character clock, or does it keep whatever phase the divider generates and reshuffles the bits? The latter mechanism produces a latency non-determinism (across reboots) of the duration of one character.
One more thing - for ADC and DAC SPI config, one can think of I2C - SPI bridges. I use them in some designs to simplify the startup configuration scheme http://www.nxp.com/products/interface-and-connectivity/interface-and-system-management/bridges/spi-slave-to-i2c-master-gpio-bridges/i2c-bus-to-spi-bridge:SC18IS603IPW
We can also add a small FPGA (e.g. 6SLX4 or Lattice) on the RTM side to handle SPI-over-SERDES.
Yes, we can do that easily. But it is another chip to program and to maintain. From my point of view it is simpler. If you thing it is simpler for you, let's go for it. In case of the SERDES they don't mention any non-deterministic latency http://www.ti.com/lit/ds/symlink/ds92lv3241.pdf
For this FPGA we would also deliver the clock. In this way we'll get rid of any non-determinism. How would you want to programm this FPGA? Would it have its own FLASH config or would be loaded over JTAG from AMC FPGA?
We can design deterministic receivers in FPGAs that do not require a clock, but having the clock makes a CDR unnecessary (so we don't need the more expensive FPGAs). Loading via JTAG or some other serial protocol from the AMC sounds good.
I would actually prefer another protocol (e.g. Xilinx "slave serial", I don't know if Lattice has an equivalent) since then we can connect a debug probe to JTAG.
In our designs we use Xilinx engine that converts bit file to the JTAG commands. We lack pins in the RTM connector and JTAG is there by default.
For inter-chip communication we used Aurora IP core from xilinx. I needed to transfer 512 signals with 4MHz update rate between 2 chips. We can imagine similar situation here, but the update rate would be higher. How many LVDS links do you need between the two FPGAs?
If we use the TI chips, is there even a way to connect the serial sides to the Sayma FPGA directly that does not involve reverse engineering or unfriendly paperwork? I cannot find documentation for the protocol.
For reference...
As discussed on the phone today:
Initial A7 gateware version will probably use a dumb protocol similar to drtio_transceiver_demo.
We can connect the RTM FPGA to the Kintex US config memory and configure them together in one process. Once Kintex finishes config it starts Artix config from same memory. Number of pins utilized is similar but we keep same memory and same update scheme. http://www.xilinx.com/support/documentation/user_guides/ug570-ultrascale-configuration.pdf page 193 it says "UltraScale™ FPGAs can be daisy-chained with earlier 7 series families." so in theory this should work
One more thing - for ADC and DAC SPI config, one can think of I2C - SPI bridges. I use them in some designs to simplify the startup configuration scheme
I would prefer to keep all SPI buses with the ability to run at the maximum speeds allowed by the various chip specs, at least for the DAC configuration. PLL/ADC is less time critical.
It's already decided - we connect all the stuff to the ARTIX FPGA on RTM.
Great sounds good.
I would like to log my doubts about the usefulness of the "dumb protocol from drtio_transciever_demo". It's speed scaling with the number of pins, the implication for the time alignment as well as the dead-ended-ness of said "dumb protocol" were brushed away quickly.
@gkasprow Please put another Si5324 on the 7A15T, connected as usual with one output to the IBUFDS_GTE2 and the other output to a clock-capable general-purpose input.
I see three problems with daisy-chaining:
Daisy chaining is simply putting the downstream FPGA in slave serial mode and using hardwired logic in the upstream FPGA to send its configuration data. I would simply ignore that hardwired logic and implement the functionality in the fabric instead, which avoids the above problems. Please connect the second FPGA accordingly. You may want to add jumpers or 0-ohm resistors that connect DONE/INIT_B/PROGRAM_B for daisy-chaining so that the hardware can still support that option. CCLK is accessible via STARTUPE3 and DOUT is a general-purpose I/O.
And 32 bit addressing (required for flashing more than the KU040 bitstream) is also not supported or planned in neither openocd nor misoc/runtime.
OK, so I will add jumpers that either will enable direct RTM Artix config from Kintex IO pins of daisy chain config with Kintex as master and Artix as slave.
Here are the questions that I jotted down during Friday's discussion of related to putting FPGA on RTM.
OK, So I placed SI5324 and connected one clk output to the MGTREFCLK1 and another output to the MRCC_14 Where do you want to have the CLK1 and CLK2 of the Si chip connected?
I need only one clock input and connected to a general-purpose I/O of the FPGA.
This is handled by m-labs. Resolved. Doesn't impact hardware.
That clock input must be CLKIN1, the current schematics do the right thing.
Summarize number and types of signals on RTM. <<< THIS IS A DRAFT, @jordens, @gkasprow, @sbourdeauducq please review >>>
Xilinx Artix7 (A7)
AD9154
AD9656
low-noise power
TODO