Closed by hartytp 7 years ago
Do you mean those aluminum core PCBs or do you mean proprietary IP blocks in the gateware/silicon?
The latter -- although I was thinking more along the lines of an external ARM core connected to the FPGA (IIRC, this was discussed previously).
This is m-labs/artiq#535. There are numerous issues that make "putting hard cores" onto those boards very difficult. An incomplete list:
a) Everybody says "I want an ARM core" but we've always had problems figuring out exactly why or how they want to use it.
b) They have high latencies.
c) Zynq-style silicon also takes away Ethernet, RAM etc. from the fabric. That sabotages our design.
d) It's unclear whether our dual-CPU architecture translates at all.
e) It's too late in the process to incorporate it.
f) Gateware-accelerated solutions for those DSP applications are likely better solutions for the problems at hand.
But yeah. If somebody still wants them, we are open to discuss the details.
m-labs/artiq#535 is about adding floating point instructions to the gateware mor1kx.
As for putting hard CPU cores on the Sinara hardware, they would go on FMC with something like this: http://www.4dsp.com/FMC645.php But yes, someone has to figure out how to interact with it and what it is supposed to do.
@jordens to answer (a) here is the problem I am worrying about - if there is a solution that does not involve a hard core I am equally happy with that.
Many of the experiments I am writing at the moment are running into pulse rate limitations. For release 2.0 the minimum sustained pulse period for a single TTL channel (as per the PulseRate test) is ~600 ns. This means that by the time one has a little more control flow in the loop, and uses 5-10 TTL channels, the minimum loop execution time is ~10 us.
To give an example of a problem we have in the lab right now: we want to spin in a tight loop of state-prep, excite, and branch if there was a detection event, otherwise repeat. This involves ~10 TTL signals. Our apparatus would allow us to repeat this loop with a period of ~1 us, but ARTIQ limits us to >~10 us.
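As a back-of-envelope check that the per-event cost alone explains the observed loop period, here is a toy budget model. The ~600 ns figure is from the PulseRate test quoted above; the 2 us control-flow overhead is an illustrative assumption, not a measurement:

```python
# Toy loop-time budget (numbers from the discussion above where noted;
# control_flow_ns is an assumed placeholder, not measured).
EVENT_COST_NS = 600  # sustained per-TTL-event cost (PulseRate test, release 2.0)

def loop_time_ns(n_ttl_events, control_flow_ns=2000):
    """Estimate the minimum loop period for a kernel that emits
    n_ttl_events TTL events plus some branching/control-flow overhead."""
    return n_ttl_events * EVENT_COST_NS + control_flow_ns

# ~10 TTL events per state-prep/excite/detect iteration:
print(loop_time_ns(10))  # 8000 ns, i.e. order-10-us loops as described above
```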
Even for experiments that do not involve frequent branching, and hence allow the processor to fill up the FIFOs during the slack times, we often run into underflow errors which require padding delays.
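To make the underflow mechanism concrete, here is a deliberately simplified slack model (all names and numbers are illustrative; real RTIO FIFO semantics are more involved). Each event submission costs more wall-clock time than the gap between scheduled timestamps, so the initial slack erodes until events arrive late:

```python
def check_underflow(event_times_ns, submit_cost_ns=600, initial_slack_ns=1000):
    """Walk a list of scheduled event timestamps and report which events
    the CPU cannot submit before their deadline (toy model only)."""
    wall = -initial_slack_ns  # CPU starts ahead of the timeline by the slack
    late = []
    for t in event_times_ns:
        wall += submit_cost_ns      # wall-clock time spent submitting this event
        if wall > t:                # submitted after its scheduled time: underflow
            late.append(t)
    return late

# Events every 500 ns but each submission costs 600 ns:
# the 1 us of slack is eaten at 100 ns per event, then underflows begin.
print(check_underflow([500 * i for i in range(1, 16)]))
```

With these numbers the first ten events squeak through and everything from 5.5 us onwards underflows, which is exactly the situation padding delays paper over.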
These problems are only going to get worse as experiments get more complicated, and require more branching, with more complicated decisions at the branch.
My understanding (correct me if I am wrong) is that we can push the soft-core from the current 125 MHz to perhaps 200 or 250 MHz, but no further - to go faster than this we would need a hard CPU.
Suggestions?
This FMC DSP module requires an HPC FMC connector and offers either an EMIF or a gigabit SRIO interface.
For deterministic access, EMIF seems the better choice, but to integrate this module with Sayma we would need a detailed datasheet to check which pins are used for EMIF.
On the other hand, we have experience with FPGA-DSP communication over SRIO, but that requires transceivers, which we don't have connected to the FMC.
The DSP card is just an example of a CPU mounted on a FMC, I'm not proposing this particular one.
@hartytp
One could use an ARM CPU with a large on-chip SRAM, which should be more deterministic than external SDRAM, provided that 10 MB of memory is enough.
https://www.renesas.com/en-eu/products/microcontrollers-microprocessors/rz/rza/rza1h.html
@jordens
Are there ways ARTIQ performance could be improved (via changes to gateware or additional hardware) other than through DMA or application-specific gateware acceleration? I'm not saying there's a need for it or that it's a sensible thing to do, I just want a sense of what it'd involve.
AFAICS there are a bunch of different performance areas, each with different solutions.
Potentially, there is also the option of a tighter coupling of the RTIO core with the CPU, so it doesn't have to program all the RTIO CSRs through external bus transactions at every event.
Yes. That's what I meant with "improved register layout and tweaked event submission code".
@jordens The issues I am primarily worried about are:
Raw event rate: I am happy with using DMA to fix the occasional tight part of a sequence. However I am already having to add padding delays all over the place in my current experiments to fix the timing. I definitely do not want to end up using DMA for everything - this is conceptually messy, and pushes more work to the user to manage the flow.
Maths: I agree that this could be handled by an external processor/FPU, modulo the complexity to the user of marking up where different bits of code need to run.
Reaction latency: A lot of this is in gateware, which can be improved. However, for all but the simplest use cases one needs to do some maths, which hurts currently.
I understand that the difficulty with a hard CPU is getting a low latency / high bandwidth interface, which pushes us to either an external co-processor (with potentially high latency) or a FPGA with a hard core (e.g. Zynq).
It seems like there are firm advantages to using an FPGA with a hard core (low latency, decent maths performance, no need for nasty 'offload CPU' complexity for the user). Are there good technical reasons, apart from a general dislike for closed-source black boxes, to not strongly consider e.g. a Zynq?
This obviously involves significant changes to the gateware, but it feels that this is not grossly different from the effort required to write a soft FPU, or write a mechanism to pass jobs to an offload CPU.
What am I missing?
@jordens Would you be willing to write a for-contract issue over on the artiq repository for "improved register layout and tweaked event submission code"? I agree with cjbe that it'd be nice not to have to use DMA everywhere to get a raw event rate out of artiq that's adequate for the bulk of experiments. It's not clear whether 'factors of a few' will be enough but I feel it might be so it seems worth exploring.
I definitely do not want to end up using DMA for everything - this is conceptually messy, and pushes more work to the user to manage the flow.
The last point can be improved if the compiler extracts and pre-computes DMA sequences, either automatically or with another context manager that you simply wrap around the sequence (no manual/explicit preparation and replaying of DMA segments).
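To illustrate what "simply wrap around the sequence" could look like, here is a sketch with a fake DMA core. Everything here (`FakeDMACore`, `dma_recorded`) is hypothetical and is not the real ARTIQ DMA API; it only shows how a context manager could hide the explicit record/replay bookkeeping from the user:

```python
from contextlib import contextmanager

class FakeDMACore:
    """Stand-in for a DMA engine: records (channel, data) events into a
    named trace, which could later be replayed. Purely illustrative."""
    def __init__(self):
        self.traces = {}
        self.recording = None

    def record(self, name):
        self.recording = name
        self.traces[name] = []

    def emit(self, channel, data):
        self.traces[self.recording].append((channel, data))

    def stop(self):
        self.recording = None

    def playback(self, name):
        return list(self.traces[name])

@contextmanager
def dma_recorded(core, name):
    """Hypothetical 'wrap the sequence' interface: the runtime records the
    enclosed events once; later iterations could replay the stored trace
    without manual preparation by the user."""
    core.record(name)
    try:
        yield core
    finally:
        core.stop()

core = FakeDMACore()
with dma_recorded(core, "cooling") as c:
    c.emit(0, 1)  # e.g. TTL channel 0 goes high
    c.emit(0, 0)  # TTL channel 0 goes low
print(core.playback("cooling"))  # [(0, 1), (0, 0)]
```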
Are there good technical reasons, apart from a general dislike for closed-source black boxes, to not strongly consider e.g. a Zynq?
As mentioned or hinted above:
which pushes us to either an external co-processor (with potentially high latency)
If the external processor has a good (synchronous, high bandwidth, low latency) external bus interface, it could potentially run the kernels better than a Zynq core does.
Low-latency is unclear
With a simple bare-metal C program that sets up an edge-triggered interrupt on a pin on the hard GPIO controller and mirrors the state to a pin bound to a simple AXI-mapped register, I've previously measured ~80 ns pin-to-pin latency on a Zynq 7010. This was in a different setting without any care taken to optimise the code, but it might be useful as an upper bound.
I believe ETH are getting 70-80 ns overall branching latency on a Zynq 7010 as well (end of TTL input window -> AXI -> CPU -> AXI -> TTL). The same caveat applies, though; I'm pretty sure minimising that has not been a focus there either.
the current two-CPU system (non-realtime comms and management + real-time kernel) may not be doable
Why would it not be?
Zynq is not straightforward: 1) drivers for Ethernet, UART, etc. would need to be developed
The Xilinx drivers already exist (and are solid, if generally meh).
2) the compiler will have to produce ARM instructions
Trivial. The only interesting part would be the finer details of matching the C ABI chosen.
With a simple bare-metal C program that sets up an edge-triggered interrupt on a pin on the hard GPIO controller and mirrors the state to a pin bound to a simple AXI-mapped register, I've previously measured ~80 ns pin-to-pin latency on a Zynq 7010.
Ok, for this simple task, the soft CPU may not be that different. How many bus transfers per edge was that? The RTIO core needs quite a bit, plus a read after the event is posted to check the status (underflow, etc.) that incurs a full bus round trip.
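The point about bus round trips can be made concrete with a toy cost model. The transaction counts and per-transaction costs below are illustrative assumptions, not measurements of the actual RTIO core:

```python
# Toy cost model for posting one RTIO output event over the CPU bus.
# All numbers are assumed for illustration, not measured.
WRITE_NS = 10    # assumed cost of one posted bus write
READ_RT_NS = 50  # assumed full round-trip cost of one bus read

def event_cost_ns(n_csr_writes=5, n_status_reads=1):
    """Several CSR writes (timestamp, address, data, submit, ...) plus a
    status read-back to check for underflow, which costs a full round trip."""
    return n_csr_writes * WRITE_NS + n_status_reads * READ_RT_NS

print(event_cost_ns())  # 100 ns with these assumed numbers
```

The read-back dominates here, which is why reducing external bus transactions per event (tighter RTIO/CPU coupling, improved register layout) matters more than raw CPU clock speed.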
Zynq is not straightforward: 1) drivers for Ethernet, UART, etc. would need to be developed
The Xilinx drivers already exist (and are solid, if generally meh).
Meh, yes, and will they integrate well with the rest of the code?
the compiler will have to produce ARM instructions
Trivial. The only interesting part would be the finer details of matching the C ABI chosen.
Yes, that plus the usual collection of bugs and other problems that manifest themselves every single time you use software (the ARTIQ compiler, LLVM, and the unwinder) in a way it has not been used before.
@cjbe @dtcallcock I would first like to see a diagnosis and a profile of what is actually slow and why. This renews our request from a few years ago to see test cases and actual code. This does not mean that the improvements above are bad or unneeded. It's just to ensure (and do so in a CI fashion) that there are no bugs/obvious fixes that would improve things.
With little effort I had gotten around 120 ns (IIRC) of TTL round-trip latency with my old ventilator code which did hard timestamping. I have no idea how much tweaking the ETH guys applied and whether this was actually RTIO-like. They don't seem to publish their code.
If you want to use Zynq some time in the future, I am preparing hardware which is essentially a Sayma AMC but with a Zynq US+ chip. It will have a second FMC instead of the SFPs, but up to 4 SFPs can be installed on the FMC. I will keep RTM compatibility with Sayma AMC. This board will be used for another project related to video processing, but it can be used for ARTIQ as well.
Closing this issue as: adding Zynq etc. to Sayma/Metlino is impractical at this point; and many of the above concerns should be dealt with by DMA...
What is the current thinking on putting hard cores on the Metlino/Sayma/etc.?