sinara-hw / sinara

Sayma AMC/RTM issue tracker

Hard cores #47

Closed hartytp closed 7 years ago

hartytp commented 7 years ago

What is the current thinking on putting hard cores on the Metlino/Sayma/etc.?

jordens commented 7 years ago

Do you mean those aluminum core PCBs or do you mean proprietary IP blocks in the gateware/silicon?

hartytp commented 7 years ago

The latter -- although I was thinking more along the lines of an external ARM core connected to the FPGA (IIRC, this was discussed previously).

jordens commented 7 years ago

This is m-labs/artiq#535. There are numerous issues that make "putting hard cores" onto those boards very difficult. An incomplete list:

a) Everybody says "I want an ARM core" but we've always had problems figuring out exactly why or how they want to use it.
b) They have high latencies.
c) Zynq-style silicon also takes away ethernet, RAM etc. from the fabric. That sabotages our design.
d) It's unclear whether our dual CPU architecture translates at all.
e) It's too late in the process to incorporate it.
f) Gateware accelerated solutions for those DSP applications are likely better solutions for the problems at hand.

But yeah. If somebody still wants them, we are open to discussing the details.

sbourdeauducq commented 7 years ago

m-labs/artiq#535 is about adding floating point instructions to the gateware mor1kx.

As for putting hard CPU cores on the Sinara hardware, they would go on an FMC with something like this: http://www.4dsp.com/FMC645.php

But yes, someone has to figure out how to interact with it and what it is supposed to do.

cjbe commented 7 years ago

@jordens To answer (a), here is the problem I am worried about - if there is a solution that does not involve a hard core, I am equally happy with that.

Many of the experiments I am writing at the moment are running into pulse rate limitations. For release 2.0 the minimum sustained pulse period for a single TTL (as per the PulseRate test) is ~600 ns. This means that by the time one has a little more control flow in the loop and uses 5-10 TTL channels, the minimum loop execution time is ~10 us.
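For context, a PulseRate-style measurement looks roughly like the following on the kernel side (a sketch, not the actual test code; `ttl0` is an assumed device name, and the loop count and delay are arbitrary). Each edge is a separate RTIO output event that the soft CPU has to submit, which is what sets the ~600 ns floor:

```python
from artiq.experiment import *


class PulseRateSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl0")   # assumed device name

    @kernel
    def run(self):
        self.core.reset()
        for i in range(10000):
            # each pulse is two RTIO output events (rising and falling edge),
            # both submitted by the soft CPU
            self.ttl0.pulse(100*ns)
            delay(500*ns)             # ~600 ns sustained period per pulse
```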

To give an example of a problem we have in the lab right now: we want to spin in a tight loop of state-prep, excite, and branch if there was a detection event, otherwise repeat. This involves ~10 TTL signals. Our apparatus allows us to repeat this loop with a period of ~1 us, but we are limited to >~10 us by ARTIQ.
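For illustration, such a loop might look like the following in ARTIQ (a sketch only, using the present-day `gate_rising`/`count` API; the device names `ttl_prep`, `ttl_excite`, `ttl_pmt` and all durations are assumed, and the real sequence uses ~10 channels). The blocking `count()` call is where the branch happens, and the trailing padding delay is what currently pushes the loop period towards ~10 us:

```python
from artiq.experiment import *


class PrepExciteDetect(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl_prep")     # assumed: state-preparation output
        self.setattr_device("ttl_excite")   # assumed: excitation output
        self.setattr_device("ttl_pmt")      # assumed: PMT input

    @kernel
    def run(self):
        self.core.reset()
        detected = False
        while not detected:
            self.ttl_prep.pulse(300*ns)                   # state preparation
            self.ttl_excite.pulse(100*ns)                 # excitation attempt
            gate_end = self.ttl_pmt.gate_rising(200*ns)   # detection window
            detected = self.ttl_pmt.count(gate_end) > 0   # branch on detection
            # padding so the CPU stays ahead of the RTIO timeline;
            # in practice this is what limits the loop to >~10 us
            delay(5*us)
```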

Even for experiments that do not involve frequent branching, and hence allow the processor to fill up the FIFOs during the slack times, we often run into underflow errors which require padding delays.

These problems are only going to get worse as experiments get more complicated, and require more branching, with more complicated decisions at the branch.

My understanding (correct me if I am wrong) is that we can push the soft-core from the current 125 MHz to perhaps 200 or 250 MHz, but no further - to go faster than this we would need a hard CPU.

Suggestions?

gkasprow commented 7 years ago

This FMC DSP module requires an HPC carrier and has either an EMIF or a gigabit SRIO interface.

For deterministic access, EMIF seems to be the better choice, but to integrate this module with Sayma we would need a detailed datasheet to check which pins are used for EMIF.

On the other hand, we have experience with FPGA-DSP communication over SRIO, but that requires transceivers, which we don't have connected to the FMC.

sbourdeauducq commented 7 years ago

The DSP card is just an example of a CPU mounted on an FMC; I'm not proposing this particular one.

jordens commented 7 years ago

@hartytp

gkasprow commented 7 years ago

One could use an ARM CPU with large on-chip SRAM, which should be more deterministic than external SDRAM, provided that 10 MB of memory is enough.

https://www.renesas.com/en-eu/products/microcontrollers-microprocessors/rz/rza/rza1h.html

dtcallcock commented 7 years ago

@jordens

Are there ways ARTIQ performance could be improved (via changes to gateware or additional hardware) other than through DMA or application-specific gateware acceleration? I'm not saying there's a need for it or that it's a sensible thing to do, I just want a sense of what it'd involve.

jordens commented 7 years ago

AFAICS there are a bunch of different performance areas with different solutions:

  1. Raw event rate. Best approach would be DMA (orders of magnitude), some other kind of tailored gateware support (order of magnitude), or an improved register layout and tweaked event submission code (factors of a few).
  2. Floating point math. Adding an FPU, or using/adding a hard CPU. Could give 1-2 orders of magnitude speed-up if the interface doesn't eat up the gain.
  3. Integer math. Gateware support or faster/external CPU if the interface doesn't eat up the gain.
  4. Actual event round trip latency (if you want to react to something). Better register layout, maybe faster CPU if latency or bus crossings don't eat up all the advantage. This will get worse with DRTIO by maybe 100-200ns per layer. The only other way around that is local feedback with gateware support.
  5. Experiment/kernel startup. That's the time it takes to spawn a python process or compile the kernel. Improvements could be in the linker or better pooling/reuse of workers and caching of kernels on the host or on the core device.

sbourdeauducq commented 7 years ago

Potentially, there is also the option of a tighter coupling of the RTIO core with the CPU, so it doesn't have to program all the RTIO CSRs through external bus transactions at every event.

jordens commented 7 years ago

Yes. That's what I meant with "improved register layout and tweaked event submission code".

cjbe commented 7 years ago

@jordens The issues I am primarily worried about are:

  1. Raw event rate: I am happy with using DMA to fix the occasional tight part of a sequence. However I am already having to add padding delays all over the place in my current experiments to fix the timing. I definitely do not want to end up using DMA for everything - this is conceptually messy, and pushes more work to the user to manage the flow.

  2. Maths - I agree that this could be handled by an external processor / FPU, modulo the added complexity for the user of marking up where different bits of code need to run.

  3. Reaction latency: A lot of this is in gateware, which can be improved. However, for all but the simplest use cases one needs to do some maths, which hurts currently.

I understand that the difficulty with a hard CPU is getting a low latency / high bandwidth interface, which pushes us to either an external co-processor (with potentially high latency) or a FPGA with a hard core (e.g. Zynq).

It seems like there are firm advantages to using an FPGA with a hard core (low latency, decent maths performance, no need for nasty 'offload CPU' complexity for the user). Are there good technical reasons, apart from a general dislike for closed-source black boxes, to not strongly consider e.g. a Zynq?

This obviously involves significant changes to the gateware, but it feels that this is not grossly different from the effort required to write a soft FPU, or write a mechanism to pass jobs to an offload CPU.

What am I missing?

dtcallcock commented 7 years ago

@jordens Would you be willing to write a for-contract issue over on the artiq repository for "improved register layout and tweaked event submission code"? I agree with cjbe that it'd be nice not to have to use DMA everywhere to get a raw event rate out of artiq that's adequate for the bulk of experiments. It's not clear whether 'factors of a few' will be enough but I feel it might be so it seems worth exploring.

sbourdeauducq commented 7 years ago

I definitely do not want to end up using DMA for everything - this is conceptually messy, and pushes more work to the user to manage the flow.

The last point can be improved if the compiler extracts and pre-computes DMA sequences, either automatically or with another context manager that you simply wrap around the sequence (no manual/explicit preparation and replaying of DMA segments).
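For reference, the explicit workflow that this would hide looks roughly like the following with the existing `core_dma` device (a sketch; `ttl0` and the recorded sequence are assumed). The idea above is for the compiler, or a wrapper context manager, to take care of the record/get_handle/playback steps automatically:

```python
from artiq.experiment import *


class DMAPulses(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("core_dma")
        self.setattr_device("ttl0")   # assumed device name

    @kernel
    def record(self):
        with self.core_dma.record("pulses"):
            # events are recorded into core device memory, not played back here
            for i in range(100):
                self.ttl0.pulse(100*ns)
                delay(100*ns)

    @kernel
    def run(self):
        self.core.reset()
        self.record()                                   # explicit preparation...
        handle = self.core_dma.get_handle("pulses")
        self.core.break_realtime()
        self.core_dma.playback_handle(handle)           # ...and explicit replay
```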

Are there good technical reasons, apart from a general dislike for closed-source black boxes, to not strongly consider e.g. a Zynq?

As mentioned or hinted above:

which pushes us to either an external co-processor (with potentially high latency)

If the external processor has a good (synchronous, high bandwidth, low latency) external bus interface, it could potentially run the kernels better than a Zynq core does.

dnadlinger commented 7 years ago

Low-latency is unclear

With a simple bare-metal C program that sets up an edge-triggered interrupt on a pin on the hard GPIO controller and mirrors the state to a pin bound to a simple AXI-mapped register, I've previously measured ~80 ns pin-to-pin latency on a Zynq 7010. This was in a different setting without any care taken to optimise the code, but it might be useful as an upper bound.

I believe ETH are getting 70-80 ns overall branching latency on a Zynq 7010 as well (end of TTL input window -> AXI -> CPU -> AXI -> TTL). The same caveat applies, though; I'm pretty sure minimising that has not been a focus there either.

the current two-CPU system (non-realtime comms and management + real-time kernel) may not be doable

Why would it not be?

Zynq is not straightforward: 1) drivers for Ethernet, UART, etc. would need to be developed

The Xilinx drivers already exist (and are solid, if generally meh).

2) the compiler will have to produce ARM instructions

Trivial. The only interesting part would be the finer details of matching the C ABI chosen.

sbourdeauducq commented 7 years ago

With a simple bare-metal C program that sets up an edge-triggered interrupt on a pin on the hard GPIO controller and mirrors the state to a pin bound to a simple AXI-mapped register, I've previously measured ~80 ns pin-to-pin latency on a Zynq 7010.

Ok, for this simple task, the soft CPU may not be that different. How many bus transfers per edge was that? The RTIO core needs quite a few, plus a read after the event is posted to check the status (underflow, etc.), which incurs a full bus round trip.

Zynq is not straightforward: 1) drivers for Ethernet, UART, etc. would need to be developed

The Xilinx drivers already exist (and are solid, if generally meh).

Meh, yes, and will they integrate well with the rest of the code?

the compiler will have to produce ARM instructions

Trivial. The only interesting part would be the finer details of matching the C ABI chosen.

Yes, that plus the usual collection of bugs and other problems that manifest themselves every single time you use software (the ARTIQ compiler, LLVM, and the unwinder) in a way it has not been used before.

jordens commented 7 years ago

@cjbe @dtcallcock I would first like to see a diagnosis and a profile of what is actually slow and why. This renews our request from a few years ago to see test cases and actual code. This does not mean that the improvements above are not good or not needed. It's just to ensure (and do so in a CI fashion) that there are no bugs/obvious fixes that would improve things.

With little effort I had gotten around 120 ns (IIRC) of TTL round-trip latency with my old ventilator code which did hard timestamping. I have no idea how much tweaking the ETH guys applied and whether this was actually RTIO-like. They don't seem to publish their code.

gkasprow commented 7 years ago

If you want to use Zynq some time in the future: I am preparing HW which is essentially Sayma AMC but with a Zynq US+ chip. It will have a second FMC instead of the SFPs, but up to 4 SFPs can be installed on the FMC. I will keep RTM compatibility with Sayma AMC. This board will be used for another project related to video processing, but it can be used for ARTIQ as well.

hartytp commented 7 years ago

Closing this issue: adding a Zynq etc. to Sayma/Metlino is impractical at this point, and many of the above concerns should be dealt with by DMA...