improved register layout and tweaked event submission code

jordens commented 7 years ago

(from m-labs/sinara#47)

Potentially, there is also the option of a tighter coupling of the RTIO core with the CPU, so it doesn't have to program all the RTIO CSRs through external bus transactions at every event.
fewer registers (merge address and data, merge address and write_enable)
~delta-timestamp register~
CSR/memory-pinning ~or register-pinning~ of now
maybe event-generating gateware that makes "full" events from some channel-class dependent mmapped space
~maybe IRQs can be used to handle error conditions and perform submission retrials~
use bus wait states and bus errors to signal RTIO exceptions

sbourdeauducq commented 7 years ago

fewer registers (merge address and data)

We can merge address and write enable instead.

sbourdeauducq commented 7 years ago

maybe IRQs can be used to handle error conditions and perform submission retrials

No, because that will lose precise RTIO exceptions or create horrible race conditions. But we can use bus wait states and bus errors.

jordens commented 7 years ago

We can merge address and write enable instead.

We might be able to do both getting rid of two CSR writes in many cases:

merge address and data and let the event replacement happen (at the RTIO FIFO input side) using a channel-dependent bit mask.
merge write-enable into the channel number.

But we can use bus wait states and bus errors.

Good!

sbourdeauducq commented 7 years ago

Doing that to the channel register is difficult because DRTIO uses it to look up first the state of the remote channel, and test for underflow, FIFO full, etc. And it is also used for muxing status between multiple (D)RTIO cores, so if the gateware redoes the transaction with the channel first, you'll have to block until the full transaction is completed so you get a valid status. There would still be a speed-up though as the gateware can do the transaction in a few cycles.

sbourdeauducq commented 7 years ago

using a channel-dependent bit mask.

Is the bit-twiddling that will be required to compute the data faster than one register write? Note that we can have a second RTIO output function that writes to address zero by not doing the address CSR write.

jordens commented 7 years ago

Does it do that channel status lookup early to be in parallel with the CPU doing the address/data writes? Pretty sure that the bit twiddling is faster. But it also pays to have fewer arguments being passed up and down the API. And it saves DMA and DRTIO size increasing throughput. Having separate rtio backend functions hurts caching and code size/maintainability. The other thing we could do is merge that address into the channel number. Then the (compound) channel number would be {u8 core, u16 channel, u8 address}, maybe with permutations. AFAICT we will also want a "length" field for the data in the RTIO API so that the DRTIO master and the DMA storage engine can truncate. But maybe it can just count the number of data writes.

sbourdeauducq commented 7 years ago

Merging the address into the channel sounds OK.

sbourdeauducq commented 7 years ago

The DMA playback engine takes LSB-first data of arbitrary length with byte granularity and zeros the missing MSBs, and DRTIO similarly removes zeros in front of data. There is no gateware DMA storage engine and I don't know why you want one.

The new RTIO writes could be done this way, with a total of 4 bus write accesses (instead of the current 6) if the data is small:

channel with integrated address
timestamp (2 bus accesses as it's 64-bit)
data triggering the RTIO write when the LSB CSR is accessed

(plus status readout)

jordens commented 7 years ago

The automatic zero-stripping/extending sounds good. I'd think that there would be a DMA storage engine for either DMA input events or for remote DMA.

whitequark commented 7 years ago

register-pinning of now

If I recall correctly, I've tried this and it made no practical difference.

jordens commented 7 years ago

Since event submission (i.e. writing the now to CSR) seems to be as frequent if not more frequent than manipulation of now, why don't we do "CSR pinning of now", i.e. to the timestamp CSR of the RTIO interface. This may be problematic for DRTIO (it might be using a separate CSR) but that could probably be changed.

sbourdeauducq commented 7 years ago

DRTIO is not using a separate CSR for the timestamp. The (D)RTIO cores are demuxed after the CSRs, see cri.py.

whitequark commented 7 years ago

CSR pinning of now sounds good and easy to implement (just pin the global to the address of CSR instead of having a global in ksupport).

sbourdeauducq commented 6 years ago

The other thing we could do is merge that address into the channel number. Then the (compound) channel number would be {u8 core, u16 channel, u8 address}

@jordens Do you confirm that a 8-bit address is still sufficient for SAWG and all foreseen needs?

sbourdeauducq commented 6 years ago

@whitequark What do you think of now pinning to the CSR with atomicity implemented in gateware by committing the value when the 32 least significant bits are written?

jordens commented 6 years ago

I don't see anything that would need more than 8 bit address space.

sbourdeauducq commented 6 years ago

use bus wait states and bus errors to signal RTIO exceptions

Are bus errors working correctly with mor1kx and do they result in exceptions that can be traced?

whitequark commented 6 years ago

IIRC, bus errors work properly but only on writes, on reads error cycles are silently ignored.

On March 29, 2018 11:41:12 PM GMT+08:00, "Sébastien Bourdeauducq" notifications@github.com wrote:

use bus wait states and bus errors to signal RTIO exceptions

Are bus errors working correctly with mor1kx and do they result in exceptions that can be traced?

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/m-labs/artiq/issues/636#issuecomment-377277444

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

sbourdeauducq commented 6 years ago

OK, that would be acceptable.

whitequark commented 6 years ago

This needs to be rechecked before committing to this implementation though, it was a while ago when I looked at it.

sbourdeauducq commented 6 years ago

Current plan:

pin now to the CSR with atomicity implemented in gateware (commit when one given half-word is written, needs LLVM support).
this also takes care of #950 and obsoletes the now save/restore mechanism.
note: modifiying now will clear the CRI master o_data register (see comments in cri.py), this should not cause any problem.
merge address with core/channel to make a single 32-bit word {u8 core, u16 channel, u8 address}; update ARTIQ-Python drivers (TTL, SPI, SAWG) accordingly.
trigger RTIO writes when the LSB of the data CSR is written (remove the we register).
(optionally) use the Wishbone wait state for flow control. Wishbone does not support aborting cycles, so the kernel CPU reset will have to be also connected to the CRI master to end the Wishbone transaction if it is in the wait state.
(optionally) use Wishbone bus errors to signal underflows and link errors. This depends on their correct implementation in mor1kx which should be tested in advance. The bus error exception handler should be able to raise the corresponding exceptions in ARTIQ-Python, and include the correct data (channel, timestamp, slack, code location). Channel/timestamp can probably be pulled from the CSRs.

whitequark commented 6 years ago

pin now to the CSR with atomicity implemented in gateware (commit when one given half-word is written, needs LLVM support).

Are you sure about this? Without nontrivial work it'll result in redundant reads/writes of now.

sbourdeauducq commented 6 years ago

How bad is the redundancy? Typically a now modification is followed by a RTIO write.

whitequark commented 6 years ago

On second thought, if we pin now in the linker and simply rely on whatever word write order LLVM already uses for i64 (it's consistent), then we won't lose anything. More importantly, this won't affect code motion across now, which is important for things like lowering FP delay operations to integer delay_mu ones.

sbourdeauducq commented 6 years ago

Also we do not need the now CSR to be always up to date. It must be up-to-date only when a RTIO write or read is done; the rest of the time, containing some valid value (i.e. one that has been set at some point by the user) that is atomically updated is sufficient.

sbourdeauducq commented 6 years ago

simply rely on whatever word write order LLVM already uses for i64 (it's consistent),

And LLVM will not go mad if the memory location doesn't behave like normal memory - the two locations are only updated at once when one of them is written?

whitequark commented 6 years ago

If we never treat that location as anything other than i64* then LLVM will never write only one half of it. Of course, if you manually cast either half to an i32* (which is only possible in Rust) without marking it as volatile then you will get a "miscompilation". The solution is to, well, not do that.

dhslichter commented 6 years ago

@jordens just to confirm, the 8-bit address limitation will mean we get max 256 channels on any given DDS or SPI bus? Does address correspond to branches for SAWG? I just want to confirm the meanings of core (indicating DRTIO device - e.g. a single Kasli or Sayma), channel (a single RTIO channel on that core, e.g. a TTL or a DDS bus or an SPI or a SAWG channel), and address (n/a for TTL, bus address for DDS bus or SPI, branch for SAWG?).

It would be really helpful to have at least basic documentation somewhere (e.g. on the wiki) for the (D)RTIO architecture (including the interface to wishbone and mor1kx), in broad strokes.

@sbourdeauducq with the proposal you have given above in https://github.com/m-labs/artiq/issues/636#issuecomment-377283395, what would the anticipated improvement in sustained RTIO event throughput be? Previously factors of "a few" have been quoted, which I take to mean 2 or 3 -- does this still sound accurate? @whitequark stated that he doesn't think the CSR pinning of now would have much effect, and in your comment https://github.com/m-labs/artiq/issues/636#issuecomment-265483788 it seems like we'd get a factor of 1.5 in bus transaction time (4 instead of 6), but that doesn't equate to a full factor of 1.5 in the overall time because of the mor1kx overhead.

sbourdeauducq commented 6 years ago

@jordens just to confirm, the 8-bit address limitation will mean we get max 256 channels on any given DDS or SPI bus?

No, the address is internal to the RTIO PHY; for example for TTL PHYs there is one address to set the direction and one to set the level when the direction is output, for SPI PHYs there is one address for the SPI data and another one for configuring the bus, etc. There is no direct mapping between the RTIO addresses and the number of devices downstream the PHY.

@sbourdeauducq with the proposal you have given above in #636 (comment), what would the anticipated improvement in sustained RTIO event throughput be?

With the new scheme, a RTIO transaction would boil down to:

generate channel/address (can be cached), timestamp and data in the kernel.
call the rtio_output function.
write the combined channel/address and data.

Currently it is:

generate channel, address (separately), timestamp and data in the kernel.
call rtio_output.
write the channel, address, timestamp, and data.
write the we register.
read and check the status register.

The linker pinning to the CSR will turn a cacheable memory location (accessible in 2 cycles when cached) into a non-cacheable location that is backed by FPGA registers, accessible in exactly 4 cycles at all times. And it will remove the separate register write and one argument of rtio_output.

For an extremely rough estimate, let's neglect the effect of this, and assume a cost of 1 for each basic operation (generate a value, call a function, write a register etc.). The cost would go down to 6, from 11.

dhslichter commented 6 years ago

@sbourdeauducq ack, thanks.

whitequark commented 6 years ago

@sbourdeauducq What do you think about mapping the RTIO registers to OR1K SPRs? I've just thought of a good way to expose those to LLVM; we can add a new address space so that accesses to SPRs are just pointer accesses and most optimizations still apply to them.

sbourdeauducq commented 6 years ago

Why does that help, single-cycle access? What kind of interface would that have? Wouldn't that cause problems with now pinning? Wouldn't that cause timing issues in the FPGA? Also that contributes to breaking compatibility with other CPUs. Is it really worth it?

whitequark commented 6 years ago

Why does that help, single-cycle access?

Yes.

What kind of interface would that have?

Interface where? To software? More or less the same, just use mfspr/mtspr in Rust and slightly different codegen in ARTIQ Python.

Wouldn't that cause problems with now pinning?

Quite the opposite, it will eliminate some inefficiency from now pinning (if pinned via the linker all accesses to now will still have to go via the GOT offset, which adds a few setup instructions and increases register pressure).

Wouldn't that cause timing issues in the FPGA?

I think you have more insight into this than me at this moment.

Also that contributes to breaking compatibility with other CPUs.

All mature soft CPUs I'm aware of have some sort of coprocessor interfaces.

Is it really worth it?

Well, how much do we want to shave cycles off RTIO submission time?

sbourdeauducq commented 6 years ago

Interface to gateware. How many of those registers can we have anyway? The RTIO data register is quite large.

sbourdeauducq commented 6 years ago

Well, how much do we want to shave cycles off RTIO submission time?

That would shave 4 cycles (32ns) and seems quite complicated to do.

dhslichter commented 6 years ago

Also that contributes to breaking compatibility with other CPUs. Is it really worth it?

It seems to me that ARTIQ is pretty heavily invested in mor1kx, and that changing to a different CPU (even if there were a better option, which it seems at present there isn't) would involve a great deal of work anyway.

whitequark commented 6 years ago

Interface to gateware.

   wire [OPTION_OPERAND_WIDTH-1:0] spr_bus_dat_o;
   wire         spr_bus_stb_o;
   wire         spr_bus_we_o;
   wire [15:0]      spr_sr_o;

How many of those registers can we have anyway? The RTIO data register is quite large.

At least 32K.

dhslichter commented 6 years ago

Well, how much do we want to shave cycles off RTIO submission time?

In my experience this is one of the main pain points for ARTIQ users, and whatever improvements we can squeeze out in a reasonable fashion are really welcome for all users.

whitequark commented 6 years ago

@dhslichter Something you can do to advance this is submit realistic benchmark code that I can profile. There might be inefficiencies that I don't currently expect in different parts of the stack.

dhslichter commented 6 years ago

@sbourdeauducq I assume that the proposals for shaving time off RTIO submissions would also shave time off RTIO retrieval (i.e. from reading timestamps from an input FIFO). Is this accurate?

dhslichter commented 6 years ago

@whitequark ack, I will send along some code for sideband cooling to give a sense of what we're up against, and to give you something to benchmark with.

sbourdeauducq commented 6 years ago

I assume that the proposals for shaving time off RTIO submissions would also shave time off RTIO retrieval (i.e. from reading timestamps from an input FIFO). Is this accurate?

Not really, apart from now pinning this only applies to outputs.

It seems to me that ARTIQ is pretty heavily invested in mor1kx, and that changing to a different CPU (even if there were a better option, which it seems at present there isn't) would involve a great deal of work anyway.

Not that much, it's basically a bit of system code, the exception handling/unwinder, and dealing with rustc/LLVM breakage that I cannot imagine will miss the opportunity to manifest itself. A lot of things are portable.

@whitequark instead of SPRs, we can maybe do Wishbone combinatorial feedback for writes; would that work (i.e. no mor1kx bugs, and single-cycle performance) and meet timing? And does it actually improve things - with the write buffer, does the 2-cycle access have an impact on performance at all?

And it's just 32ns at most...

dhslichter commented 6 years ago

@sbourdeauducq are there modifications that would help with input speeds as well?

Ack on the CPU portability comments.

whitequark commented 6 years ago

Not that much, it's basically a bit of system code, the exception handling/unwinder, and dealing with rustc/LLVM breakage that I cannot imagine will miss the opportunity to manifest itself. A lot of things are portable.

I don't imagine we're going to migrate to LM32 at this point because of the time investment for the LLVM backend and more importantly the lack of features in the CPU core, so the only option is RISC-V. I've just looked at RISC-V status in LLVM and it's somewhat underwhelming compared to what I expected.

As far as I can tell they have a decently working LLVM backend but very little of it is upstream (10/84 patches), so there's no real advantage over our current OR1K LLVM backend. I cannot imagine the situation with rustc is any better, though rustc needs only a tiny amount of architecture-specific code (it's something around 100 LOC for OR1K, I think).

The EH/unwinder changes are trivial (especially after I made them once for OR1K) so that's not a factor.

Apart from what you listed, what else should be handled is linking, but that's not major either.

To summarize, whether we should switch to RISC-V should be informed exclusively by what we win by using a different CPU softcore, since that's the only thing that will matter in practice, at least until RISC-V is fully upstream and widely used, which isn't going to happen until the end of 2018 or even 2019.

@sbourdeauducq Worth a try.

sbourdeauducq commented 6 years ago

For inputs, in addition to now pinning, we can shave 1 programming register access, plus another one and (in the fast code path) some tests by using Wishbone bus wait/error states.

dhslichter commented 6 years ago

@sbourdeauducq ack, thanks.

sbourdeauducq commented 6 years ago

(optionally) use the Wishbone wait state for flow control. (optionally) use Wishbone bus errors to signal underflows and link errors.

Currently, RTIO CSR writes go through the write buffer, and each <=32-bit write takes 2 cycles to complete, which (thanks to the write buffer) is done in parallel with the CPU preparing the next operation.

The problem with this scheme is, write error exceptions become asynchronous (raised some instructions after the write), so race conditions can occur. For example, a RTIOUnderflow can potentially be raised after the software has made a decision based on the incorrect information that no underflow has occured.

So to implement this, the write buffer would have to be disabled for RTIO CSRs. This raises two difficult issues:

Adding the write buffer bypass feature to mor1kx without obscure bugs.
The CSR interface has to complete writes in 1 cycle (i.e. Wishbone combinatorial feedback) to gain back the performance that is lost to not being able to background the 2-cycle writes anymore. It also has to be able to complete them in an arbitrarily large number of cycles. This makes meeting timing in the FPGA challenging, especially on the slow Artix-7 used in Kasli, which is already a pain (https://github.com/m-labs/artiq/issues/891).

I'm not sure if saving 1 CSR read and 1 test (dozens ns) is worth going through those difficulties.

sbourdeauducq commented 6 years ago

Maybe the best solution is to dig into the CPU pipeline and implement custom instructions for RTIO operations. But that's significantly more complicated than the other optimizations proposed here.

sbourdeauducq commented 6 years ago

Such instructions, on the other hand, could be pretty fast. With optimized assembly (the production of which by the compiler is another issue), producing a square waveform on a TTL would be something like (@whitequark correct me if I am wrong):

increment 64-bit timestamp: 2 cycles
invert level (RTIO data): 1 cycle
wait for the data to be available in the CPU pipeline: 3 cycles
RTIO write: 3 cycles
loop: 3 cycles

So, a total of ~96ns per TTL state change.

dnadlinger commented 6 years ago

So to implement this, the write buffer would have to be disabled for RTIO CSRs. This raises two difficult issues: […]

You could emit an msync after the stores for which you need precise exceptions to get around this, but I haven't checked whether that would make sense performance-wise.

m-labs / artiq

improved register layout and tweaked event submission code #636