Atomic CSR read-write difficult to implement efficiently

wdc-pnl commented 5 years ago

Our experience in designing the SweRV core has found that atomic read-write CSR instructions are inefficient, because their two destinations creates trouble with the pipeline design.

We would like to see a new "Z" extension that includes solely those CSR instructions covered by the pseudo-instructions: CSRR, CSRW, CSRWI, CSRS, CSRSI, CSRC and CSRCI.

Why does software have to see the old value when it is modifying the value of a CSR? Could the CSR instructions not covered by the above pseudo-instructions be deprecated?

aswaterman commented 5 years ago

The software rationale is that the instructions, as specified, are atomic with respect to interrupts and other events that can change the CSR's value. Splitting it up into the constituent parts would destroy the atomicity property.

The microarchitectural rationale is that having two destinations does not create a structural hazard, since the CSRs are (almost necessarily) a different bank of registers from the x-registers. Most of the designs I have worked on implement the CSR read-modify-write sequence in a separate unit at the pipeline's commit stage, so there is little interaction with the design of the integer pipeline.

jhauser-us commented 5 years ago

Why does software have to see the old value when it is modifying the value of a CSR?

Trap handlers currently depend on being able to swap a *scratch CSR with a general-purpose x register in one instruction.

Aside from that, it is sometimes convenient (though not an imperative) to be able to read the old value to save it for later restoring, at the same time one is changing a CSR.

You may have a valid point about the implementation impact (I can't say for sure), but the current spec for CSRs is probably too entrenched now to change in this way, regardless.

robertgolla commented 5 years ago

The problem SWERV had with the current RISCV CSR's is that, in many cases, the CSR's have 2 destinations. One gpr and one csr. No other RISCV instruction has 2 logical destinations (at least that i know of). The bypass logic gets much more complicated with 2 destinations versus only have 1 destination. In the future we're going to OOO with renaming and this instruction is not going to work out well with 2 destinatiassons. We're not going to 2X the write ports into reservation station, physical register file, bypasses, etc.

For SWERV, we ended up "unpiping" these instructions so that the bypass logic was not complicated. We could have also micro-coded it across the 2 pipes (it's a 2 issue machine) but that's not great either.

Atomicity of the CSR (read-modify-write) with respect to interrupts is not an issue. Either instruction happens or it doesn't.

I get that you can "swap" the a0 and mscratch register in the trap handler. But is it really worth it to burden all future HW generations with 2 destinations just so you could do a clever swap in 1 instruction? Why not define 2 scratch registers to setup the scratch space in memory that you want?

mscratch0 has pointer to temp storage. mscratch1 is general use.

mov mscratch1, a0 mov a0, mscratch0

sw a1, 0(a0) sw a2, 4(a0) sw a3 8(a0) sw a4 12(a0)

< body of interrupt service routine >

mov a0, mscratch1

Is it really too late to have an CSR extension that drops the instructions that write 2 destinations?

We'd be happy with that and we could write our trap handlers accordingly.

btw, future cores are likely to implement the 2 dest CSR instructions with long latencies (unpiped since it can be handled in the normal way - still ugly cause you need unpipe logic).

We're currently 8 or 9 cycles for the 2 dest CSR case vs the 1 cycle for the single dest CSR case.

aswaterman commented 5 years ago

If you perform CSR reads in the same pipeline stage you perform writes, or if you stall until older CSR writes to complete, there's no need to bypass CSR writes. It sounds like you came to the same conclusion, and rather than calling it "not great", I'd call it a perfectly reasonable thing to do. The performance of CSR reads isn't a first-order concern, and removing bypass paths that don't contribute much to IPC is standard operating procedure.

By similar logic, it's sound to avoid complicating the register-renaming circuitry in OOO pipelines by performing CSR writes at commit, and waiting until no CSR writes are outstanding before executing CSR reads.

Atomicity is an issue, because with the reduced set of operations proposed, some things that used to take only one CSR access now take two, and an interrupt can occur in between.

In any case, JohnH is right to remark that "the current spec for CSRs is probably too entrenched now to change in this way." Lots of software is distributed in binary form that makes use of them. Minutiae in the privileged architecture would've been designed differently if these instructions weren't assumed to be present.

robertgolla commented 5 years ago

btw, when you say "atomicity" are you referring to the ability to do an atomic swap (gpr,csr) that is possible with these dual-dest CSR instructions? Or something else?

Presumably you could disable interrupts IE=0, do 2 simpler instructions, and then re-enable IE=1. Would that eliminate the atomicity problem w.r.t. interrupts? btw, these 4 instructions would still run faster than the complex dual-dest CSR. Especially as pipe gets deeper. unpiping is typically a presync / postsync operation given no bypasses. presync since data you need may be in flight. postsync since you write at commit.

btw, curious to know how the dual-dest CSR made it into the ISA? was it there from day 1 or did it get introduced at some point?

IMO, this dual-dest CSR stuff should be removed from the ISA. It really doesn't fit in with the rest of the RISCV instructions.

You sure it's too late?

:-)

aswaterman commented 5 years ago

Yeah, the disable-interrupts + non-atomic sequence + re-enable-interrupts sequence generally suffices. But often software doesn't know whether interrupts are currently enabled (e.g. because of nested locking), and even the sequence to disable/re-enable interrupts becomes more expensive if you don't have the full CSRRx instructions! :-( The sequence looks something like this:

csrrci t0, sstatus, 2 # disable interrupts and remember old interrupt-enable state in t0
... # critical section
csrw sstatus, t0      # restore interrupt-enable state

But without csrrci it becomes something like

csrr t0, sstatus
csrci sstatus, 2
... # critical section
andi t0, t0, 2   # can't simply use csrw sstatus, t0, because sstatus might
csrs sstatus, t0 # have been changed between instructions 1 and 2

You can make the case that even this is acceptable. But either way, the extra cruft starts to pile up, with an attendant increase in code size.

As to when they were introduced, the CSRRW instruction was added in July 2013 (under a different name), and the full battery of CSR instructions was added in a Sept. 2013 revision. So, not quite day one, but over five years ago. So... yes, I'm pretty sure :)

robertgolla commented 5 years ago

Ok so be it. I suspect for future high end RISCV cores you will see the dual dest CSR microcoded into 2 simpler micro ops. HW will guarantee atomicity with respect to interrupts. But all pipelined. Assuming no synchronization is needed for the CSR operation.

aswaterman commented 5 years ago

Yes, I think that is plausible - at least for the scratch CSRs, which are attractive to rename.

riscv / riscv-isa-manual

Atomic CSR read-write difficult to implement efficiently #304