Faster stack clearing - Githubissues

microsoft / cheriot-ibex

cheriot-ibex is an RTL implementation of the CHERIoT ISA based on lowRISC's Ibex core.

Apache License 2.0

Faster stack clearing #5

Open davidchisnall opened 1 year ago

davidchisnall commented 1 year ago

Writing up the discussion and adding a few more thoughts:

We zero a chunk of the stack on every call and return from cross-compartment calls. It would be nice to have a state machine that zeroes a range of memory, starting at the top and moving downwards. This will have a top and a bottom, where the top is moved downwards on each store. If the main pipeline loads between the top and the bottom, it should read zeroes. If the main pipeline stores between the top and the bottom, it should stall until the top has moved past the location of the store.
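The behaviour described above can be sketched as a small software model. This is only an illustration, not the RTL: the class name `StackZeroer`, the 4-byte store width, and the dictionary used as memory are all assumptions made for the example, and the range is assumed to be word-aligned.

```python
class StackZeroer:
    """Toy behavioural model of a background stack-zeroing engine."""

    WORD = 4  # assumed store width in bytes; the range is assumed word-aligned

    def __init__(self, memory):
        self.memory = memory  # dict mapping word-aligned address -> value
        self.top = 0          # exclusive upper end of the range still to zero
        self.bottom = 0       # inclusive lower end of the range still to zero

    def start(self, top, bottom):
        """Begin zeroing [bottom, top), working downwards from top."""
        self.top, self.bottom = top, bottom

    def pending(self, addr):
        """True if addr lies in the part of the range not yet zeroed."""
        return self.bottom <= addr < self.top

    def step(self):
        """One background store: zero the word below top and lower top."""
        if self.top > self.bottom:
            self.top -= self.WORD
            self.memory[self.top] = 0

    def load(self, addr):
        """Loads from the pending range read as zero without waiting."""
        return 0 if self.pending(addr) else self.memory.get(addr, 0)

    def store(self, addr, value):
        """Stores stall until top has moved past addr; returns stall cycles."""
        stalls = 0
        while self.pending(addr):
            self.step()
            stalls += 1
        self.memory[addr] = value
        return stalls


mem = {0x100: 7, 0x104: 8, 0x108: 9}
z = StackZeroer(mem)
z.start(top=0x10C, bottom=0x100)
stale = z.load(0x104)       # still pending, so the load observes zero
stalls = z.store(0x108, 5)  # waits for the one background store above it
```

In this model a store near the top of the range waits only briefly, while a store near the bottom waits for the whole range, which is why zeroing downwards from the stack pointer suits a callee that starts using the stack from the top.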

In the common case, this state machine would zero enough of the stack in the background that the next function to run would not block.

If we context switch (take an interrupt) then we need to be able to pause this zeroing and resume it later.

Ideally, we'd integrate this with the stack high watermark control. In normal operation, we will:

Begin zeroing from the stack pointer location down to the current stack high-water mark.
Reset the stack high-water mark to the current stack pointer.

Can we combine these operations so that moving the stack high-water mark is sufficient to start the revocation, using the current $csp as the authorising capability?

kliuMsft commented 1 year ago

Interrupts do make things a little trickier. Before, I was thinking that we would simply stall the instruction in the EX stage if it accesses a stack area not yet zeroed. With interrupts, we probably have to abort the instruction and restart it later. That's a bit different from the current interrupt semantics (wait for the current instruction to finish and then take the interrupt). Anyhow, it should still be doable, but I need to spend a little more time thinking it through carefully.

Also - can we rely on firmware (the switcher) to remember the stop pointer? Doing it in hardware might be problematic, especially if there are nested interrupts (do we actually allow that?). If

davidchisnall commented 1 year ago

As long as we can read the state, the switcher can preserve it across context switches. We currently do stack zeroing with interrupts disabled, so stalling the store until the state machine has caught up would be no worse than the current behaviour (though not ideal, it is fine for an initial version).

davidchisnall commented 1 year ago

I think, on context switch, we'd need to save and restore two words of state: the top and the bottom of the range being zeroed. It's possible that CSP has been modified between the start and end and so it would be nice if we could capture this as a capability and a stop address.

We currently have two CSRs for the stack high-water mark, the base (CSR_MSHWMB) and the current watermark (CSR_MSHWM). We modify the base only on context switch and we modify the top on call and return.

For asynchronous stack zeroing, we additionally need the following state:

The capability that authorises writing to the stack, which I'll call Zcap
The place to start zeroing (updated every time zeroing happens), which I'll call Ztop
The place to stop zeroing (should be the previous value of the watermark), which I'll call Zbase

I would propose the following interface:

Writing an address to a new CSR (protected with ASR permission) starts zeroing. This takes the Zcap from $csp, Zbase from the current CSR_MSHWM and Ztop from the new CSR_MSHWM value and updates CSR_MSHWM with the written value. If the value written here is not in the bounds of CSP, this traps.

When an interrupt fires, the zeroing stops (can be after an in-flight store has retired if necessary) and must be restarted.

For context switch, Zcap, Ztop, and Zbase are exposed as CSRs. We can define an order so that Ztop and Zbase must be written before Zcap and Zcap then triggers the zeroing to resume.

This would let us store 16 bytes of extra state in the thread structure and have simple control flow in the switcher for all paths.
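A minimal sketch of the proposed interface follows, to make the data flow between the registers concrete. It is a software model only: `Capability`, `ZeroingCSRs`, `Trap`, and the method names are hypothetical, and treating an address equal to the capability's top as in bounds is an assumption.

```python
from dataclasses import dataclass


class Trap(Exception):
    """Stands in for the trap raised on an out-of-bounds write."""


@dataclass
class Capability:
    base: int
    top: int
    tag: bool = True

    def in_bounds(self, addr):
        # Assumption: an address equal to top counts as in bounds.
        return self.tag and self.base <= addr <= self.top


class ZeroingCSRs:
    def __init__(self, csp, mshwm):
        self.csp = csp      # current stack capability ($csp)
        self.mshwm = mshwm  # current stack high-water mark (CSR_MSHWM)
        self.zcap = None
        self.ztop = 0
        self.zbase = 0
        self.running = False

    def write_start(self, new_mark):
        """Write to the new CSR: latch state, move the watermark, start."""
        if not self.csp.in_bounds(new_mark):
            raise Trap("value is outside the bounds of $csp")
        self.zcap = self.csp
        self.zbase = self.mshwm   # stop at the previous watermark
        self.ztop = new_mark      # start at the newly written value
        self.mshwm = new_mark
        self.running = self.ztop > self.zbase

    def interrupt(self):
        """An interrupt stops zeroing; it must be restarted explicitly."""
        self.running = False

    def save(self):
        """Switcher side: read the three registers into the thread state."""
        return self.zcap, self.ztop, self.zbase

    def restore(self, zcap, ztop, zbase):
        """Ztop and Zbase are written first; writing Zcap last resumes."""
        self.ztop = ztop
        self.zbase = zbase
        self.zcap = zcap
        self.running = zcap is not None and zcap.tag and ztop > zbase


csrs = ZeroingCSRs(Capability(0x1000, 0x2000), mshwm=0x1800)
csrs.write_start(0x1C00)   # zero from 0x1C00 down to the old mark, 0x1800
csrs.interrupt()
saved = csrs.save()
csrs.restore(*saved)       # resumes with the same range
```

Having the Zcap write act as the trigger means the switcher can use the same three-register restore sequence whether or not the thread was interrupted during zeroing.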

kliuMsft commented 1 year ago

The interface sounds good. Still thinking about the implication on load/store instructions. Wouldn't this mean we have to do 2 checks in parallel for each load/store, one against the capability referred by the instruction, the other against the stack zeroing (stall if accessing stack area not yet zero'd)? Or we can simply stall all load/store while zeroing in progress?

davidchisnall commented 1 year ago

We would need to do two comparisons, but only when the stack zeroing is in use. That will depend a bit on how many cross-compartment calls we use, but if we have a flip-flop that’s set when the zeroing is finished then we can skip the additional checks while it is set (and maybe power gate the comparators?). If we’re concerned about area, we could make loads and stores take an additional cycle while zeroing is happening (it would still be faster than doing it synchronously) and reuse the comparators from the capability check.

We don’t need to stall loads, we just return zero. We need to stall stores.
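The pipeline-side check might look roughly like the following. This is a hedged sketch: the function name, the `idle` flag standing in for the flip-flop, and the `base_cycles` parameter are all made up for illustration, and the extra cycle models the area-saving option of reusing the capability-check comparators.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    RETURN_ZERO = "return zero"
    STALL = "stall"


def zeroing_check(addr, is_store, idle, ztop, zbase, base_cycles=1):
    """Return (action, cycles) for one load or store.

    idle models the flip-flop that is set once zeroing has finished;
    while it is set, the range comparison is skipped entirely.
    For a stalled store, cycles is only the minimum latency.
    """
    if idle:
        return Action.PROCEED, base_cycles
    cycles = base_cycles + 1  # extra cycle while zeroing is in progress
    if zbase <= addr < ztop:
        # Loads can complete with zero; stores must wait for the engine.
        return (Action.STALL if is_store else Action.RETURN_ZERO), cycles
    return Action.PROCEED, cycles


in_range_load = zeroing_check(0x104, False, False, ztop=0x110, zbase=0x100)
in_range_store = zeroing_check(0x104, True, False, ztop=0x110, zbase=0x100)
```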

kliuMsft commented 1 year ago

The extra cycle idea sounds good - basically we can make all loads/stores take at least 2 cycles when zeroizing the stack. We may still stall both loads and stores, since:

I would like to make the first stalling decision as simple as possible for timing purposes, since it has to be done combinationally and the decision feeds into a lot of things. The subsequent decisions can be registered and are less critical.
It's true we don't have to stall loads when zeroing, but the extra cycle still buys more time for the address check logic to decide whether to issue the actual read. From a side-channel perspective we'd rather not issue the read (vs. read and replace the return data with zero).

The logic is kind of intricate and will take some time to implement. But I guess it is worth the effort since it would really overlap the thread start time with stack zeroing.

kliuMsft commented 12 months ago

@davidchisnall, in the case of interrupts when a load/store (targeting the un-scrubbed area) is still stalling, we can either abort and throw a fault (mepcc updated with the address of the aborted instruction and mcause set to CHERI fault), or treat it more like the normal interrupt (mepcc points to the next instruction and firmware has to restart). Which way do you prefer? In both cases we can still make sure that the memory access doesn't really happen.

Also, for now can we stall loads to the unscrubbed stack region as well? I know we can return 0's but stalling might be simpler for hardware and I assume normally software won't read the unscrubbed stack so it's not a performance concern?

kliuMsft commented 12 months ago

Actually, a couple more questions:

Do you see the need for the scrubber and hardware revoker to be active at the same time? If so, I assume the scrubber would have priority? It would be nice from a hardware perspective if we only need to support one activity at a time, as the complexity and validation effort go up quite a bit with concurrent activities.
Do we need a way for firmware to abort the scrubbing process by writing somewhere? I think it's a good idea in case we have a hardware bug or misconfiguration. Anyhow, if it's not needed architecturally, I might still add something as a memory-mapped debug register.
Does the scrubbing state machine really have to check against zcap? I think both ztop and zbase are already in the SR privileged domain?

davidchisnall commented 12 months ago

> @davidchisnall, in the case of interrupts when a load/store (targeting the un-scrubbed area) is still stalling, we can either abort and throw a fault (mepcc updated with the address of the aborted instruction and mcause set to CHERI fault), or treat it more like the normal interrupt (mepcc points to the next instruction and firmware has to restart). Which way do you prefer? In both cases we can still make sure that the memory access doesn't really happen.

For an asynchronous interrupt, if the load / store hasn't happened then we want the MEPCC to point to the instruction that should be restarted.

> Also, for now can we stall loads to the unscrubbed stack region as well? I know we can return 0's but stalling might be simpler for hardware and I assume normally software won't read the unscrubbed stack so it's not a performance concern?

I'm not sure how common this will be in software. Maybe stall for now but add a couple of performance counters to see how long we're stalling (on loads and stores).

> Do you see the need for the scrubber and hardware revoker to be active at the same time? If so, I assume the scrubber would have priority? It would be nice from a hardware perspective if we only need to support one activity at a time, as the complexity and validation effort go up quite a bit with concurrent activities.

They're unrelated code paths in software. The scrubber is more urgent though so it's fine to suspend the revoker while the scrubber is running and use a single load/store pipeline for both.

> Do we need a way for firmware to abort the scrubbing process by writing somewhere? I think it's a good idea in case we have a hardware bug or misconfiguration. Anyhow, if it's not needed architecturally, I might still add something as a memory-mapped debug register.

We want to stop when we context switch and resume when the suspended thread is resumed, so the switcher needs to be able to run it before and after.

> Does the scrubbing state machine really have to check against zcap? I think both ztop and zbase are already in the SR privileged domain?

My assumption was that both ztop and zbase are just addresses, whereas zcap is the capability that authorises them.

kliuMsft commented 11 months ago

Ok I added the feature in new commit (228c615). FPGA build looks okay.

New CSR (ztop) is 0xbc3 (one above MSHWMB). Writing to ZTOP starts the stack clearing engine. The engine goes down from the initial top value until it hits the stack base (which == MSHWM when stack clearing started).
ZTOP points to the last address zeroed out and is continuously updated by hardware until the clear completes.
Note the first address zeroed out is the initial ZTOP - 4.
As discussed before, all loads/stores issued while stack clearing is ongoing take 2 cycles when they are NOT targeting the area not yet cleared. A load/store access targeting the area not yet cleared will stall until either its target address is cleared, or the stack clearing process is aborted by an unmasked interrupt. In the latter case, a CHERI fault is generated for the load/store instruction involved.
Note that in the abort case, the fault takes precedence over the interrupt, e.g., there could be a fault exception (mepcc is set to faulted load/store) followed by an interrupt exception.
Note that

kliuMsft commented 11 months ago

@davidchisnall, @nwf-msr, I realized there are still a few things to be sorted out when we switched to ztop as an SCR.

Ztop is a hardware-stored cap value. Ztop.address is continuously updated by hardware while zeroization is in progress.
When hardware is idle:
-- Write a tagged cap: updates ztop and kicks off zeroization (hardware will check address <= top and >= base before kicking it off).
-- Write an untagged value: will still update the ztop value (so a cspecialr can read back the same thing).
When hardware is active (zeroization in progress):
-- Write an untagged value: currently this is ignored, but it seems we want to use this to abort the zeroization? In this case do we expect to read back the value written (untagged) or the zeroization progress prior to stopping (as a tagged cap)?
-- Write a tagged cap: currently this is also ignored by hardware. I'd prefer this way rather than using this as a way to abort/restart, for the sake of complexity.
-- Read ztop: returns the progress (ztop.address). Ztop.tag is cleared when zeroization completes (ztop.address == ztop.base).
When zeroization is aborted by an unmasked interrupt:
-- Hardware state machine stops.
-- Reading ztop returns a valid cap if zeroization did not complete (ztop.address != ztop.base).
-- If the current CPU instruction is a stalled load/store (to the uncleared stack region), it will fault (in such a case we could see a fault followed by an interrupt).
-- Otherwise the CPU will take the interrupt and move to the ISR.
-- Question: note we don't have a way to explicitly tell software that the hardware zeroization is busy (in progress). So if we read back ztop as tagged, it could either be that the hardware is idle/aborted or that it is still in progress. Is that a problem?
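The write and read rules listed above can be captured in a small model, which may help when reasoning about the open questions. It models only the behaviour described as current in this comment; the names `Cap` and `ZtopSCR` are invented, the 4-byte step comes from the earlier note that the first address zeroed is the initial ZTOP - 4, word alignment is assumed, and whether an out-of-bounds tagged write still updates the stored value is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cap:
    address: int
    base: int
    top: int
    tag: bool


NULL = Cap(0, 0, 0, False)


class ZtopSCR:
    WORD = 4  # assumed step; address and base are assumed word-aligned

    def __init__(self):
        self.value = NULL
        self.active = False
        self.memory = {}

    def write(self, cap):
        if self.active:
            return  # tagged and untagged writes are both ignored while busy
        self.value = cap  # idle: the value is always updated
        if cap.tag and cap.base <= cap.address <= cap.top:
            self.active = True
            self._finish_if_done()

    def read(self):
        return self.value  # address reflects progress while busy

    def step(self):
        """One engine store: zero the word below address, then move down."""
        if not self.active:
            return
        addr = self.value.address - self.WORD
        self.memory[addr] = 0
        self.value = replace(self.value, address=addr)
        self._finish_if_done()

    def interrupt(self):
        """Unmasked interrupt: the engine stops and keeps its progress."""
        self.active = False

    def _finish_if_done(self):
        if self.value.address <= self.value.base:
            self.value = replace(self.value, tag=False)  # tag cleared when done
            self.active = False


scr = ZtopSCR()
scr.write(Cap(0x110, 0x100, 0x200, True))  # idle + tagged: starts zeroing
scr.step()                                 # zeroes 0x10C, the initial ZTOP - 4
scr.write(NULL)                            # busy: ignored
scr.interrupt()                            # aborted, progress kept in ztop
progress = scr.read()
```

The model also makes the question at the end of the list visible: `read()` returns a tagged value both while the engine is running and after an abort, so software cannot tell the two apart from ztop alone.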

davidchisnall commented 11 months ago

> -- Write an untagged value: currently this is ignored, but it seems we want to use this to abort the zeroization? In this case do we expect to read back the value written (untagged) or the zeroization progress prior to stopping (as a tagged cap)?

On context switch, we want to stop zeroing as soon as we switch away from the thread, for two reasons:

We don’t want zeroing in one thread to slow down another.
We will restart by writing the ztop value from the interrupted thread back to the SCR. If zeroing doesn’t stop then we zero the same region twice.

> -- Write a tagged cap: currently this is also ignored by hardware. I'd prefer this way rather than using this as a way to abort/restart, for the sake of complexity

We currently always swap the CSR with null on interrupt and write the new thread’s value when resuming. As long as writing null stops, this is fine.

> -- Read ztop returns the progress (ztop.address). Ztop.tag is cleared when zeroization completes (ztop.address == ztop.base)

That’s perfect, I can delete a conditional branch from the switcher code with this guarantee.

> -- Hardware state machine stops

We don’t need this, since we will explicitly stop it a few instructions into the interrupt handler. It doesn’t hurt though.

> -- If the current CPU instruction is a stalled load/store (to the uncleared stack region), it will fault. (in such case we could see a fault followed by an interrupt)

The ideal behaviour here would be to not fault, but to rewind the MEPCC to the start of the interrupted instruction. I wonder if it’s possible to move the PCC update to after the zeroizer has approved the instruction?

> -- Question: note we don't have a way to explicitly tell software that the hardware zeroization is busy (in progress). So if we read back ztop as tagged, it could either be that the hardware is idle/aborted or that it is still in progress. Is that a problem?

That should be fine. We treat the zeroization state as just another part of thread state. As long as we can stop it for a region and restart it for a region, we don’t care if one of those regions is empty. We just spill and reload the untagged value and the hardware ignores it. If we interrupt a thread during zeroing, we store its ztop and resume it later.
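A sketch of that switcher flow, under the assumption that writing null to the SCR stops the engine (the behaviour asked for above, not necessarily what the hardware does today). `Cap`, `Zeroizer`, `Thread`, and `context_switch` are hypothetical names, and the engine is reduced to the minimum needed to show the spill and reload.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cap:
    address: int
    base: int
    tag: bool = True


NULL = Cap(0, 0, tag=False)


class Zeroizer:
    WORD = 4  # assumed step size in bytes

    def __init__(self):
        self.ztop = NULL
        self.zeroed = []  # addresses zeroed so far, in order

    def swap(self, new):
        """Model of swapping the SCR: returns the old value, installs new."""
        old, self.ztop = self.ztop, new
        return old

    def step(self):
        z = self.ztop
        if not z.tag:
            return  # an untagged value is ignored by the engine
        if z.address <= z.base:
            self.ztop = replace(z, tag=False)  # empty region: nothing to do
            return
        addr = z.address - self.WORD
        self.zeroed.append(addr)
        self.ztop = replace(z, address=addr, tag=addr > z.base)


@dataclass
class Thread:
    name: str
    saved_ztop: Cap = NULL


def context_switch(hw, outgoing, incoming):
    # Swap in null first, so zeroing for the old thread stops immediately.
    outgoing.saved_ztop = hw.swap(NULL)
    # Reload the new thread's value; if it is untagged, nothing restarts.
    hw.swap(incoming.saved_ztop)


hw, a, b = Zeroizer(), Thread("a"), Thread("b")
hw.swap(Cap(0x20, 0x10))   # thread a starts zeroing 0x20 down to 0x10
hw.step()
context_switch(hw, a, b)   # a is interrupted part-way through
hw.step()                  # no effect: b has nothing to zero
context_switch(hw, b, a)   # a resumes from where it stopped
```

Because the saved value is written back unconditionally, the switcher needs no branch to decide whether zeroing was in progress for the incoming thread.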

kliuMsft commented 11 months ago

> The ideal behaviour here would be to not fault, but to rewind the MEPCC to the start of the interrupted instruction. I wonder if it’s possible to move the PCC update to after the zeroizer has approved the instruction?

Rewinding mepcc is the fault behavior though. And since we only support direct/non-vectored exceptions, to me it doesn't seem much different. We can certainly use a different mcause to signal that this is a special case?

davidchisnall commented 11 months ago

> Rewinding mepcc is the fault behavior though. And since we only support direct/non-vectored exceptions, to me it doesn't seem much different. We can certainly use a different mcause to signal that this is a special case?

The thing that I don't like is needing to enter the interrupt handler twice to deliver the interrupt, once for the fault and once for the interrupt. Ideally, we'd just report the interrupt, but with an MEPCC value that meant that we could resume.

I don't really want to get a fault here, because it isn't a fault. A fault will trigger the error-handling code paths, but there wasn't an error and so they will do the wrong thing. If we had a different error code, we could just fall into the interrupt code path, but we don't need the interrupt. Ideally, we'd have the interrupt cause, but the MEPCC set to the correct value for a fault.