Open moidx opened 1 month ago
I'd say 10 iterations for the loop works with a little more complex code, but I suppose closer to 30-40 would definitely cause this. Easiest test - run SHA256 with 64+ message, so CMD.HASH_STOP would be used before final steps to save context.
Thanks for reporting this @vsukhoml , and thanks for filing the issue @moidx .
Is it possible to workaround the issue by checking the FIFO empty status bit before triggering the stop? Alternatively, can we check the idle status bit before triggering the stop?
It would be good to understand what the status of the FIFO and the overall IP is when the issue occurs and when not. Emptying the FIFO takes just 16 clock cycles, and computing a block in order of 48 to 80 clock cycles. With 30 - 40 loop iterations I would expect that we signal the stop when the hardware is already idle.
Update: I could reliably reproduce this on the FPGA using this command:
bazel test --test_output=streamed --cache_test_results=no //sw/device/tests/crypto:hmac_multistream_functest_fpga_cw310_sival_rom_ext
With 40 loop iterations, the test reliably hangs when streaming the first segment. Printing the status register before signaling the stop reveals that the FIFO is always empty, independent whether the test fails or passes. All other status bits/fields are 0. Also the ERR_CODE and MSG_LENGTH_LOWER registers don't provide useful insight.
I will now collect some waves with the following command and report back tomorrow:
util/dvsim/dvsim.py hw/top_earlgrey/dv/chip_sim_cfg.hjson -i chip_sw_hmac_multistream -t vcs --purge --reseed 1 --waves vpd
I checked that if I'm just making a delay after HMAC_CMD_HASH_STOP_BIT, I can read a non-zero digest, which seems to be the state I'm looking for, but I can't continue - HMAC never reports INTR_STATE.hmac_done, so logic for next commands fails. Status can be either 2 or 3 though. I also tried to send hash_stop multiple times - it didn't affect behavior with no delay, but didn't help for the case with a delay. I suspect there is 96 cycle window between last byte/word of the block sent to FIFO and when hash_stop can be sent for regular operations. If hash_stop is delayed - it stuck in the state with no INTR_STATE.hmac_done.
I tried to "reset" HMAC by sending hash_process command after hash_stop and by sending invalid command with all 4 commands, but it didn't help, though STATUS.idle bit became set. For invalid command I got ERR_CODE=2, and couldn't reset this value, so stuck with status=00000003, err_code=00000002, intr_state=00000000
.
So, besides finding a workaround other than using SW or OTBN for SHA2 with context switching, it would be nice to have means to bring HMAC into operational state after software errors like sending invalid command.
While it is not a good use case, but I get to the same state with sequence of hash_start
and hash_stop
commands with zero length message in between. just one after another. status=2, intr_state=0. I'd say HMAC HW shall not hang on sequences like this.
Another update - I tested what it to send hash_stop
before I send last word of the block to fifo. it doesn't help and behavior is the same as with zero length message. Next command fails with intr_state.done not being set: intr_state=0 and status = 2. So it doesn't really work as a workaround.
Also, in my test with zero-length message between hash_start and hash_stop, I found that despite failure I read out correct SHA2 state in digest registers.
Thanks @vsukhoml, I'm trying to follow through your observations vs. the RTL but waveforms will definitely help a lot, so at this point I'm just suggesting other unverified ideas at high-level to try out.
Are you sure CFG.sha_en
is set and remains enabled throughout all of this? Because getting ERR_CODE=2 at some point makes me wonder.
I checked that if I'm just making a delay after HMAC_CMD_HASH_STOP_BIT, I can read a non-zero digest, which seems to be the state I'm looking for, but I can't continue - HMAC never reports INTR_STATE.hmac_done, so logic for next commands fails.
What happens if you read your digest, then deassert CFG.sha_en
as well as hash_start, hash_process, hash_stop and hash_continue? Does this bring it into operational state again?
@martin-velay and I could track down the RTL bug inside prim_sha2
:
assign idle_o = (fifo_st_q == FifoIdle) && (sha_st_q == ShaIdle) && !hash_go;
For the primitive to signal the idle status, both the internal FIFO and the SHA core need to be idle.
The problem here is that the FIFO FSM only reacts on the msg_feed_complete
pulse (resulting from the stop command) while being in the FifoWait
state, i.e., after the internal FIFO has been emptied and while the SHA core is processing:
FifoWait: begin
if (msg_feed_complete && one_chunk_done) begin // <-- Only here we listen on the msg_feed_complete
fifo_st_d = FifoIdle;
// hashing the full message is done
hash_done_next = 1'b1;
end else if (one_chunk_done) begin
fifo_st_d = FifoLoadFromFifo;
end else begin
fifo_st_d = FifoWait;
end
end
Once the core is done, the one_chunk_done
signal is raised and we go back in to the FifoLoadFromFifo
state. This state doesn't listen onto the msg_feed_complete
signal. The FSM is waiting for new data forever, but that new data is not coming because outside of the primitive, the stop is registered and we won't ever forward data again.
A potential fix would to make the FifoLoadFromFifo
listen on the msg_feed_complete
signal while the FIFO is empty:
FifoLoadFromFifo: begin
if (!shaf_rvalid) begin
// Wait until it is filled
fifo_st_d = FifoLoadFromFifo;
update_w_from_fifo = 1'b0;
if (msg_feed_complete) begin // <-- potential RTL fix to be confirmed.
fifo_st_d = FifoIdle; // <-- potential RTL fix to be confirmed.
end // <-- potential RTL fix to be confirmed.
end else if (w_index_q == 4'd 15) begin
fifo_st_d = FifoWait;
// To increment w_index and it rolls over to 0
update_w_from_fifo = 1'b1;
end else begin
fifo_st_d = FifoLoadFromFifo;
update_w_from_fifo = 1'b1;
end
end
A possible software workaround could be to:
Between sending the last word and sending the stop, we do have 64 clock cycles (SHA2-256) or 80 clock cycles (SHA2-384, 512). So we need to be fast. I think disabling interrupts is the only option we have here.
Here some waves to visualize the situation in the RTL:
I have reproduced the bug at block level, thanks for raising this issue and for the details.
Could you please elaborate on "workaround where we "insert" the delay before writing the last word in the message FIFO: writing "block_size - 1 word" words, inserting a large delay, writing the last word into the FIFO, then triggering the stop. And it seems to work well. If this can be acceptable, then it's fine. Otherwise we should find out another solution."
Does it make things insensitive to delay between last word/byte sent to fifo and hash_stop? How long shall be a delay?
I tried to unblock HMAC when it stuck with continuing writes to MSG_FIFO after failed hash_stop, but it seems didn't work. So far various combinations of writes to fifo and cycling CFG_SHA_EN didn't change things. I observed fifo entry count in STATUS growing, but w/o anything, and once I reach 128 bytes it seems that I hang on the write to fifo(?).
As for disabling interrupts - it may be a solution for some use cases, but not others - e.g. when cryptolib runs in the user mode, so can't really disable interrupts other than with some support from OS.
I also tried delay (for (size_t i = 0; i < 1000; i = launder32(i + 1)) ;
) before the last word or byte of the block, but it doesn't change anything for me. To do that I check MESSAGE_LENGTH_LOWER would be aligned for a block for a given data, and if so, depending on alignment choose where to add delay - before last word or byte and do it.
Could you please elaborate on "workaround where we "insert" the delay before writing the last word in the message FIFO: writing "block_size - 1 word" words, inserting a large delay, writing the last word into the FIFO, then triggering the stop. And it seems to work well. If this can be acceptable, then it's fine. Otherwise we should find out another solution."
Does it make things insensitive to delay between last word/byte sent to fifo and hash_stop? How long shall be a delay?
I tried to unblock HMAC when it stuck with continuing writes to MSG_FIFO after failed hash_stop, but it seems didn't work. So far various combinations of writes to fifo and cycling CFG_SHA_EN didn't change things. I observed fifo entry count in STATUS growing, but w/o anything, and once I reach 128 bytes it seems that I hang on the write to fifo(?).
I am not sure it's relevant to solve the problem. But let's take a concrete example (simulation below), I am in SHA2-256 mode so I am expecting blocks of 512bits and my message size is 712 bits. Here is my sequence:
Then I get the HASH done signal as expected for the intermediate HASH after the stop.
And got another HASH done. And the digest is matching with the model.
Another point, I have also tried a bit to move out from this hang state, but I haven't succeeded so far. Unfortunately I can hardly work on this this week, but I'll try to. If we trigger a hash_go
(which is hash_start
or hash_continue
) we can navigate into the fifo_st_q
FSM:
FifoIdle: begin
if (hash_go) fifo_st_d = FifoLoadFromFifo;
else fifo_st_d = FifoIdle;
end
And then hopefully getting out from this limbo state maybe by writing in the message FIFO or doing something like that, I need to dig more, I hope I'll have time later today.
Once there is a hang detected, I think we want to have hash_go = 0
(so both hash_start = 0
and hash_continue = 0
and stay so, both should be low at this point in time anyway) and then deassert sha_en
(all commands stay low), this should navigate control to the FSM state fifo_st_d = FifoIdle
and sha_st_d = ShaIdle
, and would also assert the idle_o.
Then feed sha_en = 1
again and start over with a new hash_start/continue (only issue hash_start
or hash_continue
when sha_en = 1
to avoid triggering error states). Can we try this sequence @martin-velay @vsukhoml?
@martin-velay, unfortunately this is not a viable workaround:
Trigger a stop, within a delay shorter than the time needed to process the HASH around 64 clock cycles in SHA2-256 and 80 clock cycles in SHA2-384/512
The issue is that I can't control time between last byte/word and hash_stop due to interrupts. If I can disable interrupts, then workaround is to disable interrupts and make sure that compiler produce code which fulfills this requirement on timing.
In this hanged state I couldn't send any commands. So far I limited code to use HMAC only in hash_start/hash_process mode w/o context reloading, and implemented context reloading in software. I'll try again with hash_continue
later.
I'd split the problem in 2 parts:
@gdessouky I tried to play with SHA_EN = 0, then 1, but it didn't seem to recover. I'll also try this method.
@gdessouky I was able to get valid output by simply ignoring STATUS.IDLE =0 first time, and then I got STATUS.IDLE=1 for next commands, and ignoring INTR_STATE.hmac_done - this never recovers and always remains 0. I tried sending various commands, including CMD=0 (no command), before/after/between cycling SHA_EN, but it didn't change anything.
So one workaround can be to simply use delay (or, loop with timeout serving same purpose when HMAC hanged) after command rather than relying on status bits. Still, I faced an incorrect output with HMAC SHA384 key len=50, message len=128
. SHA256/384/512 worked correctly though. Not sure if this is HMAC mode that matters somehow.
yes, instr_state.hmac_done
will not recover. This is only set in fifo_st_d = FifoWait
under particular conditions https://github.com/lowRISC/opentitan/blob/master/hw/ip/prim/rtl/prim_sha2.sv#L346, and once this hang state is triggered, I don't see a direct way to navigate to this. I believe HMAC core (for HMAC mode) waits on this signal from the SHA-2 engine that it is done hashing to start computing the second round and this never happens, so you will not get a correct HMAC output when HMAC mode is enabled, but you will get a correct SHA-2 output. But I still need to look at waveforms to confirm that there is no way we can recover HMAC output as well.
So my goal now is to recover the HMAC into operational state when this happens and be able to try again a new operation and get a result this time. I don't know how often is such a delay at this particular time expected, but I'd suggest you wait on either instr_state.hmac_done
or a timeout, whichever occurs first. In case of timeout, you recover HMAC into operational state by cycling SHA_EN -> 0 -> 1
, and then start the operation again, assuming it is unlikely you get an interrupt fired again at that particular point in time - do you think this sounds reasonable? Can you please test that if you start a new operation again (without delays) after recovering, you do get instr_state.hmac_done
asserted and you can read the correct HMAC output @vsukhoml?
@gdessouky I already do state by cycling SHA_EN -> 0 -> 1
, but it doesn't resolve full operational state for HMAC, only SHA continues to work.
Can you please test that if you start a new operation again (without delays) after recovering, you do get instr_state.hmac_done asserted and you can read the correct HMAC output
I observe failure on CMD=hash_stop. I also noticed that despite no hmac_done
I read correct state from DIGEST regs.
So you suggest that after failure I'd need to run some other command, e.g. hash_start and power cycle SHA_EN? I think I tried variations of this procedure in attempts to recover, but will try again.
So I tried to send hash_start
command once I detect intr_state==0 and then cycle SHA_EN, tried to send hash_start
followed by hash_process
and cycle SHA_EN, but can't get intr_state.hmac_done to be set and can't get correct HMAC result. SHA kind of works in this mode if it is ok to live with intr_state.hmac_done=0.
Hi @vsukhoml,
I worked on finding a way to get out from this hanged state in simulation. I think I have found a way out. It's probably not the ideal and maybe there is a simpler one.
Steps to follow with the explanations:
sha_hash_continue_o
based on the register reg_hash_continue
from hmac_core.svStIdle
state to StFifoReceive
from prim_sha2_pad.sv this will enable us to trigger shaf_rvalid_o
later, this will unlock us from the state fifo_st_q==FifoLoadFromFifo
in the block prim_sha2.sv.digest_on_blk
to be able to move into done_state_d==DoneAwaitCmd
in hmac.sv this will then lead to trigger hash_done_event
and be back in a stable state on all the FSMs.shaf_rvalid_o
from prim_sha2_pad.svstart_hash
intr_state.hmac_done
I am testing it currently, I'll keep you posted.
Well it seems to work, at least now I see it going into IDLE and failing as the digest is not matching, which is normal as we wrote some wrong data into the message FIFO. The trick is maybe to save the context before doing all these steps and restoring it once we are "back on our legs". As I think the available context is updated on the block boundaries even if the stop is not triggered. So, the digest
and the msg_length
should be correct. Then do what is described below and restore the context as usual. So far I haven't been able to get this to work in simulation, but I hope it will on your side.
Insert delay to reveal the issue:
// delay here, even this small is enough
for (size_t i = 0; i < 40; i = launder32(i + 1))
;
You can insert the following right after the delay:
// Time to tell HMAC HWIP to stop, because we do not have enough message
// bytes for another round.
uint32_t cmd_reg =
bitfield_bit32_write(HMAC_CMD_REG_RESVAL, HMAC_CMD_HASH_STOP_BIT, 1);
abs_mmio_write32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET, cmd_reg);
// Disable the HMAC to trigger sha_hash_continue_o based on the register reg_hash_continue
// from hmac_core.sv
uint32_t cfg_reg = abs_mmio_read32(kHmacBaseAddr + HMAC_CFG_REG_OFFSET);
cfg_reg = bitfield_bit32_write(cfg_reg, HMAC_CFG_HMAC_EN_BIT, false);
abs_mmio_write32(kHmacBaseAddr + HMAC_CFG_REG_OFFSET, cfg_reg);
// Trigger HASH continue to move from StIdle state to StFifoReceive from prim_sha2_pad.sv this
// will enable us to trigger shaf_rvalid_o later, this will unlock us from the state
// fifo_st_q==FifoLoadFromFifo in the block prim_sha2.sv.
cmd_reg = abs_mmio_read32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET);
cmd_reg = bitfield_bit32_write(cmd_reg, HMAC_CMD_HASH_CONTINUE_BIT, true);
abs_mmio_write32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET, cmd_reg);
// Get the current message length to know how much words we need to write to fall on the block
// boundary and trigger digest_on_blk to be able to move into done_state_d==DoneAwaitCmd in
// hmac.sv this will then lead to trigger hash_done_event and be back in a stable state on all
// the FSMs.
msg_len = abs_mmio_read32(kHmacBaseAddr + HMAC_MSG_LENGTH_LOWER_REG_OFFSET);
uint32_t digest_size = bitfield_field32_read(cfg_reg, HMAC_CFG_DIGEST_SIZE_FIELD);
// Compute next block boundary
uint32_t msg_length_to_wr;
if (digest_size == 1) { // which is SHA2-256
msg_length_to_wr = 512 - msg_len % 512;
} else {
msg_length_to_wr = 1024 - msg_len % 1024;
}
// Write a dummy message into the message FIFO to trigger shaf_rvalid_o from prim_sha2_pad.sv
for (int i=0; i<msg_length_to_wr/32; i++) {
abs_mmio_write32(kHmacBaseAddr + HMAC_MSG_FIFO_REG_OFFSET, 2863311530);
}
// Finally trigger hash_process
cmd_reg = abs_mmio_read32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET);
cmd_reg = bitfield_bit32_write(cmd_reg, HMAC_CMD_HASH_PROCESS_BIT, true);
abs_mmio_write32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET, cmd_reg);
@vsukhoml please keep me informed
@martin-velay , the purpose of hash_stop
is to read out the state of hash (which would be in DIGEST registers) to be able to load back with hash_continue
sometime later. When would it make sense to read out state in this sequence? I found that even if HMAC hangs, I can still read valid state from DIGEST registers, but in this sequence you disable HMAC (which is ok, I don't really need it until final step), and do hash_continue
, so I'm not sure if DIGEST would be valid.
@vsukhoml, from what I can see on simulation the digest
and the msg_length
are updated after each block even if we don't trigger hash_stop
. Using the stop command is required to move the internal logic into a clean state and to trigger a hash_done
.
Here below is a set of screenshots based on different scenarios:
As you can see the digest
is updated on the block boundary and is always the same in all the scenarios.
My suggestion is to take advantage of this behaviour to save the context after a stop if we notice a hang state. I've tested this on the FPGA and it works: I have inserted:
context_save(ctx);
Right after:
// Time to tell HMAC HWIP to stop, because we do not have enough message
// bytes for another round.
uint32_t cmd_reg =
bitfield_bit32_write(HMAC_CMD_REG_RESVAL, HMAC_CMD_HASH_STOP_BIT, 1);
abs_mmio_write32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET, cmd_reg);
And commented out the one after we wait for IDLE state:
// Wait for HMAC HWIP operation to be completed.
HARDENED_TRY(hmac_idle_wait());
// Store context into `ctx`.
// context_save(ctx);
Result:
//sw/device/tests/crypto:hmac_multistream_functest_fpga_cw310_sival_rom_ext PASSED in 19.3s
Executed 1 out of 1 test: 1 test passes.
I know it's not clean, we should save the context after being on IDLE as it was done before if we are not hanged and we shouldn't go through all the steps I have mentioned earlier. We should also re-enable the HMAC after the trick if it was enabled before. To sum-up (pseudo-code):
...
trigger_stop();
if (hanged_after_stop) {
save_context();
hmac_en_bak = CFG.hmac_en; // Store if HMAC is enabled of not (register CFG.hmac_en)
unhang_hmac(); // Execute the steps described [here](https://github.com/lowRISC/opentitan/issues/24767#issuecomment-2426954419)
CFG.hmac_en = hmac_en_bak; // Restore CFG.hmac_en as it was before
HARDENED_TRY(hmac_idle_wait());
} else {
HARDENED_TRY(hmac_idle_wait());
save_context();
}
...
What do you think? Is it a viable workaround?
@vsukhoml, I have prepared a PR https://github.com/lowRISC/opentitan/pull/24839, the code is not even compiling yet, but it was to show the principle of the proposed modification.
If you any of you @jadephilipoom / @ballifatih / @vsukhoml have time to work on it please push your changes on this PR. Thanks for your help! And sorry if my C code doesn't work (or comply).
@martin-velay this may work - I will try some time later. However, what you say about digest registers and state means that I can just wait for some time (or may be use some status bit?) after last byte/word of the block is sent to FIFO, skip hash_stop
, and then read digest and disable HMAC/SHA. This would be way simpler.
@vsukhoml, if the recover_hw_after_stop
function from this PR https://github.com/lowRISC/opentitan/pull/24839 is there we can probably skip the stop. But I think it's not such a bit deal to keep this write, as this recover_hw_after_stop
is only temporary and should be removed. Without this function, if you don't trigger the stop the HW should be in a state from where you probably cannot proceed.
@martin-velay , I came to a simpler workaround:
As the stop command seems to not be a strong requirement from a SW point of view, I have implemented the suggested workaround from @vsukhoml. The PR is available here https://github.com/lowRISC/opentitan/pull/24944.
I have tested it on the FPGA and at top level simulation with these commands:
bazel test --test_output=streamed --cache_test_results=no //sw/device/tests/crypto:hmac_multistream_functest_fpga_cw310_sival_rom_ext
util/dvsim/dvsim.py hw/top_earlgrey/dv/chip_sim_cfg.hjson -i chip_sw_hmac_multistream -t xcelium --fixed-seed 1 --gui
It seems to work and the FSMs seem to be in proper states. As this workaround is simpler than the proposed one here https://github.com/lowRISC/opentitan/pull/24839, I would suggest to go with this new solution. But it also means that each time the stop command were supposed to be triggered, instead we are inserting a delay. I suppose this is not an issue for the performance, but I prefer to raise the point.
Thanks for your proposal @vsukhoml and thanks for giving this a try and doing a PR @martin-velay ! This is nice progress. Once your PR #24839 is merged the following three tasks remain open:
To answer the question on why block-level DV did not catch this in the first place, I think there are multiple reasons:
a_valid_delay_max
) and for raising data ready bit (d_ready_delay_max
). But the default max delays are both 10 clock cycles which gives a max delay between HMAC receives the last data work and the stop command of 20 clock cycles which is not enough to cover this particular case here (we would need 64 or 80 clock cycles).FYI @moidx
Thanks @vogelpi for this summary. @moidx, I would like to add a few comments. Not to clear myself or to put the blame on somebody else, but more to find a better approach for next time.
Similar issues can be found in other blocks in the future, it may be wise to add a mandatory point in V3 for example where we run some relevant tests (/!\ not a smoke test) with longer random delays to try to mimic SW behaviour in case of interrupt (to be defined with SW specialists). To be verified and part of the verification plan, this should be defined in the specification. It will then have dedicated coverage metrics. Note: we probably shouldn't increase the register access delays from the beginning as the configuration could take ages. We can probably only enable this once the initial DUT configuration has been done. And this makes more sense as the SW probably won't get any interrupt during this phase.
The aim of verification is to ensure that the design behaves in accordance with its specifications. This approach can help to catch some issues, but cannot cover all real-world requirements, as it is not possible to specify them all. It's really difficult to predict in advance where the pitfalls are. That's why validation is useful. It aims to ensure that the ASIC will meet the needs of the end user and the requirements of the real world. I firmly believe that the role of FPGA-prototyping is crucial, because that's where the real software works with the real RTL in near-real-world conditions. This particular bug was discovered in this way, which is a good thing, but a little late because the TO had already taken place. So perhaps we should reconsider our flow and promote validation with FPGA-prototyping.
Nevertheless, I'm sincerely relieved that a viable SW workaround has been found. Thank you all!
@vogelpi, I have created a PR https://github.com/lowRISC/opentitan/pull/24966 to update the HMAC Programmer's Guide. I've done on the master for the moment.
Description
Certain delays between last message FIFO write and
HMAC_CMD_HASH_STOP_BIT
andHMAC_CMD_HASH_PROCESS
commands cause theHMAC.STATUS
register to get stuck withhmac_idle
bit cleared.Reproduction steps (pseudo-code):
During regular operating conditions, interrupts can fire at any time, causing delays similar to what is captured in the reproduction steps. This may cause hangs in the field that will be very difficult to debug / root cause.
Issue reported by @vsukhoml. CC: @vogelpi, @martin-velay, @johannheyszl, @gdessouky