[hmac] `HMAC_CMD_HASH_STOP_BIT` and `HMAC_CMD_HASH_PROCESS` commands causing hang conditions

moidx commented 1 month ago

Description

Certain delays between last message FIFO write and HMAC_CMD_HASH_STOP_BIT and HMAC_CMD_HASH_PROCESS commands cause the HMAC.STATUS register to get stuck with hmac_idle bit cleared.

Reproduction steps (pseudo-code):

 status = msg_fifo_write(data, len - leftover_len);

 // delay here, even  this small is enough
 for (size_t i = 0; i < 10; i = launder32(i + 1))
    ;

 // Time to tell HMAC HWIP to stop, because we do not have enough message
 // bytes for another round.
 uint32_t cmd_reg =
     bitfield_bit32_write(HMAC_CMD_REG_RESVAL, HMAC_CMD_HASH_STOP_BIT, 1);
 abs_mmio_write32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET, cmd_reg);

 // Wait for HMAC HWIP operation to be completed.
 // In the error condition, this function will block forever.
 status = hmac_idle_wait();

During regular operating conditions, interrupts can fire at any time, causing delays similar to what is captured in the reproduction steps. This may cause hangs in the field that will be very difficult to debug / root cause.

Issue reported by @vsukhoml. CC: @vogelpi, @martin-velay, @johannheyszl, @gdessouky

vsukhoml commented 1 month ago

I'd say 10 iterations for the loop works with a little more complex code, but I suppose closer to 30-40 would definitely cause this. Easiest test - run SHA256 with 64+ message, so CMD.HASH_STOP would be used before final steps to save context.

vogelpi commented 1 month ago

Thanks for reporting this @vsukhoml , and thanks for filing the issue @moidx .

Is it possible to workaround the issue by checking the FIFO empty status bit before triggering the stop? Alternatively, can we check the idle status bit before triggering the stop?

It would be good to understand what the status of the FIFO and the overall IP is when the issue occurs and when not. Emptying the FIFO takes just 16 clock cycles, and computing a block in order of 48 to 80 clock cycles. With 30 - 40 loop iterations I would expect that we signal the stop when the hardware is already idle.

vogelpi commented 1 month ago

Update: I could reliably reproduce this on the FPGA using this command:

bazel test --test_output=streamed --cache_test_results=no //sw/device/tests/crypto:hmac_multistream_functest_fpga_cw310_sival_rom_ext

With 40 loop iterations, the test reliably hangs when streaming the first segment. Printing the status register before signaling the stop reveals that the FIFO is always empty, independent whether the test fails or passes. All other status bits/fields are 0. Also the ERR_CODE and MSG_LENGTH_LOWER registers don't provide useful insight.

I will now collect some waves with the following command and report back tomorrow:

util/dvsim/dvsim.py hw/top_earlgrey/dv/chip_sim_cfg.hjson -i chip_sw_hmac_multistream -t vcs --purge --reseed 1 --waves vpd

vsukhoml commented 1 month ago

I checked that if I'm just making a delay after HMAC_CMD_HASH_STOP_BIT, I can read a non-zero digest, which seems to be the state I'm looking for, but I can't continue - HMAC never reports INTR_STATE.hmac_done, so logic for next commands fails. Status can be either 2 or 3 though. I also tried to send hash_stop multiple times - it didn't affect behavior with no delay, but didn't help for the case with a delay. I suspect there is 96 cycle window between last byte/word of the block sent to FIFO and when hash_stop can be sent for regular operations. If hash_stop is delayed - it stuck in the state with no INTR_STATE.hmac_done.

I tried to "reset" HMAC by sending hash_process command after hash_stop and by sending invalid command with all 4 commands, but it didn't help, though STATUS.idle bit became set. For invalid command I got ERR_CODE=2, and couldn't reset this value, so stuck with status=00000003, err_code=00000002, intr_state=00000000.

So, besides finding a workaround other than using SW or OTBN for SHA2 with context switching, it would be nice to have means to bring HMAC into operational state after software errors like sending invalid command.

vsukhoml commented 1 month ago

While it is not a good use case, but I get to the same state with sequence of hash_start and hash_stop commands with zero length message in between. just one after another. status=2, intr_state=0. I'd say HMAC HW shall not hang on sequences like this.

vsukhoml commented 1 month ago

Another update - I tested what it to send hash_stop before I send last word of the block to fifo. it doesn't help and behavior is the same as with zero length message. Next command fails with intr_state.done not being set: intr_state=0 and status = 2. So it doesn't really work as a workaround.

Also, in my test with zero-length message between hash_start and hash_stop, I found that despite failure I read out correct SHA2 state in digest registers.

gdessouky commented 1 month ago

Thanks @vsukhoml, I'm trying to follow through your observations vs. the RTL but waveforms will definitely help a lot, so at this point I'm just suggesting other unverified ideas at high-level to try out.

Are you sure CFG.sha_en is set and remains enabled throughout all of this? Because getting ERR_CODE=2 at some point makes me wonder.
I checked that if I'm just making a delay after HMAC_CMD_HASH_STOP_BIT, I can read a non-zero digest, which seems to be the state I'm looking for, but I can't continue - HMAC never reports INTR_STATE.hmac_done, so logic for next commands fails.

What happens if you read your digest, then deassert CFG.sha_en as well as hash_start, hash_process, hash_stop and hash_continue? Does this bring it into operational state again?

vogelpi commented 1 month ago

@martin-velay and I could track down the RTL bug inside prim_sha2:

assign idle_o = (fifo_st_q == FifoIdle) && (sha_st_q == ShaIdle) && !hash_go;

For the primitive to signal the idle status, both the internal FIFO and the SHA core need to be idle.

The problem here is that the FIFO FSM only reacts on the msg_feed_complete pulse (resulting from the stop command) while being in the FifoWait state, i.e., after the internal FIFO has been emptied and while the SHA core is processing:

      FifoWait: begin
        if (msg_feed_complete && one_chunk_done) begin // <-- Only here we listen on the msg_feed_complete
          fifo_st_d      = FifoIdle;
          // hashing the full message is done
          hash_done_next = 1'b1;
        end else if (one_chunk_done) begin
          fifo_st_d      = FifoLoadFromFifo;
        end else begin
          fifo_st_d      = FifoWait;
        end
      end

Once the core is done, the one_chunk_done signal is raised and we go back in to the FifoLoadFromFifo state. This state doesn't listen onto the msg_feed_complete signal. The FSM is waiting for new data forever, but that new data is not coming because outside of the primitive, the stop is registered and we won't ever forward data again.

A potential fix would to make the FifoLoadFromFifo listen on the msg_feed_complete signal while the FIFO is empty:

      FifoLoadFromFifo: begin
        if (!shaf_rvalid) begin
          // Wait until it is filled
          fifo_st_d          = FifoLoadFromFifo;
          update_w_from_fifo = 1'b0;
          if (msg_feed_complete) begin // <-- potential RTL fix to be confirmed.
            fifo_st_d = FifoIdle;      // <-- potential RTL fix to be confirmed.
          end                          // <-- potential RTL fix to be confirmed.
        end else if (w_index_q == 4'd 15) begin
          fifo_st_d = FifoWait;
          // To increment w_index and it rolls over to 0
          update_w_from_fifo = 1'b1;
        end else begin
          fifo_st_d          = FifoLoadFromFifo;
          update_w_from_fifo = 1'b1;
        end
      end

A possible software workaround could be to:

Only send in the last word of the last block, if we now we're going to stop or to continue afterwards.
Or two always keep two full blocks ready.
Before sending the last word of the last block into the message FIFO before stopping, disable all interrupts using a CSR instruction.

Between sending the last word and sending the stop, we do have 64 clock cycles (SHA2-256) or 80 clock cycles (SHA2-384, 512). So we need to be fast. I think disabling interrupts is the only option we have here.

vogelpi commented 1 month ago

Here some waves to visualize the situation in the RTL:

martin-velay commented 1 month ago

I have reproduced the bug at block level, thanks for raising this issue and for the details.

Here is a potential RTL fix for this issue: https://github.com/lowRISC/opentitan/pull/24771
Meanwhile, I have also tested the potential workaround where we "insert" the delay before writing the last word in the message FIFO: writing "block_size - 1 word" words, inserting a large delay, writing the last word into the FIFO, then triggering the stop. And it seems to work well. If this can be acceptable, then it's fine. Otherwise we should find out another solution.
We should also investigate if we can recover from this hanged state with the current RTL.

vsukhoml commented 1 month ago

Could you please elaborate on "workaround where we "insert" the delay before writing the last word in the message FIFO: writing "block_size - 1 word" words, inserting a large delay, writing the last word into the FIFO, then triggering the stop. And it seems to work well. If this can be acceptable, then it's fine. Otherwise we should find out another solution."

Does it make things insensitive to delay between last word/byte sent to fifo and hash_stop? How long shall be a delay?

I tried to unblock HMAC when it stuck with continuing writes to MSG_FIFO after failed hash_stop, but it seems didn't work. So far various combinations of writes to fifo and cycling CFG_SHA_EN didn't change things. I observed fifo entry count in STATUS growing, but w/o anything, and once I reach 128 bytes it seems that I hang on the write to fifo(?).

vsukhoml commented 1 month ago

As for disabling interrupts - it may be a solution for some use cases, but not others - e.g. when cryptolib runs in the user mode, so can't really disable interrupts other than with some support from OS.

I also tried delay (for (size_t i = 0; i < 1000; i = launder32(i + 1)) ;) before the last word or byte of the block, but it doesn't change anything for me. To do that I check MESSAGE_LENGTH_LOWER would be aligned for a block for a given data, and if so, depending on alignment choose where to add delay - before last word or byte and do it.

martin-velay commented 1 month ago

Could you please elaborate on "workaround where we "insert" the delay before writing the last word in the message FIFO: writing "block_size - 1 word" words, inserting a large delay, writing the last word into the FIFO, then triggering the stop. And it seems to work well. If this can be acceptable, then it's fine. Otherwise we should find out another solution."

Does it make things insensitive to delay between last word/byte sent to fifo and hash_stop? How long shall be a delay?

I tried to unblock HMAC when it stuck with continuing writes to MSG_FIFO after failed hash_stop, but it seems didn't work. So far various combinations of writes to fifo and cycling CFG_SHA_EN didn't change things. I observed fifo entry count in STATUS growing, but w/o anything, and once I reach 128 bytes it seems that I hang on the write to fifo(?).

I am not sure it's relevant to solve the problem. But let's take a concrete example (simulation below), I am in SHA2-256 mode so I am expecting blocks of 512bits and my message size is 712 bits. Here is my sequence:

Writing 60 words of 8 bits in the message FIFO
Add delay of 200us (could be as long as you need)
Writing the last 32 bits to reach the 512 block boundary
Trigger a stop, within a delay shorter than the time needed to process the HASH around 64 clock cycles in SHA2-256 and 80 clock cycles in SHA2-384/512

Then I get the HASH done signal as expected for the intermediate HASH after the stop.

then I triggered HASH to continue
fill the remaining message up to 712 bits
trigger HASH process

And got another HASH done. And the digest is matching with the model.

martin-velay commented 1 month ago

Another point, I have also tried a bit to move out from this hang state, but I haven't succeeded so far. Unfortunately I can hardly work on this this week, but I'll try to. If we trigger a hash_go (which is hash_start or hash_continue) we can navigate into the fifo_st_q FSM:

      FifoIdle: begin
        if (hash_go) fifo_st_d = FifoLoadFromFifo;
        else         fifo_st_d = FifoIdle;
      end

And then hopefully getting out from this limbo state maybe by writing in the message FIFO or doing something like that, I need to dig more, I hope I'll have time later today.

gdessouky commented 1 month ago

Once there is a hang detected, I think we want to have hash_go = 0 (so both hash_start = 0 and hash_continue = 0 and stay so, both should be low at this point in time anyway) and then deassert sha_en (all commands stay low), this should navigate control to the FSM state fifo_st_d = FifoIdle and sha_st_d = ShaIdle, and would also assert the idle_o. Then feed sha_en = 1 again and start over with a new hash_start/continue (only issue hash_start or hash_continue when sha_en = 1 to avoid triggering error states). Can we try this sequence @martin-velay @vsukhoml?

vsukhoml commented 1 month ago

@martin-velay, unfortunately this is not a viable workaround:

Trigger a stop, within a delay shorter than the time needed to process the HASH around 64 clock cycles in SHA2-256 and 80 clock cycles in SHA2-384/512

The issue is that I can't control time between last byte/word and hash_stop due to interrupts. If I can disable interrupts, then workaround is to disable interrupts and make sure that compiler produce code which fulfills this requirement on timing.

In this hanged state I couldn't send any commands. So far I limited code to use HMAC only in hash_start/hash_process mode w/o context reloading, and implemented context reloading in software. I'll try again with hash_continue later.

I'd split the problem in 2 parts:

identify this hang state - so far I have a timeout based loop checking for STATUS.IDLE and then for INTR_STATE.DONE up to 512 times (more than enough, can be reduced), and then if status==0x2 and intr_state==0, then I consider it is "hang" state.
in this state however it looks like digest registers contains a valid state which can be read-out.
then I tried to restore processing in different ways

@gdessouky I tried to play with SHA_EN = 0, then 1, but it didn't seem to recover. I'll also try this method.

vsukhoml commented 1 month ago

@gdessouky I was able to get valid output by simply ignoring STATUS.IDLE =0 first time, and then I got STATUS.IDLE=1 for next commands, and ignoring INTR_STATE.hmac_done - this never recovers and always remains 0. I tried sending various commands, including CMD=0 (no command), before/after/between cycling SHA_EN, but it didn't change anything.

So one workaround can be to simply use delay (or, loop with timeout serving same purpose when HMAC hanged) after command rather than relying on status bits. Still, I faced an incorrect output with HMAC SHA384 key len=50, message len=128. SHA256/384/512 worked correctly though. Not sure if this is HMAC mode that matters somehow.

gdessouky commented 1 month ago

yes, instr_state.hmac_done will not recover. This is only set in fifo_st_d = FifoWait under particular conditions https://github.com/lowRISC/opentitan/blob/master/hw/ip/prim/rtl/prim_sha2.sv#L346, and once this hang state is triggered, I don't see a direct way to navigate to this. I believe HMAC core (for HMAC mode) waits on this signal from the SHA-2 engine that it is done hashing to start computing the second round and this never happens, so you will not get a correct HMAC output when HMAC mode is enabled, but you will get a correct SHA-2 output. But I still need to look at waveforms to confirm that there is no way we can recover HMAC output as well.

So my goal now is to recover the HMAC into operational state when this happens and be able to try again a new operation and get a result this time. I don't know how often is such a delay at this particular time expected, but I'd suggest you wait on either instr_state.hmac_done or a timeout, whichever occurs first. In case of timeout, you recover HMAC into operational state by cycling SHA_EN -> 0 -> 1, and then start the operation again, assuming it is unlikely you get an interrupt fired again at that particular point in time - do you think this sounds reasonable? Can you please test that if you start a new operation again (without delays) after recovering, you do get instr_state.hmac_done asserted and you can read the correct HMAC output @vsukhoml?

vsukhoml commented 1 month ago

@gdessouky I already do state by cycling SHA_EN -> 0 -> 1, but it doesn't resolve full operational state for HMAC, only SHA continues to work.

Can you please test that if you start a new operation again (without delays) after recovering, you do get instr_state.hmac_done asserted and you can read the correct HMAC output

I observe failure on CMD=hash_stop. I also noticed that despite no hmac_done I read correct state from DIGEST regs. So you suggest that after failure I'd need to run some other command, e.g. hash_start and power cycle SHA_EN? I think I tried variations of this procedure in attempts to recover, but will try again.

vsukhoml commented 1 month ago

So I tried to send hash_start command once I detect intr_state==0 and then cycle SHA_EN, tried to send hash_start followed by hash_process and cycle SHA_EN, but can't get intr_state.hmac_done to be set and can't get correct HMAC result. SHA kind of works in this mode if it is ok to live with intr_state.hmac_done=0.

martin-velay commented 3 weeks ago

Hi @vsukhoml,

I worked on finding a way to get out from this hanged state in simulation. I think I have found a way out. It's probably not the ideal and maybe there is a simpler one.

Steps to follow with the explanations:

Disable the HMAC to trigger sha_hash_continue_o based on the register reg_hash_continue from hmac_core.sv
Trigger HASH continue to move from StIdle state to StFifoReceive from prim_sha2_pad.sv this will enable us to trigger shaf_rvalid_o later, this will unlock us from the state fifo_st_q==FifoLoadFromFifo in the block prim_sha2.sv.
Get the current message length to know how much words we need to write to fall on the block boundary and trigger digest_on_blk to be able to move into done_state_d==DoneAwaitCmd in hmac.sv this will then lead to trigger hash_done_event and be back in a stable state on all the FSMs.
Write a dummy message into the message FIFO to trigger shaf_rvalid_o from prim_sha2_pad.sv
Finally trigger start_hash
Then you should see intr_state.hmac_done

I am testing it currently, I'll keep you posted.

martin-velay commented 3 weeks ago

Well it seems to work, at least now I see it going into IDLE and failing as the digest is not matching, which is normal as we wrote some wrong data into the message FIFO. The trick is maybe to save the context before doing all these steps and restoring it once we are "back on our legs". As I think the available context is updated on the block boundaries even if the stop is not triggered. So, the digest and the msg_length should be correct. Then do what is described below and restore the context as usual. So far I haven't been able to get this to work in simulation, but I hope it will on your side.

Insert delay to reveal the issue:

  // delay here, even  this small is enough
  for (size_t i = 0; i < 40; i = launder32(i + 1))
    ;

You can insert the following right after the delay:

  // Time to tell HMAC HWIP to stop, because we do not have enough message
  // bytes for another round.
  uint32_t cmd_reg =
      bitfield_bit32_write(HMAC_CMD_REG_RESVAL, HMAC_CMD_HASH_STOP_BIT, 1);
  abs_mmio_write32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET, cmd_reg);

  // Disable the HMAC to trigger sha_hash_continue_o based on the register reg_hash_continue 
  // from hmac_core.sv
  uint32_t cfg_reg = abs_mmio_read32(kHmacBaseAddr + HMAC_CFG_REG_OFFSET);
  cfg_reg = bitfield_bit32_write(cfg_reg, HMAC_CFG_HMAC_EN_BIT, false);
  abs_mmio_write32(kHmacBaseAddr + HMAC_CFG_REG_OFFSET, cfg_reg);

  // Trigger HASH continue to move from StIdle state to StFifoReceive from prim_sha2_pad.sv this
  // will enable us to trigger shaf_rvalid_o later, this will unlock us from the state 
  // fifo_st_q==FifoLoadFromFifo in the block prim_sha2.sv.
  cmd_reg = abs_mmio_read32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET);
  cmd_reg = bitfield_bit32_write(cmd_reg, HMAC_CMD_HASH_CONTINUE_BIT, true);
  abs_mmio_write32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET, cmd_reg);

  // Get the current message length to know how much words we need to write to fall on the block
  // boundary and trigger digest_on_blk to be able to move into done_state_d==DoneAwaitCmd in 
  // hmac.sv this will then lead to trigger hash_done_event and be back in a stable state on all 
  // the FSMs.
  msg_len = abs_mmio_read32(kHmacBaseAddr + HMAC_MSG_LENGTH_LOWER_REG_OFFSET);
  uint32_t digest_size = bitfield_field32_read(cfg_reg, HMAC_CFG_DIGEST_SIZE_FIELD);

  // Compute next block boundary
  uint32_t msg_length_to_wr;
  if (digest_size == 1) { // which is SHA2-256
    msg_length_to_wr = 512 - msg_len % 512;
  } else {
    msg_length_to_wr = 1024 - msg_len % 1024;
  }

  // Write a dummy message into the message FIFO to trigger shaf_rvalid_o from prim_sha2_pad.sv
  for (int i=0; i<msg_length_to_wr/32; i++) {
    abs_mmio_write32(kHmacBaseAddr + HMAC_MSG_FIFO_REG_OFFSET, 2863311530);
  }

  // Finally trigger hash_process
  cmd_reg = abs_mmio_read32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET);
  cmd_reg = bitfield_bit32_write(cmd_reg, HMAC_CMD_HASH_PROCESS_BIT, true);
  abs_mmio_write32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET, cmd_reg);

@vsukhoml please keep me informed

vsukhoml commented 3 weeks ago

@martin-velay , the purpose of hash_stop is to read out the state of hash (which would be in DIGEST registers) to be able to load back with hash_continue sometime later. When would it make sense to read out state in this sequence? I found that even if HMAC hangs, I can still read valid state from DIGEST registers, but in this sequence you disable HMAC (which is ok, I don't really need it until final step), and do hash_continue, so I'm not sure if DIGEST would be valid.

martin-velay commented 3 weeks ago

@vsukhoml, from what I can see on simulation the digest and the msg_length are updated after each block even if we don't trigger hash_stop. Using the stop command is required to move the internal logic into a clean state and to trigger a hash_done.

Here below is a set of screenshots based on different scenarios:

HMAC stop without the delay causing the issue raised in this ticket:
HMAC stop with the delay:
No stop triggered:

As you can see the digest is updated on the block boundary and is always the same in all the scenarios.

My suggestion is to take advantage of this behaviour to save the context after a stop if we notice a hang state. I've tested this on the FPGA and it works: I have inserted:

  context_save(ctx);

Right after:

  // Time to tell HMAC HWIP to stop, because we do not have enough message
  // bytes for another round.
  uint32_t cmd_reg =
      bitfield_bit32_write(HMAC_CMD_REG_RESVAL, HMAC_CMD_HASH_STOP_BIT, 1);
  abs_mmio_write32(kHmacBaseAddr + HMAC_CMD_REG_OFFSET, cmd_reg);

And commented out the one after we wait for IDLE state:

  // Wait for HMAC HWIP operation to be completed.
  HARDENED_TRY(hmac_idle_wait());

  // Store context into `ctx`.
  // context_save(ctx);

Result:

//sw/device/tests/crypto:hmac_multistream_functest_fpga_cw310_sival_rom_ext PASSED in 19.3s
Executed 1 out of 1 test: 1 test passes.

I know it's not clean, we should save the context after being on IDLE as it was done before if we are not hanged and we shouldn't go through all the steps I have mentioned earlier. We should also re-enable the HMAC after the trick if it was enabled before. To sum-up (pseudo-code):

...
trigger_stop();
if (hanged_after_stop) {
  save_context();
  hmac_en_bak = CFG.hmac_en;  // Store if HMAC is enabled of not (register CFG.hmac_en)
  unhang_hmac();  // Execute the steps described [here](https://github.com/lowRISC/opentitan/issues/24767#issuecomment-2426954419)
  CFG.hmac_en = hmac_en_bak;  // Restore CFG.hmac_en as it was before
  HARDENED_TRY(hmac_idle_wait());
} else {
  HARDENED_TRY(hmac_idle_wait());
  save_context();
}
...

What do you think? Is it a viable workaround?

martin-velay commented 3 weeks ago

@vsukhoml, I have prepared a PR https://github.com/lowRISC/opentitan/pull/24839, the code is not even compiling yet, but it was to show the principle of the proposed modification.

If you any of you @jadephilipoom / @ballifatih / @vsukhoml have time to work on it please push your changes on this PR. Thanks for your help! And sorry if my C code doesn't work (or comply).

vsukhoml commented 3 weeks ago

@martin-velay this may work - I will try some time later. However, what you say about digest registers and state means that I can just wait for some time (or may be use some status bit?) after last byte/word of the block is sent to FIFO, skip hash_stop, and then read digest and disable HMAC/SHA. This would be way simpler.

martin-velay commented 3 weeks ago

@vsukhoml, if the recover_hw_after_stop function from this PR https://github.com/lowRISC/opentitan/pull/24839 is there we can probably skip the stop. But I think it's not such a bit deal to keep this write, as this recover_hw_after_stop is only temporary and should be removed. Without this function, if you don't trigger the stop the HW should be in a state from where you probably cannot proceed.

vsukhoml commented 3 weeks ago

@martin-velay , I came to a simpler workaround:

After send to fifo wait 80 cycles
Read the state (digest, message length)
run hash_process as usual (we don't care about result here, but it is needed to bring block into proper state). If I don't run hash_stop or hash_process, I'll get issue on hash_continue later.

martin-velay commented 2 weeks ago

As the stop command seems to not be a strong requirement from a SW point of view, I have implemented the suggested workaround from @vsukhoml. The PR is available here https://github.com/lowRISC/opentitan/pull/24944.

I have tested it on the FPGA and at top level simulation with these commands: bazel test --test_output=streamed --cache_test_results=no //sw/device/tests/crypto:hmac_multistream_functest_fpga_cw310_sival_rom_ext

util/dvsim/dvsim.py hw/top_earlgrey/dv/chip_sim_cfg.hjson -i chip_sw_hmac_multistream -t xcelium --fixed-seed 1 --gui

It seems to work and the FSMs seem to be in proper states. As this workaround is simpler than the proposed one here https://github.com/lowRISC/opentitan/pull/24839, I would suggest to go with this new solution. But it also means that each time the stop command were supposed to be triggered, instead we are inserting a delay. I suppose this is not an issue for the performance, but I prefer to raise the point.

vogelpi commented 2 weeks ago

Thanks for your proposal @vsukhoml and thanks for giving this a try and doing a PR @martin-velay ! This is nice progress. Once your PR #24839 is merged the following three tasks remain open:

[x] Document the SW workaround in the HMAC programmer's guide. This needs to happen on the master as well as on the earlgrey_1.0.0 branch. @martin-velay , can you please take care of this? -> #24966
[ ] Work out an HW fix that can potentially be applied as metal fix. Martin already worked out a fix #24771, but we should evaluate how feasible it is to apply this as a metal fix or whether we need to try simplifying the fix further.
[ ] With the HW fix in place, we should tune HMAC DV to improve the randomization of delays between register accesses for triggering commands. The randomization should allow for bigger maximum delays in the order of 80 clock cycles. Maybe DV can just add a random delay before every command write.

To answer the question on why block-level DV did not catch this in the first place, I think there are multiple reasons:

We randomize the timing of register accesses in block-level DV by default: The relevant TL-UL host agent randomizes the delay for raising the address valid bit (a_valid_delay_max) and for raising data ready bit (d_ready_delay_max). But the default max delays are both 10 clock cycles which gives a max delay between HMAC receives the last data work and the stop command of 20 clock cycles which is not enough to cover this particular case here (we would need 64 or 80 clock cycles).
No one was aware of a risk of the hardware missing the stop pulse somewhere internally. And hence DV wasn't looking for this specifically. Without knowing that this should be tested specifically, there is no way to catch this (except through randomization). And since the RTL was not meant to handle this case, there was also no missing coverage. Instead coverage numbers for HMAC are reasonably high (95% line, 97% cond, 100% toggle, 100% FSM, 98% branch, 98% assert, 99.85% group) and we carefully reviewed any gaps (see internal sign-off doc). I've reviewed this doc again and there is really no coverage gap that could hint to this issue. We did have - and still have - very high confidence in the hardware (also the feature was successfully tested at the top-level but these tests equally didn't trigger the bug).
One could argue that the HMAC design specifically isn't built super robust in the sense that it doesn't always use valid/ready handshakes. Valid/ready handshakes are used between FIFOs but the commands are in many cases communicated using pulses. This means the design implicitly assumes that lower level hardware blocks are in specific state when triggering such pulses. In other designs, we use valid/ready handshakes more widely meaning a command would remain asserted until it's actually acknowledged by the lower level block. I think this would have helped in this case here and we should motivate people to use handshakes more widely.

FYI @moidx

martin-velay commented 2 weeks ago

Thanks @vogelpi for this summary. @moidx, I would like to add a few comments. Not to clear myself or to put the blame on somebody else, but more to find a better approach for next time.

Similar issues can be found in other blocks in the future, it may be wise to add a mandatory point in V3 for example where we run some relevant tests (/!\ not a smoke test) with longer random delays to try to mimic SW behaviour in case of interrupt (to be defined with SW specialists). To be verified and part of the verification plan, this should be defined in the specification. It will then have dedicated coverage metrics. Note: we probably shouldn't increase the register access delays from the beginning as the configuration could take ages. We can probably only enable this once the initial DUT configuration has been done. And this makes more sense as the SW probably won't get any interrupt during this phase.

The aim of verification is to ensure that the design behaves in accordance with its specifications. This approach can help to catch some issues, but cannot cover all real-world requirements, as it is not possible to specify them all. It's really difficult to predict in advance where the pitfalls are. That's why validation is useful. It aims to ensure that the ASIC will meet the needs of the end user and the requirements of the real world. I firmly believe that the role of FPGA-prototyping is crucial, because that's where the real software works with the real RTL in near-real-world conditions. This particular bug was discovered in this way, which is a good thing, but a little late because the TO had already taken place. So perhaps we should reconsider our flow and promote validation with FPGA-prototyping.

Nevertheless, I'm sincerely relieved that a viable SW workaround has been found. Thank you all!

martin-velay commented 2 weeks ago

@vogelpi, I have created a PR https://github.com/lowRISC/opentitan/pull/24966 to update the HMAC Programmer's Guide. I've done on the master for the moment.

lowRISC / opentitan

[hmac] `HMAC_CMD_HASH_STOP_BIT` and `HMAC_CMD_HASH_PROCESS` commands causing hang conditions #24767

Description