lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0
2.53k stars 754 forks source link

[entropy_complex] Debug hang issues #22753

Closed vogelpi closed 4 months ago

vogelpi commented 5 months ago

Description

While testing #22114 for M3 and investigating a failure of the //sw/device/silicon_creator/manuf/skus/earlgrey_a0/sival_bringup:ft_provision_fpga_cw310_sival test, it was discovered that under certain circumstances, the entropy complex seems to hang (see https://github.com/lowRISC/opentitan/pull/22716#issue-2251855658 for a detailed description).

Changing the reseed rate allowed us to temporarily work around the issue but we need to track down the underlying bug and make sure it's fixed for TO. What we could / should do towards this:

FYI @timothytrippel @moidx

timothytrippel commented 5 months ago

On the software front, we could created a separate testcase that just contains the parts of the ft_provision test that cause the entropy complex hang up if that would be helpful? I.e., we could create a top-level test that attempts to load an OTBN program (which triggers a secure wipe of the i/d-mems; thus requesting lots of entropy) and then attempt to do a keymgr operation which requests even more entropy. We could make this TLT device-code oonly, then there would be something to run DV.

Though if there is a block level DV test that could already be used to tease out the root problem that may be quicker.

andreaskurth commented 5 months ago

@vogelpi will work on this as soon as M3 is complete. This may be related to an issue reported by @pamaury where in ROM_EXT a huge amount of entropy is requested.

By @moidx: We could set up the same configuration in a DV test. By @vogelpi: We have an existing DV test running; it just takes a long time.

vsukhoml commented 5 months ago

I was trying to dump entropy from entropy_src through uart for analysis. And for the case when I just read 4000000 bits (500000 words) in few long reads, each about 80KB of data - everything works fine. But when I switch to restart test where I read chunks of 1024/4096 bits and print them via UART i start to observe dropped symbols - hex lines becomes ragged and it is very easily reproduced after about 300-500 seconds of work. Is there any reasonable path how disable/enable entropy source (with EDN disabled) can lead to this?

I use test entropy_src_fw_observe_many_continous as a basis and change:

static status_t entropy_config(
    dif_entropy_src_single_bit_mode_t single_bit_mode) {

  // Disable the entropy complex.
  TRY(entropy_testutils_stop_all());
  // Disable all health tests.
  TRY(entropy_testutils_disable_health_tests(&entropy_src));

  // Enable FW override.
  TRY(dif_entropy_src_fw_override_configure(
      &entropy_src,
      (dif_entropy_src_fw_override_config_t){
          .entropy_insert_enable = false,
          .buffer_threshold = kEntropySrcFifoThreshold,
      },
      kDifToggleEnabled));
  // Enable entropy_src.
  TRY(dif_entropy_src_configure(
      &entropy_src,
      (dif_entropy_src_config_t){
          .fips_enable = true,
          .route_to_firmware = false,
          .bypass_conditioner = false,
          .single_bit_mode = single_bit_mode,
          .health_test_threshold_scope = false,
          .health_test_window_size = kEntropySrcHealthTestWindowSize,
          .alert_threshold = UINT16_MAX},
      kDifToggleEnabled));

  return OK_STATUS();
}
...
static void print_uart(const char *buf, size_t len) {
  for (size_t i = 0; i < len; ++i) {
    while (dif_uart_byte_send_polled(uart, buf[i]) != kDifOk)
      ;
  }
  return;
}

// Dump in batches of 1000 samples and then restart
status_t firmware_override_observe_restart(
    int32_t nr_samples, dif_entropy_src_single_bit_mode_t single_bit_mode) {

  const int32_t total_rounds = (nr_samples + 999) / 1000;
  int32_t rounds_to_run = total_rounds;
  const int32_t bits_in_round = 1000 * ((single_bit_mode == kDifEntropySrcSingleBitModeDisabled) ? 4 : 1);
  // Slow computation: do it once.
  const int32_t bits_per_word = CHAR_BIT * sizeof(sample_buffer[0]);
  // We always read 32-bits. TODO: multiply by 4 in 4-bit mode?
  const int32_t words_in_round = (bits_in_round + bits_per_word - 1) / bits_per_word;
  const int32_t bytes_in_round = (bits_in_round + CHAR_BIT - 1) / CHAR_BIT;

  LOG_INFO("==================");
  LOG_INFO("Running restart test in mode %u, retrieve %u words, bytes %u, rounds %u",
           single_bit_mode, words_in_round, bytes_in_round, total_rounds);
  if (words_in_round > kContiguousSamplesCount)
    return INVALID_ARGUMENT();

  // Set timeout to worst case.
  ibex_timeout_t tmo = ibex_timeout_init(6000000);
  // Drain FIFO to make sure we get contiguous samples.
  print_uart("~~~\n", 4);
  uint32_t entropy_full_event = 0;
  uint64_t elapsed = 0;
  bool timeout = false;
  uint32_t uart_counter = 0;
  while (rounds_to_run) {
    // Collect samples from the the observe FIFO.
    int32_t words_to_read = words_in_round;
    uint32_t *sample_buffer_ptr = sample_buffer;

    // Restart the entropy complex.
    entropy_config(single_bit_mode);

    // Collect.
    while (words_to_read > 0 && !(timeout = ibex_timeout_check(&tmo))) {
      size_t len = (size_t)words_to_read;
      uint32_t err_code;
      TRY(dif_entropy_src_get_errors(&entropy_src, &err_code));
      if (err_code)
        LOG_ERROR("entropy_src status. err: 0x%x", err_code);

      // Check FIFO did not overflow during collection.
      entropy_full_event += entropy_src_fifo_has_overflowed();
      TRY(dif_entropy_src_observe_fifo_nonblocking_read(
          &entropy_src, sample_buffer_ptr, &len));
      sample_buffer_ptr += len;
      words_to_read -= len;
    }
    elapsed += ibex_timeout_elapsed(&tmo);
    // Handle partial & timed-out reads
    rounds_to_run--;
    if (timeout)
      break;
    // Print output
    uart_counter = print_hex_uart((uint8_t *)sample_buffer,
                                  (size_t)bytes_in_round, uart_counter);
    // Reset timer for next iteration
    tmo = ibex_timeout_init(6000000);
  }
  print_uart("\n~~~\n", 5);
  TRY_CHECK(!timeout, "Timeout during capture");
  TRY_CHECK(entropy_full_event == 0, "Unexpected stalls at collection %u",
            entropy_full_event);
  // Make sure the FIFO did not overflow.

  uint64_t freq = udiv64_slow(
      ((uint64_t)(total_rounds - rounds_to_run) * (uint64_t)words_in_round) *
          (uint64_t)1000000,
      elapsed, NULL);
  LOG_INFO("done in %ums (~ %usamples/s)",
           (uint32_t)udiv64_slow(elapsed, 1000, NULL), (uint32_t)freq);

  return OK_STATUS();
}

No entropy source errors are reported in the process.

vogelpi commented 4 months ago

I would like to start looking into this issue this week. @timothytrippel , it would be awesome if you could create that separate testcase for simulation. I would then have a perfect basis to start my investigation from (after undoing some recent keymgr and software changes). In parallel, @h-filali is investigating some EDN block-level DV failures which may be related.

timothytrippel commented 4 months ago

@vogelpi : I created a created a branch with the example test: https://github.com/timothytrippel/opentitan/tree/entropy-complex-dbg

You can run it on the CW310 fpga or in DV with the test ROM. I was not able to replicate the failure on the FPGA like I had before, but you mentioned you needed to unwind some HW changes to trigger this most likely? Anywho, the test emulates as closely as possible the entropy complex operations executed by the ft_personalize.c binary. In the process, I also became aware we are probably not being as efficient as possible with the calls, so I plan to clean things up in a follow up PR as well.

Lmk, if you have any questions. Thanks for your help!

vogelpi commented 4 months ago

Thanks again Tim for preparing this test. I've now checked out the repository at 903b31917104b6d9e22f0eb41941a72415109a8a (After merging @h-filali 's PR #22114), then reverted your commit 667b717d38890a04fb2fc29065d4fdd068b23897 (PR 22716, I think the same commit is also part of your referenced dev branch) and then added your commit to get the test application. I've generated the CW310 FPGA bistream and also collected waves using DVSIM. What I've found is the following:

UVM_INFO @ 14791.718670 us: (sw_logger_if.sv:524) [entropy_complex_dbg_sim_dv(sw/device/tests/entropy_complex_dbg.c:165)] Coming out of SW Reset.
UVM_INFO @ 19609.148670 us: (sw_logger_if.sv:370) [tb.u_sim_sram.u_sim_sram_if.u_sw_logger_if.construct_log_and_print.unmblk1.unmblk1.unmblk1.isolation_thread.unmblk1.unmblk1] String arg at addr 20008408: Internal
       5: sw_log.format = $sformatf(sw_log.format, `_ADD_ARGS(5));
                                  |
xmsim: *W,SYSFMW (/home/pirmin/ot/opentitan/scratch/entropy-complex-dbg-3/chip_earlgrey_asic-sim-xcelium/default/src/lowrisc_dv_sw_logger_if_0/sw_logger_if.sv,444|34): $psprintf/$sformatf: Illegal format ignored  %0c.
       5: sw_log.format = $sformatf(sw_log.format, `_ADD_ARGS(5));
                                  |
xmsim: *W,SYSFMW (/home/pirmin/ot/opentitan/scratch/entropy-complex-dbg-3/chip_earlgrey_asic-sim-xcelium/default/src/lowrisc_dv_sw_logger_if_0/sw_logger_if.sv,444|34): $psprintf/$sformatf: Illegal format ignored  %0c.
       5: sw_log.format = $sformatf(sw_log.format, `_ADD_ARGS(5));
                                  |
xmsim: *W,SYSFMW (/home/pirmin/ot/opentitan/scratch/entropy-complex-dbg-3/chip_earlgrey_asic-sim-xcelium/default/src/lowrisc_dv_sw_logger_if_0/sw_logger_if.sv,444|34): $psprintf/$sformatf: Illegal format ignored  %0c.
UVM_ERROR @ 19609.148670 us: (sw_logger_if.sv:526) [entropy_complex_dbg_sim_dv(sw/device/tests/entropy_complex_dbg.c:173)] CHECK-STATUS-fail: MK@:1 = Internal
UVM_INFO @ 19609.148670 us: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER] 

Do you have any idea what the problem could be with DVSIM?

vogelpi commented 4 months ago

I think I got a bit closer: keymgr produces an error code both on FPGA and in simulation. The error code is 0x2 which according to the doc stands for:

2   rw1c    0x0 INVALID_SHADOW_UPDATE   An error observed during shadow register updates, asynchronous error
1   rw1c    0x0 INVALID_KMAC_INPUT  Invalid data issued to kmac interface, synchronous error
0   rw1c    0x0 INVALID_OP  Invalid operation issued to key manager, synchronous error

invalid KMAC input.

moidx commented 4 months ago

Is it failing at this check?

TRY(sc_keymgr_state_check(kScKeymgrStateCreatorRootKey));
vogelpi commented 4 months ago

Yes, that's the failing check @moidx .

vogelpi commented 4 months ago

After many many experiments and discussions with @timothytrippel , we haven't been able to reproduce the original issue again. Neither in simulation nor on FPGA. Still, I think it's useful to report the findings and document what we've tried.

Starting point

What we've tried

Background Information

So to summarize, we've done a lot attempting to reproduce the issue but everything failed. Given how hard it is to reproduce this failure I am thinking that this was probably a sporadic FPGA issue happening for some time but not something we have to keep worrying about. There is one last hope which is to retrieve and retry an FPGA bitstream that revealed the issue previously. @h-filali might still have one, I need to check with him once he's back early next week.

timothytrippel commented 4 months ago

Thanks for all the debugging and documenting this @vogelpi ! I think this sounds like a reasonable course of action for now.

vogelpi commented 4 months ago

We've discussed in the triage meeting to extend the edn_csrng_concurrency test by:

We can then observe the regression dashboard to see how things evolve.

vogelpi commented 4 months ago

The PR to extend the csrng_edn_concurrency test is here: #23277

andreaskurth commented 4 months ago

next: could increase the number of tests or the concurrent CSRNG/EDN executions in nightlies. other than that we're done

vogelpi commented 4 months ago

PR 23314 which adds some further optimizations to speed up the test and increases the number of seeds in the nightlies has been merged now. We'll keep observing the nightlies to see if any failures pop up. I am closing this issue now.