lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0

[test-triage] chip_sw_all_escalation_resets #14899

Closed johngt closed 1 year ago

johngt commented 2 years ago

Hierarchy of regression failure

Chip Level

Failure Description

Test chip_sw_all_escalation_resets has 2 failures.

17.chip_sw_all_escalation_resets.1083579376
Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/17.chip_sw_all_escalation_resets/latest/run.log

  File "/root/.cache/bazel/_bazel_default/96f40d218badc03ed33c48f246ec2aa1/external/rules_rust/rust/private/repository_utils.bzl", line 568, column 20, in load_arbitrary_tool
      ctx.extract(

Error in extract: java.io.IOException: Error extracting /root/.cache/bazel/_bazel_default/96f40d218badc03ed33c48f246ec2aa1/external/rust_linux_x86_64/rust-1.60.0-x86_64-unknown-linux-gnu.tar.gz to /root/.cache/bazel/_bazel_default/96f40d218badc03ed33c48f246ec2aa1/external/rust_linux_x86_64: Unexpected end of ZLIB input stream
ERROR: /workspace/mnt/repo_top/sw/host/opentitantool/BUILD:10:12: //sw/host/opentitantool:opentitantool depends on @rust_linux_x86_64//:toolchain_for_x86_64-unknown-linux-gnu_impl in repository @rust_linux_x86_64 which failed to fetch. no such package '@rust_linux_x86_64//': java.io.IOException: Error extracting /root/.cache/bazel/_bazel_default/96f40d218badc03ed33c48f246ec2aa1/external/rust_linux_x86_64/rust-1.60.0-x86_64-unknown-linux-gnu.tar.gz to /root/.cache/bazel/_bazel_default/96f40d218badc03ed33c48f246ec2aa1/external/rust_linux_x86_64: Unexpected end of ZLIB input stream
ERROR: Analysis of target '//sw/device/tests/sim_dv:all_escalation_resets_test_sim_dv' failed; build aborted:
INFO: Elapsed time: 704.250s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (244 packages loaded, 13962 targets configured)
FAILED: Build did NOT complete successfully (244 packages loaded, 13962 targets configured)
make: *** [/workspace/mnt/repo_top/hw/dv/tools/dvsim/sim.mk:73: sw_build] Error 1

Steps to Reproduce

This test has been failing for the last 5 days (Sept 12, Sept 11, Sept 10 and Sept 9), although its pass rate reaches the high 90s percent. The previous run, on Sept 7, was at 0% (the Sept 8 run is missing).

GH Commit: https://github.com/lowrisc/opentitan/tree/3c54b1eb225f685fe528a88992a1b8de3e30cd4b
Build seed: 3969941873

Last day with complete failure was https://reports.opentitan.org/hw/top_earlgrey/dv/2022.09.07_23.52.23/report.html

Tests with similar or related failures

johngt commented 2 years ago

This looks like a package / tarball issue, which may mean a file failed to download completely. Since this is currently a software task, assigning to @engdoreis.

engdoreis commented 2 years ago

Command to reproduce:

./util/dvsim/dvsim.py hw/top_earlgrey/dv/chip_sim_cfg.hjson -w -i chip_sw_all_escalation_resets

engdoreis commented 2 years ago

A quick update: I found that an exception occurs inside this function when the IP under test is flash_ctrl, just before the SV code triggers the fault.

/**
 * Logs `log` and the values that follow in an efficient, DV-testbench
 * specific way, which bypasses the UART.
 *
 * @param log a pointer to log data to log. Note that this pointer is likely to
 *        be invalid at runtime, since the pointed-to data will have been
 *        stripped from the binary.
 * @param nargs the number of arguments passed to the format string.
 * @param ... format parameters matching the format string.
 */
void base_log_internal_dv(const log_fields_t *log, uint32_t nargs, ...) {
  mmio_region_t log_device = mmio_region_from_addr(kDeviceLogBypassUartAddress);
  mmio_region_write32(log_device, 0x0, (uintptr_t)log);

  va_list args;
  va_start(args, nargs);
  for (int i = 0; i < nargs; ++i) {
    mmio_region_write32(log_device, 0x0, va_arg(args, uint32_t));
  }
  va_end(args);
}
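
To make the write sequence of this function easier to follow, here is a minimal host-side sketch. It assumes a capture buffer in place of the MMIO log device; `mock_log_write32`, `mock_log_internal_dv`, `mock_capture` and the `log_id` parameter are hypothetical names for illustration and are not part of the OpenTitan sources.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the testbench-visible log register: every
 * 32-bit write is appended to a capture buffer instead of going to MMIO. */
enum { kMockCaptureDepth = 16 };
static uint32_t mock_capture[kMockCaptureDepth];
static size_t mock_capture_len = 0;

static void mock_log_write32(uint32_t value) {
  assert(mock_capture_len < kMockCaptureDepth);
  mock_capture[mock_capture_len++] = value;
}

/* Same shape as base_log_internal_dv: one write that identifies the log
 * entry, then one write per variadic argument. `log_id` replaces the
 * pointer (an assumption) so the sketch also behaves on 64-bit hosts. */
static void mock_log_internal_dv(uint32_t log_id, uint32_t nargs, ...) {
  mock_log_write32(log_id);

  va_list args;
  va_start(args, nargs);
  for (uint32_t i = 0; i < nargs; ++i) {
    mock_log_write32(va_arg(args, uint32_t));
  }
  va_end(args);
}
```

A call with two arguments therefore produces three back-to-back writes. On the device each of those writes sits between instruction fetches, so a fetch that is blocked at that moment would surface inside this function.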


engdoreis commented 2 years ago

@luismarques for visibility

engdoreis commented 2 years ago

Test

This test is doing the following for several IPs:

  1. Initialize the IP and enable NMI for alert handler in the .c
  2. Signal to the SV testbench that it is ready for a fault injection, then execute a wfi.
  3. Several external IRQs and NMIs fire; they are counted and stored in flash to be verified later.
  4. Reboot due to the escalation.
  5. Check that at least one external IRQ and one NMI has been fired before the reboot.
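
The steps above can be sketched as a small model of the state that has to survive the reset. This is only an illustration under stated assumptions: `persistent_state_t`, `next_action` and the other names are hypothetical, and a plain struct stands in for the counters the real test keeps in flash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical persistent state; the real test keeps this in flash so
 * that it survives the escalation reset. */
typedef struct persistent_state {
  uint32_t irq_count; /* regular (external) interrupts seen */
  uint32_t nmi_count; /* NMIs seen */
  bool fault_armed;   /* set just before the wfi in step 2 */
} persistent_state_t;

typedef enum boot_action {
  kBootActionArmFault,   /* first boot: steps 1 and 2 */
  kBootActionCheckCounts /* boot after the escalation reset: step 5 */
} boot_action_t;

/* Decide what to do on this boot from the persisted state. */
static boot_action_t next_action(const persistent_state_t *s) {
  assert(s != NULL);
  return s->fault_armed ? kBootActionCheckCounts : kBootActionArmFault;
}

/* Step 3: each handler only bumps its persisted counter. */
static void on_external_irq(persistent_state_t *s) { ++s->irq_count; }
static void on_nmi(persistent_state_t *s) { ++s->nmi_count; }

/* Step 5: at least one interrupt of each kind before the reset. */
static bool check_after_reset(const persistent_state_t *s) {
  assert(s != NULL);
  return s->irq_count >= 1 && s->nmi_count >= 1;
}
```

The model makes the dependency explicit: step 5 can only pass if the handlers in step 3 actually ran and their counts reached persistent storage before the reset.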

Issue

The test fails only for the flash_ctrl IP. After a deeper investigation, I found that Ibex raises an exception with cause=01 (instruction access fault) at the instruction addi sp,sp,32, exactly when the flash_ctrl fault is injected. After discussing with @GregAC, we concluded that, because of the fault, the flash blocks instruction fetches.
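
For readers less familiar with the cause value, here is a minimal sketch of how a 32-bit RISC-V mcause value decodes, following the standard layout (top bit set for interrupts, exception code in the remaining bits). The helper names are hypothetical and only a subset of the standard exception codes is listed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 31 of a 32-bit mcause distinguishes interrupts from exceptions. */
static bool mcause_is_interrupt(uint32_t mcause) {
  return (mcause >> 31) != 0;
}

/* The remaining bits hold the exception (or interrupt) code. */
static uint32_t mcause_code(uint32_t mcause) {
  return mcause & 0x7fffffffu;
}

/* Name a subset of the standard exception codes. Only valid for
 * exceptions, so interrupts are rejected up front. */
static const char *mcause_exception_name(uint32_t mcause) {
  assert(!mcause_is_interrupt(mcause));
  switch (mcause_code(mcause)) {
    case 0:  return "instruction address misaligned";
    case 1:  return "instruction access fault";
    case 2:  return "illegal instruction";
    case 3:  return "breakpoint";
    case 5:  return "load access fault";
    case 7:  return "store/AMO access fault";
    case 11: return "environment call from M-mode";
    default: return "other";
  }
}

/* The case reported in this issue: an exception with code 1. */
static bool mcause_is_instr_access_fault(uint32_t mcause) {
  return !mcause_is_interrupt(mcause) && mcause_code(mcause) == 1;
}
```

Code 1 means the fetch itself was refused, which is consistent with the conclusion that the flash stops serving instruction fetches once the fault is injected.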

Solutions

Here are some suggested approaches to tackle this issue:

  1. Run the test from SRAM (for the flash_ctrl case only; otherwise the sram_ctrl case would hit the same issue).
  2. Move the flash_ctrl case to a separate test that relies more on SystemVerilog.
  3. Move the flash_ctrl case to a separate test that runs from SRAM.

@tjaychen @matutem Please let me know your thoughts.

weicaiyang commented 1 year ago

@engdoreis did you investigate this error message? The build error is gone now.

UVM_ERROR @ 8760.268292 us: (sw_logger_if.sv:522) [all_escalation_resets_test_prog_sim_dv(w/device/tests/sim_dv/all_escalation_resets_test.c:915)] CHECK-fail: Expected at least one regular interrupt

engdoreis commented 1 year ago

Yes, exactly this error.

weicaiyang commented 1 year ago

Thanks @engdoreis. Let me reassign to @matutem to fix it.

matutem commented 1 year ago

@tjaychen and @matutem discussed a couple of options:

  1. Run the case of flash_ctrl error injection from ROM.
  2. Add alert crash dump capture to confirm the alert was received.

The first option does not address all the issues: the test also records in flash whether the regular interrupt and the NMI were received, and the ISRs check that the alert is recorded correctly.

Option 2 has the advantage that the alert crash dump provides extra confirmation that the alert is correct, so it benefits all cases and reduces the need for the ISRs to have run. Note, however, that for all errors except those in flash_ctrl the test will still check that the ISRs ran and that all the checks they perform were successful. For flash_ctrl errors the test will skip the check that the ISRs ran.
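
The check policy described for option 2 could look roughly like the sketch below. It is an illustration only: `fault_target_t`, `reset_evidence_t` and `escalation_check_passes` are hypothetical names, and the fields are assumptions about what the test would gather after the reset, not the actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: which IP the fault was injected into. */
typedef enum fault_target {
  kFaultTargetFlashCtrl,
  kFaultTargetOther
} fault_target_t;

/* Hypothetical summary of what the test finds after the reset. */
typedef struct reset_evidence {
  bool crash_dump_alert_ok; /* alert crash dump shows the expected alert */
  bool irq_seen;            /* regular interrupt recorded */
  bool nmi_seen;            /* NMI recorded */
  bool isr_checks_passed;   /* checks performed inside the ISRs passed */
} reset_evidence_t;

static bool escalation_check_passes(fault_target_t target,
                                    const reset_evidence_t *ev) {
  assert(ev != NULL);
  /* The crash dump confirmation applies to every case. */
  if (!ev->crash_dump_alert_ok) {
    return false;
  }
  /* flash_ctrl faults can block instruction fetch, so the ISR-related
   * checks are skipped for that case only. */
  if (target == kFaultTargetFlashCtrl) {
    return true;
  }
  return ev->irq_seen && ev->nmi_seen && ev->isr_checks_passed;
}
```

With this split, the flash_ctrl case relies on the crash dump alone, while every other case keeps the stricter ISR checks.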