[sival] rv_core_ibex_nmi_irq_test_silicon_owner_sival_rom_ext failure

mundaym commented 6 months ago

Description

Enabling alert_handler ping mechanism results in alert_handler NMI without any reported local of regular alerts. The only way to recover from this is by resetting the device.

pamaury commented 6 months ago

I think @nbdd0121 has investigated the test in more depth and might comment here?

andreaskurth commented 6 months ago

@moidx is currently investigating this. If there are any comments from others please provide them

moidx commented 6 months ago

Configuring the alert_handler triggers nmi interrupt without any alert class association.

https://github.com/lowRISC/opentitan/blob/605da270ddcf70b2af5573e9524f9613bc89383d/sw/device/tests/rv_core_ibex_nmi_irq_test.c#L98-L101

The alert NMI is triggered even when there are no pending alerts. Verified this by removing the force alert.

https://github.com/lowRISC/opentitan/blob/605da270ddcf70b2af5573e9524f9613bc89383d/sw/device/tests/rv_core_ibex_nmi_irq_test.c#L203-L204

Disabling the ping timer removes the spurious NMI issue:

https://github.com/lowRISC/opentitan/blob/605da270ddcf70b2af5573e9524f9613bc89383d/sw/device/lib/testing/alert_handler_testutils.c#L139-L141

The following write in dif_alert_handler.c is what triggers the NMI from alert_handler:

  if (enabled == kDifToggleEnabled) {
    mmio_region_write32_shadowed(
        alert_handler->base_addr,
        ALERT_HANDLER_PING_TIMER_EN_SHADOWED_REG_OFFSET, 1);
  }

moidx commented 6 months ago

I tried switching the test to use a recoverable alert, different escalation sequences and a valid ping timeout timer configuration, but none of these changes helped to move the alert handler NMI.

I will submit these test updates separately, but I think at this point I am going to flag this for CDC analysis.

CC: @a-will @matutem @nbdd0121 who also took a look at this issue.

moidx commented 6 months ago

I am going to measure the delay between enabling the ping mechanism and the NMI to try to determine if this is an issue with the reverse ping mechanism supported by the receivers.

This was recommended by @msfschaffner.

moidx commented 6 months ago

Measured alert NMI trigger time

Running the test with the alerts and ping configured, and without triggering the alert results in the first alert NMI triggering between 500-4000 microseconds.

This as measured with rv_timer using an equivalent 1us tick.

alert_hander NMI trigger time when alert ping is disabled

The interrupt fires consistently within 8-9 us. This seems to indicate that the issue is due to a ping timeout.

Reverse ping timeout calculation

The reverse ping timeout calculation is done using the following formula available in prim_esc_receiver:

4  * N_ESC_SEV * (2 * 2 * 2^PING_CNT_DW)

pwrmgr is the only block consuming the N_ESC_SEV and PING_CNT_DW compile time parameters:

alert_handler_reg_pkg::N_ESC_SEV = 4
alert_handler_reg_pkg::PING_CNT_DW = 16

The alert escalation responder inside pwrmgr is connected to the io_div4 clock, yielding a target 24MHz frequency. The result expected timeout based on the above parameters is thus:

reverse_ping_timeout = 0.175s = (4 * 4 ( 2 * 2 * 2^16)) / 24e6

The interrupt trigger measurement does not seem to rule out any potential issues with the reverse ping mechanism.

alert_handler configuration

  uint32_t cycles[3] = {0};
  CHECK_STATUS_OK(alert_handler_testutils_get_cycles_from_us(
      kEscalationPhase0Micros, &cycles[0]));
  CHECK_STATUS_OK(alert_handler_testutils_get_cycles_from_us(
      kEscalationPhase2Micros, &cycles[1]));
  CHECK_STATUS_OK(alert_handler_testutils_get_cycles_from_us(kIrqDeadlineMicros,
                                                             &cycles[2]));
  dif_alert_handler_escalation_phase_t esc_phases[] = {
      {.phase = kDifAlertHandlerClassStatePhase0,
       .signal = 0,
       .duration_cycles = cycles[0]},
      {.phase = kDifAlertHandlerClassStatePhase1,
       .signal = 3,
       .duration_cycles = cycles[1]}};
  dif_alert_handler_class_config_t class_config[] = {{
      .auto_lock_accumulation_counter = kDifToggleDisabled,
      .accumulator_threshold = 0,
      .irq_deadline_cycles = cycles[2],
      .escalation_phases = esc_phases,
      .escalation_phases_len = ARRAYSIZE(esc_phases),
      .crashdump_escalation_phase = kDifAlertHandlerClassStatePhase2,
  }};

  dif_alert_handler_alert_t alerts[] = {kTopEarlgreyAlertIdAesRecovCtrlUpdateErr};
  dif_alert_handler_class_t alert_classes[] = {kDifAlertHandlerClassA};
  dif_alert_handler_class_t classes[] = {kDifAlertHandlerClassA};
  dif_alert_handler_config_t config = {
      .alerts = alerts,
      .alert_classes = alert_classes,
      .alerts_len = ARRAYSIZE(alerts),
      .classes = classes,
      .class_configs = class_config,
      .classes_len = ARRAYSIZE(class_config),
      .ping_timeout = 0x100,
  };

moidx commented 6 months ago

@andreaskurth, this test is reproducible without a ROM_EXT running. @a-will suggested we can try to get a DV test configuration ready to run in GLS in case this is something we want to try.

CC: @sha-ron @OTshimeon

andreaskurth commented 5 months ago

By @moidx: We can run chip_sw_rv_core_ibex_nmi_irq with test ROM on the netlist. We need to update the testcase to not trigger a fake alert, just wait. Then we shouldn't see any NMIs and can wait for the timeout. We should run this GLS over the next weekend. @moidx will create a test case.

moidx commented 5 months ago

Created the test case in #23441 and added test point to the GLS test plan. If unable to debug further on Z1, I propose we close this issue and consider removing dropping the ping mechanism from A1 if we continue to run into problems during bring-up. We can make this decision as part of M5 triage.

@johannheyszl FYI, since we'll have to test alert handler behavior with pinging mechanism disabled for Z1.

vogelpi commented 5 months ago

Discussed during triage meeting. Okay to move this to M5.

johannheyszl commented 5 months ago

@moidx thanks for the heads up. IMHO this is OK for the testing we currently do, i.e. not fully invasive and cutting wires.

andreaskurth commented 4 months ago

Moving this to M6 as P1. It should be tested early on the final netlist. CC @sha-ron, we'll discuss this in our next meeting

moidx commented 3 months ago

https://github.com/lowRISC/opentitan/issues/24119 tracks the findings after running the sw_alert_handler_ping_ok test post synthesis.

vogelpi commented 3 months ago

@moidx We could add this test to the GLS test plan as a P2. It should now be passing now with the latest ECO fixes in.

andreaskurth commented 3 months ago

Confirmed that chip_sw_rv_core_ibex_nmi_irq is on the GLS testplan. Suggest moving this to M7 (still as P1).

moidx commented 2 months ago

Closing this issue as the sw_alert_handler_ping_ok is now passing in GLS. We can create a new issue if rv_core_ibex_nmi_req_irq_test we are able to run the test in GLS and results in failure.

lowRISC / opentitan