lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0
2.56k stars 762 forks source link

[prodc-integration] Issue with System Reset Controller behavior for EC_RST_L on deep sleep wake #22204

Closed jettr closed 5 months ago

jettr commented 7 months ago

Description

During our testing of the EC_RST_L pin (IOR8) and the System Reset Controller, it appears that the ES chip will incorrectly assert EC_RST_L low in ROM (or hardware) while the chip is waking up from deep sleep. We need the EC_RST_L signal to be maintained as highZ while entering and exiting deep sleep for the OT chip.

The desired behavior for EC_RST_L from System Reset Controller is

Here is a logic analyzer capture of the EC_RST_L signal on ES chip, where PWR_BTN_L is a deep sleep wake up source. EC_RST_L correctly maintains the highZ value upon deep sleep enter and for the duration of deep sleep. When the chip is waking up, however, EC_RST_L is incorrectly asserted low. This is a really important feature for us, so we definitely need to figure out what is going on and how we can fix it.

image

pamaury commented 7 months ago

I have done the sival implementation for this block and I think the above behaviour matches the documentation which says:

The ec_rst_l_i / ec_rst_l_o signals are connected to the same bidirectional pin of the OpenTitan chip, and are used to either reset the embedded controller (EC), or to detect self-reset of the EC and stretch the reset pulse (hence the bidirectional nature of this pin). This output is always asserted when sysrst_ctrl is reset (allowing its use as a power-on reset) and remains asserted until released by software.

Since the block is reset when coming of deep sleep, it seems to follow that the pin is asserted low as expected. I agree though that this might not be the desired behaviour.

a-will commented 7 months ago

Did you configure the pinmux to hold the pad in high-z state upon deep sleep entry (then only clear deep sleep mode after setting up the override for ec_rst_l in sysrst_ctrl)?

I believe the pinmux is the peripheral to target if you want to have different behaviors for pads for deep sleep vs. power-on reset.

jettr commented 7 months ago

Since the block is reset when coming of deep sleep, it seems to follow that the pin is asserted low as expected. I agree though that this might not be the desired behavior.

I am hopeful that we can figure out a way in sysrst_ctrl hardware to differentiate between PoR reset and deep sleep resume reset. It is easy to imagine how we ended at our current implementation, and I am hoping we can change the behavior before it is too late.

Did you configure the pinmux to hold the pad in high-z state upon deep sleep entry (then only clear deep sleep mode after setting up the override for ec_rst_l in sysrst_ctrl)?

From what I can gather from the pinmux for ES, IOR8/EC_RST_L is directly connected to sysrst_ctrl out as input and output so the pinmux cannot hold a value on EC_RST_L during deep sleep. Please let me know if I am overlooking something though.

a-will commented 7 months ago

Did you configure the pinmux to hold the pad in high-z state upon deep sleep entry (then only clear deep sleep mode after setting up the override for ec_r From what I can gather from the pinmux for ES, IOR8/EC_RST_L is directly connected to sysrst_ctrl out as input and output so the pinmux cannot hold a value on EC_RST_L during deep sleep. Please let me know if I am overlooking something though.

That's not the correct attribute to look at. DIOs support pinmux overrides for deep sleep as well. We use this for USB deep sleep support. It is just the input and output muxes for connecting peripherals that reduce to nothing for DIOs vs. MIOs.

https://github.com/lowRISC/opentitan/blob/18b0cedda19564f347d2ec0c1b9988ee16bd0d10/hw/top_earlgrey/sw/autogen/top_earlgrey.h#L1640

https://opentitan.org/book/hw/ip/pinmux/doc/registers.html#dio_pad_sleep_mode

jettr commented 7 months ago

Thanks for pointing that out. I am still having trouble making this work. The DIO_PAD_SLEEP_EN register has the following description:

Deep sleep mode enable. If this bit is set to 1 the corresponding pad will enable the sleep behavior specified in DIO_PAD_SLEEP_MODE upon deep sleep entry, and the corresponding bit in DIO_PAD_SLEEP_STATUS will be set to 1. The pad remains in deep sleep mode until the corresponding bit in DIO_PAD_SLEEP_STATUS is cleared by SW. Note that if an always on peripheral is connected to a specific DIO pad, the corresponding DIO_PAD_SLEEP_EN bit should be set to 0.

but it appears that something in hardware or early ROM is clearing DIO_PAD_SLEEP_STATUS or the equivalent since EC_RST_L is already asserted low before ROM_EXT even starts executing.

I tried out using the DIO_PAD_SLEEP_MODE.Tie-Low just to see if I was using this register correctly, and I do see that EC_RST_L will go low as soon as the chip goes in to deep sleep, which is a good test to see that I am using the register correctly.

TiedLow

When I use DIO_PAD_SLEEP_MODE.Tie-High, the scope capture looks the same -- EC_RST_L is asserted low on wakeup before ROM_EXT starts executing.

TiedHigh

Thoughts on anything else I can try?

a-will commented 7 months ago

~Looks like this might be a bug (unconfirmed). It seems that the sleep enablement is getting reset by a non-aon reset. This clock and reset choice for sleep_en_q looks like a bug to me:~

https://github.com/lowRISC/opentitan/blob/51918fbffefa4394a25131f91be4ec9ba2093020/hw/ip/pinmux/rtl/pinmux.sv#L414

~It causes downstream writes (which happen every cycle) to sleep_status CSRs, which then control the overrides.~

Edit: Ah, never mind, I swapped the D and DE bits of the relevant CSR when I was reading. I'll poke around when I return to the office.

a-will commented 7 months ago

Make sure you are setting DIO_PAD_SLEEP_EN to 1 for ec_rst_l's pad. I think that blurb about an AON peripheral needing it to be 0 is meant for peripherals that are supposed to control the pad while the chip is in deep sleep.

For this case, we specifically want the pinmux to take over until sysrst_ctrl is reconfigured.

EC_RST_L going low isn't a sign that the sleep mode is working. sysrst_ctrl will cause 0 to be driven on deep sleep entry without sleep mode enabled.

jettr commented 7 months ago

Make sure you are setting DIO_PAD_SLEEP_EN to 1 for ec_rst_l's pad. I think that blurb about an AON peripheral needing it to be 0 is meant for peripherals that are supposed to control the pad while the chip is in deep sleep.

For this case, we specifically want the pinmux to take over until sysrst_ctrl is reconfigured

Agreed, I did set DIOPAD_SLEEP_EN to 1 for these tests/captures.

andreaskurth commented 7 months ago

This needs to be fixed in either pinmux or sysrst_ctrl. We should also test it on the FPGA. Assigning to M4 with P1.

jettr commented 7 months ago

Note about fixing in pinmux vs sysrst_ctrl: We still want the key combo detection for EC_RST_L assertion to work while OT is in deep sleep. Having pinmux drive a value during deep sleep may prevent sysrst_ctrl from asserting EC_RST_L externally while OT is in deep sleep (and/or waking up from deep sleep since this keycombo will be a wake event).

a-will commented 7 months ago

Note about fixing in pinmux vs sysrst_ctrl: We still want the key combo detection for EC_RST_L assertion to work while OT is in deep sleep. Having pinmux drive a value during deep sleep may prevent sysrst_ctrl from asserting EC_RST_L externally while OT is in deep sleep (and/or waking up from deep sleep since this keycombo will be a wake event).

Ah, so you're saying that EC_RST_L does need to be able to assert while in deep sleep (just not from the stretching module?). Then the pinmux shouldn't be the one to take over.

The features of this module are a little confusing. I'll need to stop reading code on a cell phone. Too many mistakes, hehe.

I see that all the CSRs and logic (except alerts, the TL-UL interface, and the CSR reporting current pin values) are on the AON domain, so it's really curious what is getting reset during the wakeup.

a-will commented 7 months ago

Wait... if the pinmux loses mux selections after waking up from deep sleep (low-power-exit resets those CSRs), do we end up triggering key combos that cause EC_RST_L to assert?

sysrst_ctrl's CSRs should not reset, but its MIO-driven inputs would start getting driven low (since the input muxes would now be set to Tied-Zero after the reset).

@jettr Are we disabling all combos that aren't wakeup-related when entering deep sleep? (especially ones that could cause EC_RST_L to assert, since that's probably critical here, and there's a similar problem for anything that would trigger a reset request)

pamaury commented 7 months ago

After discussion with @a-will, we identified a potential trap that could cause your issue (see below). Can you tell us a bit more about the sysrst_ctrl configuration that you are using? In particular, what exactly are combos set up when you enter deep sleep?

Trap (this really needs to be documented): on deep sleep exit, the pinmux will reset all the MIO CSRs which will reset all the MIO configuration. If one enters deep sleep with a sysrst_ctrl combo active, for example "if key0 goes low then assert ec_rst_l" then that combo could trigger due to the MIO reset. Basically, the only combo that are safe to use in deep sleep mode are those that trigger a wake up. Nefore entering deep sleep, the code should disable combos that are "unsafe" and re-enable them on deep sleep exit after configuring the MIOs again. Is that a problem for your use case?

a-will commented 7 months ago

For the pinmux reset behavior, I might be jumping to conclusions here. We should confirm with @andreaskurth , as it's not clear to me when pinmux's CSRs that aren't on the AON clock get reset.

They're hooked up to an AON power domain LC reset sync'd to the io_div4 clock. I got confused when I didn't see the AON clock domain there in the hjson, but the clock domain is separate from the power and reset domains. It's possible these CSRs don't get reset on wakeup.

a-will commented 7 months ago

Okay, I checked waveforms, and the pinmux CSRs don't get reset. Sorry, false alarm!

So the behavior seen here is still not root caused. However, the behaviors seen above do still look like we're getting a full reset for some reason.

jettr commented 6 months ago

Both of the key combos we set up should be valid during deep sleep and should be used as wakeup sources.

Here is the configuration we are are using for they two key combos:


        self.registers.auto_block_debounce_ctl.write(AUTO_BLOCK_DEBOUNCE_CTL::AUTO_BLOCK_ENABLE::SET
                + AUTO_BLOCK_DEBOUNCE_CTL::DEBOUNCE_TIMER.val(us_to_aon_ticks_minus_one!(20)));
        self.registers.auto_block_out_ctl.write(AUTO_BLOCK_OUT_CTL::KEY1_OUT_SEL::SET + AUTO_BLOCK_OUT_CTL::KEY1_OUT_VALUE::CLEAR);
        self.registers.key_invert_ctl.write(KEY_INVERT_CTL::KEY1_IN::SET);
        self.registers.com_sel_ctl[0].write(COM_SEL_CTL::PWRB_IN_SEL_0::SET + COM_SEL_CTL::KEY0_IN_SEL_0::SET);
        self.registers.com_det_ctl[0].write(COM_DET_CTL::DETECTION_TIMER_0.val(ms_to_aon_ticks!(10)));
        self.registers.com_sel_ctl[1].write(COM_SEL_CTL::PWRB_IN_SEL_0::SET
                + COM_SEL_CTL::KEY0_IN_SEL_0::SET
                + COM_SEL_CTL::KEY2_IN_SEL_0::SET);
        self.registers.com_det_ctl[1].write(COM_DET_CTL::DETECTION_TIMER_0.val(ms_to_aon_ticks!(10_000)));

We also set self.registers.ulp_ctl.write(ULP_CTL::ULP_ENABLE::SET); before going into deep sleep.

pamaury commented 6 months ago

I am trying to reproduce the issue. So far I have been unsuccessful: in #22329 I modified the existing ulp_z3_wakeup test that exercises deep sleep and wakeup to verify the bahviour of the EC reset pin. On both the FPGA and silicon, the pin stays high on wakeup. I will continue to try to reproduce it.

jettr commented 6 months ago

I also do not see the issue when running bazel test --test_output=streamed --cache_test_results=no --//signing:token=//signing/tokens:cloud_kms_prodc //sw/device/tests:sysrst_ctrl_ulp_z3_wakeup_test_silicon_owner_prodc_rom_ext on the earlgrey_es_sival branch either.

The UART captures of the chip during the tests do look different, and I think it is a thread worth pulling on.

Our failing test

ti50_test

versus

UPL_Z3_WAKEUP passing test

upl_z3_wakeup

I noticed that our test framework outputs some giberish (hash?) characters on UART right before asserting EC_RST_L low:

InitialCharacter

The ulp_z3_wakeup test does not have that UART traffic at all and it never asserts EC_RESET_L upon wakeup. @cfrantz or @moidx do you know what is printing the UART traffic from the above captures upon deep sleep resume, and why doesn't the sysrst_ctrl_ulp_z3_wakeup_test_silicon_owner_prodc_rom_ext test on earlgrey_es_sival also print those characters out?

a-will commented 6 months ago

@jettr I don't think the chip is doing a deep sleep resume there. It looks like a full reset. There is probably something off with the configuration of the integration firmware (either the config is actually bugged or the chip's responses to it are).

Can you capture the reset reasons after the intended wake from deep sleep? These will be stored in the retention RAM.

It might also be good to decode those UART characters at the appropriate bit rate. It might tell us more. The logic analyzer's software there looks confused, like it's expecting something much slower than is coming out of the UART. Printing from the ROM often indicates boot failure for some reason. A line beginning with "BFV" would be an indicator (Boot Failure Value).

jettr commented 6 months ago

Thanks for the tip on the UART speed. I wasn't expecting the ROM uart to be 1mpbs. Decoding the UART message gives:

BFV:02495202
LCV:2318c631
VER:8522e05b

where the BFV decodes to Error(2, kModuleInterrupt, kUnknown) and /* The high-byte of kErrorInterrupt is ... the interrupt cause */

The LCV decodes to Prod (0b'10001 repeated multiple times), which seems normal/expected.

What interrupts cause has the value of 2 for kErrorInterrupt?

a-will commented 6 months ago

An mcause of 2 would be an Illegal Instruction exception--The MSB is preserved to distinguish between exceptions and interrupts, and 0 is an exception. That seems worrisome.

Unfortunately, you'd need mtval to find out the address of the failure, and I'm not sure if we have anything available for debugging that in the prod life cycle state.

moidx commented 6 months ago

Hi @jettr, @jesultra, is there are chance we can reproduce this issue with the FPGA to be connect a debugger?

a-will commented 6 months ago

Hm... Just a guess: The UART messages coming out at 1 mbaud suggests we didn't actually manage to reset and enter powerdown before hitting the ROM's exception vector. The UART CSRs should get reset as soon as pwrmgr starts shutting stuff off for powerdown.

Is the integration firmware providing a vector for exceptions? Is it also going into WFI, turning off the watchdog, etc.? Because it could be an interesting sequence of events where we end up in the ROM's exception vector, and it causes a full reset instead of the deep sleep -> wake -> partial reset sequence.

moidx commented 6 months ago

The ROM exception vector will cause an exception due to the ePMP configuration.

jesultra commented 6 months ago

Amaury wrote:

Since [sysrst_ctrl] is reset when coming of deep sleep, it seems to follow that the pin is asserted low as expected. I agree though that this might not be the desired behaviour.

This could be a problem. If a ChromeOS device has a special security function to be activated by holding down some key for 10 seconds, and then seven seconds after the user pressed the button, OT wakes up from deep sleep because of some scheduled wakeup, we still want sysrst_ctrl to continue counting the seconds, and fire some action three seconds later.

a-will commented 6 months ago

Amaury wrote:

Since [sysrst_ctrl] is reset when coming of deep sleep, it seems to follow that the pin is asserted low as expected. I agree though that this might not be the desired behaviour.

This could be a problem. If a ChromeOS device has a special security function to be activated by holding down some key for 10 seconds, and then seven seconds after the user pressed the button, OT wakes up from deep sleep because of some scheduled wakeup, we still want sysrst_ctrl to continue counting the seconds, and fire some action three seconds later.

Hehe, you're coming in a little late, and we were all confused. sysrst_ctrl is on the AON power domain, and all its resets are also on the AON reset domain. The core functions do not get reset when coming out of deep sleep.

There are some functions that are on a clock that is off during deep sleep, though, and I think those do get reset. However, those should be limited to the interrupt CSRs and status flops that are only useful to a powered CPU.

What's happening here is an exception that causes a full reset (before ever reaching deep sleep, by the looks of it).

jettr commented 6 months ago

is there are chance we can reproduce this issue with the FPGA to be connect a debugger

I can reproduce this on my cw310; this is my first time using cw310 so I haven't attached a debugger yet. Can you point me in the right direction?

What's happening here is an exception that causes a full reset (before ever reaching deep sleep, by the looks of it).

Why do you think there is an exception happening when going into deep sleep versus coming out of deep sleep? Deep sleep entry seems to be working from the fact that the power consumption is much lower than the normal sleep power consumption. The chip seems to be able to stay in this lower power consumption indefinitely until one of the wake up sources is toggled. At that point the chip reports the exception code.

Is the integration firmware providing a vector for exceptions?

We are writing to mtvec CSR with the 32 element look up pretty early in owner firmware.

Is it also going into WFI, turning off the watchdog, etc.?

Here is a high level idea of what we are doing to enter deep sleep:

fn enter_deep_sleep(&self) {
        // Wait until main UART HW FIFO is flushed
        flush_uart();
        // Disable interrupts to prepare for sleep, following the steps from
        // https://opentitan.org/book/hw/top_earlgrey/ip_autogen/pwrmgr/doc/programmers_guide.html
        CSR.mstatus.modify(mstatus::mie::CLEAR);

        PINMUX.clear_wkup_cause();
        PINMUX.enable_wakeup_pin(WakeupPin::PltRst, LpmWake::LevelHigh);
        PINMUX.enable_wakeup_pin(WakeupPin::LidOpen, LpmWake::EdgeRising);

        // EN_0=SYSRST_CTRL_AON_WKUP
        // EN_1=ADC_CTRL_AON_WKUP
        // EN_2=PINMUX_AON_PIN_WKUP
        self.registers.wakeup_en[0]
            .write(WAKEUP_EN::EN_0::SET + WAKEUP_EN::EN_1::SET + WAKEUP_EN::EN_2::SET);
        self.registers.wake_info.set(self.registers.wake_info.get());
        self.registers
            .wake_info_capture_dis
            .write(WAKE_INFO_CAPTURE_DIS::VAL::CLEAR);

        // No domain is enable, so deep sleep
        self.registers.control.set(0);
        self.registers.control.modify(CONTROL::LOW_POWER_HINT::SET);
        // Wait for wakeup_en and control writes.
        self.cfg_cdc_sync();
        // This WFI will attempt to sleep, because we set LOW_POWER_HINT.
        unsafe { rv32i::support::wfi() };
        // Does not return since in deep sleep
     }

As for watchdog, we are enabling the option to pause the watchdog during sleep

a-will commented 6 months ago

What's happening here is an exception that causes a full reset (before ever reaching deep sleep, by the looks of it).

Why do you think there is an exception happening when going into deep sleep versus coming out of deep sleep? Deep sleep entry seems to be working from the fact that the power consumption is much lower than the normal sleep power consumption. The chip seems to be able to stay in this lower power consumption indefinitely until one of the wake up sources is toggled. At that point the chip reports the exception code.

Once pwrmgr starts going down for deep sleep (after completing checks, like flash writes have completed), it asserts the non-AON resets. The UART's baud rate should get reset, but we're seeing ROM (or ROM_EXT?) code executed with the baud rate that is specific to this application firmware.

The ROM and ROM_EXT both configure the baud rate fairly early in the code, and it's for 115.2 kbaud : https://github.com/lowRISC/opentitan/blob/9f4903e77a8567a289e22d4c788822afb3a89cc6/sw/device/silicon_creator/rom/rom.c#L135-L136

Is it also going into WFI, turning off the watchdog, etc.?

Here is a high level idea of what we are doing to enter deep sleep:

fn enter_deep_sleep(&self) {
        // Wait until main UART HW FIFO is flushed
        flush_uart();
        // Disable interrupts to prepare for sleep, following the steps from
        // https://opentitan.org/book/hw/top_earlgrey/ip_autogen/pwrmgr/doc/programmers_guide.html
        CSR.mstatus.modify(mstatus::mie::CLEAR);

        PINMUX.clear_wkup_cause();
        PINMUX.enable_wakeup_pin(WakeupPin::PltRst, LpmWake::LevelHigh);
        PINMUX.enable_wakeup_pin(WakeupPin::LidOpen, LpmWake::EdgeRising);

        // EN_0=SYSRST_CTRL_AON_WKUP
        // EN_1=ADC_CTRL_AON_WKUP
        // EN_2=PINMUX_AON_PIN_WKUP
        self.registers.wakeup_en[0]
            .write(WAKEUP_EN::EN_0::SET + WAKEUP_EN::EN_1::SET + WAKEUP_EN::EN_2::SET);
        self.registers.wake_info.set(self.registers.wake_info.get());
        self.registers
            .wake_info_capture_dis
            .write(WAKE_INFO_CAPTURE_DIS::VAL::CLEAR);

        // No domain is enable, so deep sleep
        self.registers.control.set(0);
        self.registers.control.modify(CONTROL::LOW_POWER_HINT::SET);
        // Wait for wakeup_en and control writes.
        self.cfg_cdc_sync();
        // This WFI will attempt to sleep, because we set LOW_POWER_HINT.
        unsafe { rv32i::support::wfi() };
        // Does not return since in deep sleep
     }

As for watchdog, we are enabling the option to pause the watchdog during sleep

I've only glanced so far, but I'd place a loop around that WFI if I were you. Despite its name, WFI exit can happen for more than just interrupts, including a debug request (which isn't relevant here, but if you start debugging in the dev life cycle state, you'll be in for a surprise, hehe).

a-will commented 6 months ago
CSR.mstatus.modify(mstatus::mie::CLEAR);

Also, the code above only clears mstatus.mie, which merely disables interrupt handling. Any interrupt that comes in will still cause the CPU to exit from WFI.

If you wanted to prevent interrupts from causing a WFI exit, you'd need to clear the mie CSR (which is different from mstatus.mie).

jettr commented 6 months ago

the UART's baud rate should get reset, but we're seeing ROM (or ROM_EXT?) code executed with the baud rate that is specific to this application firmware

The application specific (i.e. ChromeOS) FW baud is 115200 just like ROM and ROM_EXT; I have no idea where the 1mpbs speed is coming from.

Also when I replicate this on the FPGA, the baud rate of the BFV print is actually 115200 (versus the 1mpbs on actual silicon).

I've only glanced so far, but I'd place a loop around that WFI if I were you. Despite its name, WFI exit can happen for more than just interrupts, including a debug request (which isn't relevant here, but if you start debugging in the dev life cycle state, you'll be in for a surprise, hehe).

Thanks for the helpful tip :) We actually do have other code after that that handles the case when the chip doesn't go to sleep; I just didn't include it for simplicity.

jettr commented 6 months ago
CSR.mstatus.modify(mstatus::mie::CLEAR);

Also, the code above only clears mstatus.mie, which merely disables interrupt handling. Any interrupt that comes in will still cause the CPU to exit from WFI.

If you wanted to prevent interrupts from causing a WFI exit, you'd need to clear the mie CSR (which is different from mstatus.mie).

Agreed to all of this :) We have other code that handles this too. We don't seem to be having a problem entering or staying in deep sleep on this issue though.

a-will commented 6 months ago

the UART's baud rate should get reset, but we're seeing ROM (or ROM_EXT?) code executed with the baud rate that is specific to this application firmware

The application specific (i.e. ChromeOS) FW baud is 115200 just like ROM and ROM_EXT; I have no idea where the 1mpbs speed is coming from.

Also when I replicate this on the FPGA, the baud rate of the BFV print is actually 115200 (versus the 1mpbs on actual silicon).

Ahh, so it gets more exciting, lol. Okay, hm.... Maybe our observations would be consistent with an exception that occurs prior to uart_init() in the ROM (but after wakeup?).

I'll check what the actual default values are for the UART IP out of reset.

moidx commented 6 months ago

Also when I replicate this on the FPGA, the baud rate of the BFV print is actually 115200 (versus the 1mpbs on actual silicon).

The ROM and ROM_EXT compiled for FPGA use 115200. The ROM uses 1M in Z1, but we believe this to be a bug in the ROM.

a-will commented 6 months ago

Also when I replicate this on the FPGA, the baud rate of the BFV print is actually 115200 (versus the 1mpbs on actual silicon).

The ROM and ROM_EXT compiled for FPGA use 115200. The ROM uses 1M in Z1, but we believe this to be a bug in the ROM.

Oh man, that fact changes everything, lol. Okay, that means we woke up from deep sleep, and rom_init() did execute. So this does look like something bad happened in ROM.

jesultra commented 6 months ago

Using JTAG debugging, I have observed this exception happening in ROM as the chip came out of deep sleep. It is worth noting that Ti50 would have configured the watchdog, and used REGWEN to prevent further changes. I do not know if attempting to write to a register which is locked through REGWEN is ignored, or generates an exception. In the latter case, that certainly explains it.

So the question is, should ROM attempt to do anything with watchdog, when coming out of deep sleep? Presumably it was set up on the original cold boot. In fact any IP on the AON domain would have retained its configuration, and likely would need special treatment in ROM to distinguish (if ROM touches any other such IPs.)

Breakpoint 1, rom_irq_error () at sw/device/silicon_creator/rom/rom.c:92
92    CSR_READ(CSR_REG_MCAUSE, &mcause);
(gdb) bt
#0  rom_irq_error () at sw/device/silicon_creator/rom/rom.c:92
#1  rom_interrupt_handler () at sw/device/silicon_creator/rom/rom.c:579
#2  0x0000eac6 in sec_mmio_write32 (addr=1078394916, value=<optimized out>)
    at sw/device/silicon_creator/lib/base/sec_mmio.c:103
#3  0x0000bd94 in watchdog_configure (config=...)
    at sw/device/silicon_creator/lib/drivers/watchdog.c:87
#4  0x0000bd48 in watchdog_init (lc_state=<optimized out>)
    at sw/device/silicon_creator/lib/drivers/watchdog.c:64
#5  0x000088d6 in rom_init () at sw/device/silicon_creator/rom/rom.c:160
#6  rom_main () at sw/device/silicon_creator/rom/rom.c:561
#7  0x0000841c in _rom_start_boot () at sw/device/silicon_creator/rom/rom_start.S:460
Backtrace stopped: frame did not save the PC
(gdb) frame 2
#2  0x0000eac6 in sec_mmio_write32 (addr=1078394916, value=<optimized out>)
    at sw/device/silicon_creator/lib/base/sec_mmio.c:103
103   upsert_register(addr, masked_value);
(gdb) list
98  
99  void sec_mmio_write32(uint32_t addr, uint32_t value) {
100   abs_mmio_write32(addr, value);
101   uint32_t masked_value = value ^ kSecMmioMaskVal;
102   barrier32(masked_value);
103   upsert_register(addr, masked_value);
104   HARDENED_CHECK_EQ((abs_mmio_read32(addr) ^ kSecMmioMaskVal), masked_value);
105 
106   ++sec_mmio_ctx.write_count;
107 }
(gdb) up
#3  0x0000bd94 in watchdog_configure (config=...)
    at sw/device/silicon_creator/lib/drivers/watchdog.c:87
87    sec_mmio_write32(kBase + AON_TIMER_WDOG_BITE_THOLD_REG_OFFSET,
(gdb) list
82    // Set the watchdog bite and bark thresholds.
83    sec_mmio_write32(kBase + AON_TIMER_WDOG_CTRL_REG_OFFSET, kCtrlDisable);
84    abs_mmio_write32(kBase + AON_TIMER_WDOG_COUNT_REG_OFFSET, 0);
85    abs_mmio_write32(kBase + AON_TIMER_WDOG_BARK_THOLD_REG_OFFSET,
86                     config.bark_threshold);
87    sec_mmio_write32(kBase + AON_TIMER_WDOG_BITE_THOLD_REG_OFFSET,
88                     config.bite_threshold);
89  
90    // Enable or disable the watchdog as requested.
91    uint32_t ctrl = kCtrlEnable;
jesultra commented 6 months ago

With my draft PR #22593, I see that OpenTitan wakes from deep sleep without the glitch on the EC_RST signal.

jesultra commented 6 months ago

Thinking about it. We may need to review other parts of ROM (and ROM_EXT) logic, with particular attention to whether they should run when waking up from deep sleep, or only in hard reset / poweron cases.

Pinmux for instance could have been locked by the owner code, and the reading of SW_BOOTSTRAP pins probably should not take place when coming out of deep sleep. Both because it is not necessary to have an option to go into the bootloader on leaving deep sleep, a programmer would assert the external reset for that, and because it could be that owner has set a weak pullup/pulldown on one of more of the bootstrap pins, and then at runtime switces them to be push/pull driven by OT, used to control something.

jettr commented 6 months ago

I was/am going to file an issue to audit exactly what @jesultra is saying. :) The ROM and ROM_EXT should take special care when interacting with any peripheral that is on the AON domain when coming out of deep sleep as the owner firmware may have already locked the registers.

I can also confirm that if I remove the setting the REGEN to 0 for the watchdog register in our owner firmware, the exception (and also the EC_RESET_L assertion) no longer occur!

a-will commented 6 months ago

Here are some of them (at least what I saw in the sw/device/silicon_creator directory):

jettr commented 6 months ago

Thank you Alex for providing the first pass audit. I have opened #22614 to track the full audit and fix for ROM/ROM_EXT code.

andreaskurth commented 6 months ago

Thanks for the in-depth and collaborative debugging to all of you! IIUC no further RTL-side actions are required here, so I'm removing the RTL and IP labels and am associating this with ROM. Please tag me if you disagree.

jesultra commented 5 months ago

We have found the immediate cause of the issue originally reported here, and we have #22614 tracking a review of other potential sources of similar issues. So I think we can close this one.