lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0

[test-triage] OTBN timeouts with multiple tests #23506

Closed: martin-velay closed this issue 1 month ago

martin-velay commented 3 months ago

### Hierarchy of regression failure

Block level

### Failure Description

### Steps to Reproduce

util/dvsim/dvsim.py hw/ip/otbn/dv/uvm/otbn_sim_cfg.hjson -i otbn_ctrl_redun -t xcelium --fixed-seed 42265330512524754647320219056712357229922074576056242959858639090007540897643



### Tests with similar or related failures

These failures may share the same root cause, possibly introduced by recent changes:
- [ ] otbn_rf_base_intg_err
- [ ] otbn_ctrl_redun
hayleynewton commented 3 months ago

otbn_rf_base_intg_err: back to 100%
otbn_ctrl_redun: 91.7%

rswarbrick commented 2 months ago

I suspect that these are different issues.

Firstly, let's consider the otbn_rf_base_intg_err message. The test tries to inject a one- or two-bit error into a register read. To do this, it waits until a cycle when an instruction is reading from the given side of the register file, and then uses a force statement to bodge the value.
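As a rough illustration of this style of error injection (the hierarchical paths, signal names, and task name below are hypothetical, not taken from the OTBN testbench):

```systemverilog
// Hypothetical sketch of force-based error injection into a register
// file read. All names here are illustrative only.
task automatic inject_rf_read_err();
  // Wait for a cycle in which the chosen register file read port is active.
  wait (u_otbn_core.u_rf_base.rd_en_i);
  // Flip a single bit of the read data with force, so the consuming logic
  // sees a value whose integrity check should fail.
  force u_otbn_core.u_rf_base.rd_data_o =
      u_otbn_core.u_rf_base.rd_data_o ^ (32'd1 << $urandom_range(31));
  @(posedge clk);
  // Release so normal operation resumes on subsequent reads.
  release u_otbn_core.u_rf_base.rd_data_o;
endtask
```

The force/release pair is the standard SystemVerilog way to temporarily override a signal from a testbench without modifying the design.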

The message is saying that we've waited ages and haven't actually seen any instructions coming through that read from the register file. Here, "ages" is defined as a time in clock cycles (currently 20,000). My guess is that we're actually blocked, waiting for a seed from the EDN, so aren't running any instructions at all in that time period.

There's a reasonably obvious solution: tweak the vseq so that it times out when the operation finishes. If the device completely hangs then the test will (eventually) time out from a phase timeout. I've just pushed #23582 which should implement this.
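One way such a bounded wait could be structured (a sketch only; the names are illustrative and this is not the code from the PR mentioned above):

```systemverilog
// Hypothetical sketch: wait for a register file read, but also give up as
// soon as the OTBN operation finishes, instead of relying on a fixed
// 20,000-cycle limit alone.
fork
  begin : isolation_fork
    fork
      wait_for_rf_read();            // the existing wait for a read cycle
      wait (status_q == StatusIdle); // operation finished: stop waiting
    join_any
    disable fork;                    // kill whichever branch is still running
  end
join
```

The isolating outer fork ensures `disable fork` only kills the two racing branches, not unrelated processes spawned elsewhere in the sequence.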

rswarbrick commented 2 months ago

For the second test, it looks like we sometimes fail to find a "good time", and a re-run can perturb things so that we do. A local run with that change takes the failure rate from roughly 4/50 to 1/50 (with the remaining failure showing a different behaviour), so I think this triage issue should be solved by that change: #23583.

rswarbrick commented 2 months ago

Removing the M4 milestone association. I think this will be fixed by the two PRs mentioned above, and am also certain that the issue is unrelated to the M4 exit criteria.

martin-velay commented 2 months ago

Thanks Rupert!

rswarbrick commented 1 month ago

The first vseq passed at 100% over the last 9 nights. The second test passes at a reasonable rate and the sporadic failures that I see when running locally aren't the same sort of timeout as described above.

Closing this issue because I think the problem that it describes is fixed.