lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0

[test-triage] OTBN timeouts with multiple tests #23506

Closed: martin-velay closed this issue 1 month ago

martin-velay commented 3 months ago

### Hierarchy of regression failure

Block level

### Failure Description

### Steps to Reproduce

util/dvsim/dvsim.py hw/ip/otbn/dv/uvm/otbn_sim_cfg.hjson -i otbn_ctrl_redun -t xcelium --fixed-seed 42265330512524754647320219056712357229922074576056242959858639090007540897643



### Tests with similar or related failures

These failures may share the same root cause, possibly introduced by recent changes:
- [ ] otbn_rf_base_intg_err
- [ ] otbn_ctrl_redun
hayleynewton commented 3 months ago

otbn_rf_base_intg_err: back to 100%
otbn_ctrl_redun: 91.7%

rswarbrick commented 2 months ago

I suspect that these are different issues.

Firstly, let's consider the otbn_rf_base_intg_err message. The test tries to inject a one- or two-bit error into a register read. To do this, it waits until a cycle when an instruction is reading from the given side of the register file, and then uses a force statement to bodge the value.
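As a rough illustration of this style of error injection (the hierarchical paths, signal names, and task name below are hypothetical, not taken from the OTBN testbench):

```systemverilog
// Hypothetical sketch of force-based error injection into a register
// file read. All names here are illustrative only.
task automatic inject_rf_read_err();
  // Wait for a cycle in which the chosen register file read port is active.
  wait (u_otbn_core.u_rf_base.rd_en_i);
  // Flip a single bit of the read data with force, so the consuming logic
  // sees a value whose integrity check should fail.
  force u_otbn_core.u_rf_base.rd_data_o =
      u_otbn_core.u_rf_base.rd_data_o ^ (32'd1 << $urandom_range(31));
  @(posedge clk);
  // Release so normal operation resumes on subsequent reads.
  release u_otbn_core.u_rf_base.rd_data_o;
endtask
```

The force/release pair is the standard SystemVerilog way to temporarily override a signal from a testbench without modifying the design.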

The message is saying that we've waited ages and haven't actually seen any instructions coming through that read from the register file. Here, "ages" is defined as a time in clock cycles (currently 20,000). My guess is that we're actually blocked, waiting for a seed from the EDN, so aren't running any instructions at all in that time period.

There's a reasonably obvious solution: tweak the vseq so that it times out when the operation finishes. If the device completely hangs then the test will (eventually) time out from a phase timeout. I've just pushed #23582 which should implement this.
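One way such a bounded wait could be structured (a sketch only; the names are illustrative and this is not the code from the PR mentioned above):

```systemverilog
// Hypothetical sketch: wait for a register file read, but also give up as
// soon as the OTBN operation finishes, instead of relying on a fixed
// 20,000-cycle limit alone.
fork
  begin : isolation_fork
    fork
      wait_for_rf_read();            // the existing wait for a read cycle
      wait (status_q == StatusIdle); // operation finished: stop waiting
    join_any
    disable fork;                    // kill whichever branch is still running
  end
join
```

The isolating outer fork ensures `disable fork` only kills the two racing branches, not unrelated processes spawned elsewhere in the sequence.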

rswarbrick commented 2 months ago

For the second test, it looks like we sometimes fail to find a "good time", and a re-run can perturb things so that we do. A local run with that change takes the failure rate from roughly 4/50 to 1/50 (with the remaining failure showing a different behaviour), so I think this triage issue should be solved by that change: #23583.

rswarbrick commented 2 months ago

Removing the M4 milestone association. I think this will be fixed by the two PRs mentioned above, and am also certain that the issue is unrelated to the M4 exit criteria.

martin-velay commented 2 months ago

Thanks Rupert!

rswarbrick commented 1 month ago

The first vseq passed at 100% over the last 9 nights. The second test passes at a reasonable rate and the sporadic failures that I see when running locally aren't the same sort of timeout as described above.

Closing this issue because I think the problem that it describes is fixed.