debug: fix sporadic failures of memory sampling tests

aap-sc commented 6 months ago

Memory sampling tests fail sporadically for spike targets. A typical failure looks as follows (relevant excerpt from the test log):

---------------------------------[ Message ]----------------------------------
139670831 not less than 124104544
--------------------------------[ Traceback ]---------------------------------
    ... SECTION IS SKIPPED FOR READABILITY ...
    raise TestFailed(f"{a!r} not less than {b!r}", comment)
testlib.TestFailed

A few observations:

139670831 is 0x0853352f in hex, while 124104544 is 0x0765af60
The failing assert corresponds to the following expression:

  assertLess(value, previous_value + tolerance)

tolerance is 0x500000, so previous_value is 124104544 - 0x500000 = 0x0715af60.
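The arithmetic above can be double-checked with a few lines of Python. This is only a sketch for verifying the numbers: the variable names `limit` and `jump` are introduced here for illustration and do not come from the test code.

```python
# Numbers copied from the failure message and the description above.
value = 139670831        # left-hand side of the failed assertLess
limit = 124104544        # right-hand side: previous_value + tolerance
tolerance = 0x500000     # tolerance quoted above

# Recover the previous sample from the reported limit.
previous_value = limit - tolerance

# Growth of the counter between the two consecutive samples.
jump = value - previous_value

print(f"value          = 0x{value:08x}")
print(f"limit          = 0x{limit:08x}")
print(f"previous_value = 0x{previous_value:08x}")
print(f"jump           = 0x{jump:08x} (tolerance is 0x{tolerance:08x})")
print("assertLess holds:", value < limit)
```

The counter grew by considerably more than the tolerance between these two samples, which is why the assert fires.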

If we look at the sampling output for such a failing test, we see:

...
0x1212340c5c: 0x0715af60
timestamp after: 878087500
timestamp before: 878088133
0x1212340c5c: 0x0853352f
...

The log above demonstrates the reason for the failure: the counter went from 0x0715af60 to 0x0853352f between two consecutive samples. Since memory sampling occurs on every poll (which by default happens approximately every 100 ms), the counter may exceed the threshold whenever the time between subsequent polls increases, for whatever reason.

In my opinion the failing assert can be safely removed, since the check it performs is quite brittle and cannot be generalized. Whether the assert is violated depends on CPU performance and on sporadic delays between polls.

For now, instead of removing the assert, we just skip the check in between memory sample bursts. This way we can still be certain that memory samples within a burst are frequent enough, and hopefully this avoids the sporadic failures.
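The idea can be sketched roughly as follows. This is not the actual patch: the function `find_violations` and the representation of the samples as a list of bursts are hypothetical, and the sketch assumes that the timestamp lines in the log mark the boundary between two bursts.

```python
from typing import List, Tuple

def find_violations(bursts: List[List[int]],
                    tolerance: int = 0x500000) -> List[Tuple[int, int]]:
    """Return (previous, value) pairs that break the tolerance check.

    Only consecutive samples inside the same burst are compared; the
    last sample of one burst is never compared with the first sample
    of the next one.
    """
    violations = []
    for burst in bursts:
        for previous, value in zip(burst, burst[1:]):
            if not value < previous + tolerance:
                violations.append((previous, value))
    return violations

# The two samples from the log above, assumed to belong to different bursts.
across_bursts = [[0x0715af60], [0x0853352f]]
# The same two samples treated as one burst, as the old check effectively did.
single_burst = [[0x0715af60, 0x0853352f]]

print(find_violations(across_bursts))
print(find_violations(single_burst))
```

With the burst boundary respected, the pair from the failing log is no longer compared, while samples inside a single burst are still checked against the tolerance.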

aap-sc commented 6 months ago

@TommyMurphyTM1234 FYI

aswaterman commented 6 months ago

FWIW, I’d favor deletion of any timing-dependent assertions.

riscv-software-src / riscv-tests

debug: fix sporadic failures of memory sampling tests #556