Flaky test in `rem/tests/test_inverse_confusion_matrix.py`

purva-thakre commented 4 months ago

Noticed this failure twice in a CI run for Windows. If I rerun the test, it passes.

https://github.com/unitaryfund/mitiq/actions/runs/9748119578/job/26902405324?pr=2347#step:6:4326

purva-thakre commented 4 months ago

Noticed the same failure for windows-python 3.11

https://github.com/unitaryfund/mitiq/actions/runs/9766045422/job/26958119314?pr=2432#step:6:4325

purva-thakre commented 4 months ago

The same failure on windows-python3.12

https://github.com/unitaryfund/mitiq/actions/runs/9765921429/job/26957722256#step:6:4326

purva-thakre commented 4 months ago

I am unassigning this for myself because I cannot replicate this locally. My laptop does not have enough memory for local docker containers.

natestemen commented 3 months ago

Looks like this is also happening on ubuntu: https://github.com/unitaryfund/mitiq/actions/runs/10267476608/job/28408109210?pr=2452#step:6:4486

natestemen commented 3 months ago

Took a look at this in the mitiq coding call today and made some progress.

Problem

https://github.com/unitaryfund/mitiq/blob/56ee1731514aa19ea60e86d41e8d7609bdc963c3/mitiq/rem/tests/test_inverse_confusion_matrix.py#L32-L33

This test (in essence)

generates a random sequence of 1000 0s and 1s
sums that sequence (as ints)
Asserts that the resulting sum is between 450 and 550

Some stats tell us the sum should fall in this range 99.94% of the time, meaning that on average we should see a value outside this range every 1,667 runs.

Solutions

We considered the following solutions

Removing the test. Since this is non-deterministic behavior and the test is mostly testing to np.random.choice, it is not totally needed.
Increasing the valid range of values (say it must fall between 400 and 600).
Setting a seed to make the sum deterministic.

Solution 1 was determined to be unnecessary since we still want this code's functionality to be tested with cases other than the rudimentary ones tested in this same test. Solution 2 was determined to not be a complete fix, as it is still possible for the test to fail, albeit much less likely. Solution 3 was deemed the best solution.

Diagnosis

We ran the failing test on my machine (macos) on repeat 2000 times using the approach outlined here. We found it failed roughly at that cadence (1/2000 runs) which aligns with the math in the section above.

natestemen commented 3 months ago

And just so no one else clashes here, I have a PR incoming.

nathanshammah commented 3 months ago

Great investigation!

unitaryfund / mitiq