Looking at the number of times a pval check is made, with alpha=1e-4, the probability of getting at least one failure across a run is about 25% (under the null hypothesis). This is pretty high - it's no wonder we're seeing flakiness.
I've not figured out the exact Bonferroni-type corrections, but reducing this to alpha=1e-5 gives a probability of getting a failure of about 3% (under the null hypothesis), which is still pretty high but probably tolerable. I've usually seen genuine failures giving pvals of 1e-180 or similar, so I don't think false negatives should be a serious issue.
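The arithmetic behind those numbers can be sketched as a family-wise error rate calculation. The number of checks per run is an assumption here (roughly 2900, chosen so that alpha=1e-4 reproduces the quoted ~25%); the real count would come from the test suite itself.

```python
# Family-wise error rate: probability that at least one of n independent
# p-value checks fails under the null hypothesis at significance alpha.
def fwer(alpha: float, n: int) -> float:
    return 1 - (1 - alpha) ** n

N_CHECKS = 2900  # hypothetical count, picked to match the ~25% figure

print(f"alpha=1e-4: {fwer(1e-4, N_CHECKS):.1%}")  # about 25%
print(f"alpha=1e-5: {fwer(1e-5, N_CHECKS):.1%}")  # about 3%

# A full Bonferroni correction would instead divide a target FWER by
# the number of checks, e.g. for a 1% overall failure rate:
print(f"Bonferroni alpha: {0.01 / N_CHECKS:.2e}")
```

This assumes the checks are independent; Bonferroni holds without that assumption but is more conservative, which is consistent with alpha=1e-5 still leaving a few percent of runs flaky.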
I've also removed setting the pool size, as I think that's unnecessarily slowing things down.
Fixes the flakiness by decreasing alpha.