test "Prediction accuracy for minority class increases with higher weight" is flaky

MichaelChirico commented 1 week ago

pkgload::load_all()
# Run the test file 100 times; report the share of output lines containing a failure marker ("F")
mean(grepl("F", capture.output({
  for (ii in 1:100) testthat::test_file(
    "tests/testthat/test_classweights.R",
    reporter = testthat::MinimalReporter)
})))
# [1] 0.03

i.e. it fails about 3% of the time. The test that fails is this one:

https://github.com/imbs-hl/ranger/blob/6e5d6ccaaf47d04a32f45bedd17c782528732a20/tests/testthat/test_classweights.R#L26

And the failure reads:

── Failure (test_classweights.R:26:3): Prediction accuracy for minority class increases with higher weight ──
`acc_minor_weighted` is not strictly more than `acc_minor`. Difference: 0

Presumably it's some tiny numeric difference being observed (it would be nice if {testthat} helped us here; right now the reported difference is strictly limited to 3 digits: https://github.com/r-lib/testthat/issues/2006).
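
For context on the 3% figure: assuming it corresponds to 3 failing runs out of 100 (an assumption; the snippet in this comment counts output lines containing "F"), an exact binomial interval gives a rough sense of how uncertain that estimate is. This is a base-R sketch, not part of the original report.

```r
# Assumption: 3 failures observed in 100 independent runs of the test file.
failures <- 3
runs <- 100

# binom.test() returns an exact (Clopper-Pearson) 95% confidence interval
ci <- binom.test(failures, runs)$conf.int
print(round(ci, 3))
```

With only 100 runs the interval is wide, so the true flake rate could plausibly be well below or well above 3%.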

mnwright commented 1 week ago

Thanks! Such tests are always a little bit dangerous (but useful).

I'll increase the sample size and the number of trees; that should help.
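
A rough illustration of why a larger sample should help, as a base-R sketch under simplifying assumptions: each accuracy is modeled as an independent binomial proportion over `n` minority-class observations, and the true accuracies `p_plain` and `p_weighted` are made-up values. It does not fit any forest and does not model the number of trees; in the real test the two accuracies are computed on the same data and are therefore correlated.

```r
set.seed(1)

# Hypothetical helper: share of simulated runs in which the strict
# "weighted > unweighted" comparison would fail.
flake_rate <- function(n, p_plain = 0.60, p_weighted = 0.75, reps = 10000) {
  acc_plain    <- rbinom(reps, n, p_plain) / n     # simulated unweighted accuracy
  acc_weighted <- rbinom(reps, n, p_weighted) / n  # simulated weighted accuracy
  # A strict greater-than check fails on ties as well as on reversals
  mean(acc_weighted <= acc_plain)
}

sizes <- c(20, 50, 100, 500)
rates <- sapply(sizes, flake_rate)
names(rates) <- paste0("n=", sizes)
print(rates)
```

Under these assumptions the failure share shrinks as `n` grows, because both accuracy estimates become less noisy and exact ties become rarer.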

MichaelChirico commented 1 week ago

It's tough to know the right level of tolerable flakiness. IMO 3% is definitely too high (except maybe if it's really costly to increase the precision, but then I would hide such tests from CRAN).

Thanks for addressing this!
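
One way to hide a statistical test from CRAN, as suggested above, is `testthat::skip_on_cran()`. The sketch below uses a hypothetical test name and toy data; it is not the ranger test itself.

```r
library(testthat)

# Hypothetical example test; the name and data are illustrative only.
test_that("statistical check is skipped on CRAN", {
  # Skips the rest of the test on CRAN; testthat decides this from the
  # NOT_CRAN environment variable (set by devtools and many CI setups).
  skip_on_cran()

  set.seed(42)
  x <- rnorm(1000, mean = 1)
  expect_gt(mean(x), 0)
})
```

The test still runs locally and in CI when `NOT_CRAN` is `"true"`, so the check stays useful without risking a sporadic failure on CRAN.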

test "Prediction accuracy for minority class increases with higher weight" is flaky #747