Closed melonwater211 closed 2 years ago
Hi @melonwater211, thanks so much for the thorough analysis! We like the tests to be deterministic (precisely to remove flakes, so that when a test fails, we know that it's because something changed). But it's also quite desirable to have the test run with settings that should pass in nearly all settings. Let's move forward with the alteration you proposed that reduced the flakiness to ~0%: increasing n_epochs
from 10 to 20. Would you mind submitting a PR and tagging me to review?
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Introduction
Several tests, including
test_convergence
intest/classification/test_classifier_convergence.py
, seem to be flaky when all seed-setting code (e.g.np.random.seed(0)
ortf.random.set_seed(0)
) is commented out.For instance, in commit 1131d9b93798e145b2bc385fa65ff50fbc3c6d0a,
test_convergence
failed ~39% of the time (out of 500 runs) compared to 0% of the time (out of 500 runs) when no seed-setting code is removed.test_convergence
tests the convergence of the multitask classifier.Motivation
Some tests can be flaky with high failure rates, but are not discovered when the seeds are set. We are trying to stabilize such tests.
Environment
The tests were run using
pytest 6.2.4
in aconda
environment withPython 3.6.13
. The OS used wasUbuntu 16.04
.Possible Solutions
One possible solution to reduce flakiness is to change the parameters used for prediction. We tried changing the following parameters.
Increasing
lr
from 0.001 to 0.005 and 0.01 reduced the flakiness to ~8% and ~12% respectively.Increasing
lr
from 0.001 to 0.002 reduced the flakiness to ~2%.Increasing
lr
did not increase runtimes significantly.Increasing
n_epochs
from 10 to 15 reduced the flakiness to ~5%.This change increased runtimes by a factor of ~1.3.
Increasing
n_epochs
from 10 to 20 reduced the flakiness to ~0%.This change increased runtimes by a factor of ~1.8.
Another possible solution is to change the values used in the assertions.
Increasing the values used in both assertions (i.e. the assertions on lines 68 and 75) from 0.05 to 0.06 reduced flakiness to ~4%.
Increasing the values used in the assertions did not increase runtimes significantly.
Please let us know if these solutions are feasible or if there are any other solutions that should be incorporated. If you are interested, we can send the details of other tests demonstrating similar behavior. We will be happy to raise a Pull Request to fix the tests and incorporate any feedback that you may have.