snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

test_convergence is flaky when all seed-setting code is removed #1654

Closed melonwater211 closed 2 years ago

melonwater211 commented 3 years ago

Introduction

Several tests, including test_convergence in test/classification/test_classifier_convergence.py, become flaky when all seed-setting code (e.g., np.random.seed(0) or tf.random.set_seed(0)) is commented out.
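
For reference, the seed-setting code in question looks roughly like the following (a minimal sketch; the exact calls vary from test to test, and random.seed / torch.manual_seed are shown here as common companions rather than quoted from the tests):

```python
import random

import numpy as np
import tensorflow as tf
import torch

# Representative seed-setting boilerplate; commenting out lines like
# these is what exposes the flakiness described below.
random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)
torch.manual_seed(0)  # the multitask classifier is PyTorch-based
```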

For instance, at commit 1131d9b93798e145b2bc385fa65ff50fbc3c6d0a, test_convergence failed in ~39% of 500 runs with the seed-setting code removed, compared to 0% of 500 runs with the seed-setting code left in place.
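
A failure rate like the one above can be estimated with a small harness along these lines (a sketch only; the pytest invocation and test selection are illustrative, not the exact setup we used):

```python
import subprocess

# Estimate the failure rate by running the target test repeatedly and
# counting non-zero pytest exit codes. 500 runs matches the sample size
# quoted above; selecting the test with -k is illustrative.
N_RUNS = 500
CMD = [
    "pytest", "-q",
    "test/classification/test_classifier_convergence.py",
    "-k", "test_convergence",
]

failures = sum(
    subprocess.run(
        CMD, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode != 0
    for _ in range(N_RUNS)
)
print(f"Failed {failures}/{N_RUNS} runs ({failures / N_RUNS:.1%})")
```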

test_convergence tests the convergence of the multitask classifier.

Motivation

Some tests have high underlying failure rates, but this flakiness goes undetected while the seeds are set. We are trying to identify and stabilize such tests.

Environment

The tests were run using pytest 6.2.4 in a conda environment with Python 3.6.13. The OS used was Ubuntu 16.04.

Possible Solutions

One possible solution to reduce flakiness is to change the training parameters used by the test. We tried the following changes (see the sketch after this list):

- Increasing lr from 0.001 to 0.002 reduced the flakiness to ~2%; increasing it further to 0.005 and 0.01 reduced the flakiness to ~8% and ~12%, respectively. None of these changes increased runtimes significantly.

- Increasing n_epochs from 10 to 15 reduced the flakiness to ~5% and increased runtimes by a factor of ~1.3.

- Increasing n_epochs from 10 to 20 reduced the flakiness to ~0% and increased runtimes by a factor of ~1.8.
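
For concreteness, here is a minimal sketch of the n_epochs change, assuming the test drives training through snorkel.classification.Trainer (the model and dataloaders come from the test's existing setup and are elided; only the parameter values are taken from the measurements above):

```python
from snorkel.classification import Trainer


def train_for_convergence(model, dataloaders):
    # Proposed change: n_epochs raised from 10 to 20 (flakiness ~0%,
    # runtime ~1.8x); lr stays at its current value of 0.001.
    trainer = Trainer(n_epochs=20, lr=0.001)
    trainer.fit(model, dataloaders)
    return model
```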

Another possible solution is to loosen the thresholds used in the assertions (sketched below):

- Increasing the threshold in both assertions (i.e., the assertions on lines 68 and 75) from 0.05 to 0.06 reduced the flakiness to ~4%, without increasing runtimes significantly.
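
The assertion change would look roughly like the following (a hypothetical reconstruction; the real assertions on lines 68 and 75 may compare different quantities, and only the 0.05 → 0.06 bound is taken from the measurement above):

```python
import unittest


class ConvergenceSketch(unittest.TestCase):
    # Illustrative only: the names and compared quantities are made up;
    # the point is the relaxed tolerance (0.06 instead of 0.05).
    def assert_close_to_gold(self, score, gold_score):
        self.assertLess(abs(score - gold_score), 0.06)
```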

Please let us know whether these solutions are feasible or whether other approaches would be preferable. If you are interested, we can share the details of other tests exhibiting similar behavior. We would be happy to open a pull request to fix the tests and incorporate any feedback you may have.

bhancock8 commented 3 years ago

Hi @melonwater211, thanks so much for the thorough analysis! We like the tests to be deterministic (precisely to remove flakes, so that when a test fails, we know it's because something changed). But it's also quite desirable to run the tests with settings that would pass under nearly any seed. Let's move forward with the change you proposed that reduced the flakiness to ~0%: increasing n_epochs from 10 to 20. Would you mind submitting a PR and tagging me to review?

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.