signaux-faibles / predictsignauxfaibles

Repository of the Python code used to produce the Signaux Faibles prediction lists.
MIT License

Testing canonical model run #49

Closed · slebastard closed this issue 3 years ago

slebastard commented 3 years ago

We should set up an e2e test that runs the default model, training on, say, 1,000 examples and testing and predicting on similar volumes. What do you think?

slebastard commented 3 years ago

Maybe we can try to set ENV="test" within our tests, make our train/test/predict datasets extremely small (e.g. 1,000 entries) when ENV == "test", and have a couple of tests such as:

  • testing whether python3 -m predictsignauxfaibles runs and outputs a prediction-YYYYMMDD-hhmmss.csv and a stats-YYYYMMDD-hhmmss.json in model_runs/YYYYMMDD-hhmmss (a directory created by the test run) that we would remove afterwards
  • checking that the default values in stats-YYYYMMDD-hhmmss.json are correct
  • checking that optional arguments are correctly taken into account and end up in stats-YYYYMMDD-hhmmss.json: model_name, train_from, train_to, train_sample, etc.
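
A minimal sketch of what that ENV switch could look like, assuming a dedicated config module that reads the variable at import time; the module and variable names (config.py, TRAIN_SAMPLE_SIZE, ...) are illustrative assumptions, not the package's actual configuration:

```python
# config.py -- illustrative sketch only; names are assumptions, not the real config.
import os

ENV = os.getenv("ENV", "develop")

if ENV == "test":
    # Cap every dataset at a tiny sample so train / test / predict finish in seconds.
    TRAIN_SAMPLE_SIZE = 1_000
    TEST_SAMPLE_SIZE = 1_000
    PREDICT_SAMPLE_SIZE = 1_000
else:
    # None = fall back to whatever sample sizes a real run is configured with.
    TRAIN_SAMPLE_SIZE = None
    TEST_SAMPLE_SIZE = None
    PREDICT_SAMPLE_SIZE = None
```

The test suite could then export ENV=test before invoking pytest, or set it per test with monkeypatch.setenv("ENV", "test"), as long as the config is (re)read after the variable is set.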

vviers commented 3 years ago

Maybe we can try to set ENV="test" within our tests, make our train/test/predict datasets extremely small (e.g. 1,000 entries) when ENV == "test", and have a couple of tests such as:

yes, that's a great idea

  • testing whether python3 -m predictsignauxfaibles runs and outputs a prediction-YYYYMMDD-hhmmss.csv and a stats-YYYYMMDD-hhmmss.json in model_runs/YYYYMMDD-hhmmss (a directory created by the test run) that we would remove afterwards

yes too

  • checking that the default values in stats-YYYYMMDD-hhmmss.json are correct

What do you mean by "correct"? Should we compute all this on a completely deterministic dataset so that we know exactly which metrics to expect here?

  • checking that optional arguments are correctly taken into account and end up in stats-YYYYMMDD-hhmmss.json: model_name, train_from, train_to, train_sample, etc.

sure


In any case, these e2e tests should not be run against the production database but against a fake, dockerized MongoDB instance filled with made-up data (we can look at what they did in opensignauxfaibles).
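
A sketch of how this end-to-end test could look with pytest, assuming the ENV="test" switch above also points the package at such a fake database. The model_runs/YYYYMMDD-hhmmss layout and the file name patterns come from this thread; the fixture, timeout and cleanup logic are assumptions:

```python
# test_e2e_model_run.py -- sketch of the e2e test discussed above (not the repo's
# actual test suite). Assumes ENV="test" makes the run small and points it at a
# fake, pre-seeded MongoDB instead of the production database.
import os
import shutil
import subprocess
from pathlib import Path

import pytest

MODEL_RUNS_DIR = Path("model_runs")


@pytest.fixture
def preexisting_runs():
    """Snapshot existing run directories so we only delete what the test created."""
    MODEL_RUNS_DIR.mkdir(exist_ok=True)
    before = set(MODEL_RUNS_DIR.iterdir())
    yield before
    for created in set(MODEL_RUNS_DIR.iterdir()) - before:
        shutil.rmtree(created)


def test_default_model_run_produces_artifacts(preexisting_runs):
    env = {**os.environ, "ENV": "test"}
    result = subprocess.run(
        ["python3", "-m", "predictsignauxfaibles"],
        env=env,
        capture_output=True,
        timeout=600,  # arbitrary safety net for CI
    )
    assert result.returncode == 0, result.stderr

    new_runs = set(MODEL_RUNS_DIR.iterdir()) - preexisting_runs
    assert len(new_runs) == 1  # exactly one model_runs/YYYYMMDD-hhmmss directory
    run_dir = new_runs.pop()
    assert list(run_dir.glob("prediction-*.csv")), "no prediction CSV was produced"
    assert list(run_dir.glob("stats-*.json")), "no stats JSON was produced"
```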

slebastard commented 3 years ago

What do you mean by "correct"? Should we compute all this on a completely deterministic dataset so that we know exactly which metrics to expect here?

I was thinking of testing:

  • that the fields that should be in the JSON actually exist
  • that stats have the right values, which is feasible for now since most exported stats correspond to the model name, the run ID, optional arguments, etc. In the future, if we start to output more complex stats such as model coverage rate or NA rates, we may be able to test their values on a dummy test dataset. As for performance metrics (classification accuracy, f_beta score, etc.), I don't think we will be able to test them

I think we should implement these tests before our release. @vviers WDYT?
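
A sketch of those two checks. The key names (model_name, run_id, train_from, train_to, train_sample) come from this thread, but the exact schema of the stats JSON is an assumption:

```python
# Sketch only: REQUIRED_FIELDS and the key spellings are assumptions based on the
# fields mentioned in this thread, not the actual stats schema.
import json
from pathlib import Path

REQUIRED_FIELDS = {"model_name", "run_id", "train_from", "train_to", "train_sample"}


def check_stats_file(stats_path: Path, expected_args: dict):
    stats = json.loads(stats_path.read_text())

    # 1) every field that should be in the JSON actually exists
    missing = REQUIRED_FIELDS - stats.keys()
    assert not missing, f"missing fields in {stats_path.name}: {missing}"

    # 2) optional CLI arguments were propagated verbatim into the stats file
    for key, value in expected_args.items():
        assert stats[key] == value, f"{key}: expected {value!r}, got {stats[key]!r}"
```

An e2e test could call this helper on the stats-*.json it just located, passing the same optional arguments it gave the CLI.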

vviers commented 3 years ago

We can discuss this quickly tomorrow :)

If we just want to test that the format of the stats dict is correct and that the correct args are logged, then that's a unit test (perhaps with some patching or mocking needed for your SFDatasets, see https://docs.pytest.org/en/stable/monkeypatch.html)

We can test performance metrics using fake, deterministic testing data (but that would only be interesting if we use our own custom eval metrics, otherwise that's just re-testing sklearn :) )
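
A minimal sketch of the monkeypatching approach, assuming the dataset class is importable as predictsignauxfaibles.data.SFDataset and exposes a fetch_data() method (both are assumptions used here for illustration only):

```python
# Sketch only: the import path and the fetch_data()/data interface are assumptions.
import pandas as pd


class FakeSFDataset:
    """Deterministic in-memory stand-in for SFDataset, so no MongoDB is needed."""

    def __init__(self, *args, **kwargs):
        self.data = pd.DataFrame(
            {"siret": ["00000000000001", "00000000000002"], "outcome": [0, 1]}
        )

    def fetch_data(self):
        return self  # the data is already "loaded"


def test_stats_dict_format(monkeypatch):
    # Swap the real class for the fake one wherever the run logic looks it up.
    monkeypatch.setattr(
        "predictsignauxfaibles.data.SFDataset", FakeSFDataset, raising=False
    )
    # ...then call the code that builds the stats dict and assert on its keys
    # (model_name, run_id, the optional arguments, etc.).
```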

vviers commented 3 years ago

Closing for now; we'll open another issue on "end-to-end" testing when the time comes.