Closed · slebastard closed this 3 years ago
Maybe we can try to set ENV="test" within our tests, make our train/test/predict datasets extremely small (e.g. 1000 entries) when ENV=="test", and have a couple of tests such as:

- testing whether `python3 -m predictsignauxfaibles` runs and outputs a prediction-YYYYMMDD-hhmmss.csv and a stats-YYYYMMDD-hhmmss.json in model_runs/YYYYMMDD-hhmmss (created by the test run) that we would remove afterwards
- checking that the default values in stats-YYYYMMDD-hhmmss.json are correct
- checking that optional arguments are correctly taken into account and end up in stats-YYYYMMDD-hhmmss.json: model_name, train_from, train_to, train_sample, etc.

To this end, I've set up an empty ParserTest object in tests/integration/test_modelrun_components.py. Depending on what we do with it, this test object may either stay there or move to tests/e2e.
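The first check above could be sketched as a small filename-pattern helper for the e2e test to use (the helper name and structure are invented for illustration; only the `prediction-YYYYMMDD-hhmmss.csv` / `stats-YYYYMMDD-hhmmss.json` patterns come from the discussion):

```python
import re

# Hypothetical helper: validates the artifact names the e2e test expects.
RUN_ARTIFACTS = {
    "prediction": re.compile(r"^prediction-\d{8}-\d{6}\.csv$"),
    "stats": re.compile(r"^stats-\d{8}-\d{6}\.json$"),
}

def is_run_artifact(filename: str, kind: str) -> bool:
    """Return True if `filename` matches the expected pattern for `kind`."""
    pattern = RUN_ARTIFACTS.get(kind)
    return bool(pattern and pattern.match(filename))
```

An e2e test could then run `python3 -m predictsignauxfaibles` via `subprocess.run` inside a temporary directory and assert that the created model_runs/YYYYMMDD-hhmmss folder contains exactly one file passing each of these checks, before cleaning the folder up.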
> Maybe we can try to set ENV="test" within our tests, make our train/test/predict datasets extremely small (e.g. 1000 entries) when ENV=="test", and have a couple of tests such as:

yes, that's a great idea

> - testing whether `python3 -m predictsignauxfaibles` runs and outputs a prediction-YYYYMMDD-hhmmss.csv and a stats-YYYYMMDD-hhmmss.json in model_runs/YYYYMMDD-hhmmss (created by the test run) that we would remove afterwards

yes too

> - checking that the default values in stats-YYYYMMDD-hhmmss.json are correct

what do you mean by "correct"? Should we compute all this on a completely deterministic dataset so that we know exactly which metrics to expect here?

> - checking that optional arguments are correctly taken into account and end up in stats-YYYYMMDD-hhmmss.json: model_name, train_from, train_to, train_sample, etc.

sure

In any case, these e2e tests should not be run against the production database, but against a fake, dockerized MongoDB instance filled with made-up data (we can look at what they did in opensignauxfaibles).
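Filling that throwaway MongoDB instance could use a seeded generator so the data is made up but fully deterministic between runs (all field names here are invented for illustration, not the project's real schema):

```python
import random

def make_fake_entries(n: int = 1000, seed: int = 42) -> list:
    """Generate n deterministic, made-up documents to load into a
    throwaway (e.g. dockerized) MongoDB before running the tests."""
    rng = random.Random(seed)
    return [
        {
            # 14-digit identifier, like a SIRET (value is random but seeded)
            "siret": f"{rng.randrange(10**13, 10**14)}",
            "periode": f"2020-{rng.randrange(1, 13):02d}",
            "outcome": rng.random() < 0.1,
        }
        for _ in range(n)
    ]
```

Because the generator is seeded, two test runs see exactly the same 1000 documents, which is what makes value-level assertions on the resulting stats possible at all.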
> what do you mean by "correct"? Should we compute all this on a completely deterministic dataset so that we know exactly which metrics to expect here?

I was thinking of testing:
- that fields that should be in the json actually exist
- that stats have the right values, which is feasible for the moment as most exported stats correspond to the model name, run ID, optional arguments, etc. In the future, if we start to output more complex stats such as model coverage rate or NA rates, we may be able to test their values on a dummy test dataset. For performance metrics (classification accuracy, f_beta score, etc.), I don't think we will be able to test them.
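Those two checks could look roughly like this (a sketch only: the required-field set and the `run_id` key are assumptions based on the fields mentioned in this thread, not the project's actual stats schema):

```python
# Hypothetical check of the stats-YYYYMMDD-hhmmss.json contents.
REQUIRED_FIELDS = {"model_name", "run_id", "train_from", "train_to", "train_sample"}

def missing_stats_fields(stats: dict) -> set:
    """Return the set of required fields absent from the stats dict."""
    return REQUIRED_FIELDS - stats.keys()

def args_logged_correctly(stats: dict, cli_args: dict) -> bool:
    """True if every CLI argument ends up in stats with the same value."""
    return all(stats.get(k) == v for k, v in cli_args.items())
```

A test would load the stats json produced by a run, assert `missing_stats_fields(stats) == set()`, and assert `args_logged_correctly(stats, cli_args)` for the optional arguments the run was invoked with.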
I think we should implement these tests before our release. @vviers WDYT?
We can discuss this quickly tomorrow :)
If we just want to test that the format of the stats dict is correct and that the correct args are logged, then that's a unit test (perhaps with some patching or mocking needed for your SFDatasets; see https://docs.pytest.org/en/stable/monkeypatch.html).
We can test performance metrics using fake, deterministic testing data (but that would only be interesting if we use our own custom eval metrics, otherwise that's just re-testing sklearn :) )
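The patching idea above could be sketched with `unittest.mock.patch.object` (pytest's `monkeypatch` fixture works similarly); note that `SFDataset` here is a toy stand-in defined for the example, not the project's real class, and `run_stats` is an invented stand-in for the code under test:

```python
from unittest.mock import patch

class SFDataset:
    """Toy stand-in for the project's dataset class (real API may differ)."""
    def fetch_data(self):
        raise RuntimeError("would hit the production database")

def run_stats(dataset: SFDataset) -> dict:
    """Invented stand-in for the code under test: builds a stats dict."""
    rows = dataset.fetch_data()
    return {"n_rows": len(rows), "model_name": "default"}

def test_stats_format():
    # Patch the fetch so the unit test never touches a real database.
    with patch.object(SFDataset, "fetch_data", return_value=[{"x": 1}] * 10):
        stats = run_stats(SFDataset())
    assert stats["n_rows"] == 10
    assert "model_name" in stats
```

This keeps the stats-format check fast and database-free, leaving the real database interaction to the separate e2e tests discussed above.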
Closing for now, we'll open another issue on "end-to-end" testing when the time comes
We should set up an e2e test of running the default model, training on, say, 1000 examples, and testing and predicting on similar volumes. What do you think?