agstephens commented 1 year ago

See updated (improved) text below...

agstephens commented 1 year ago

@alaniwi @cehbrecht @huard: here are my thoughts about building a more automated system for unit test building/running for ESGF datasets. Do you have any thoughts about how we can best do it?

huard commented 1 year ago

Not sure I understand the need to template the test builder, but I'm probably missing something.

A few ideas in no particular order...

One heuristic we can use here to reduce the test volume is to assume that all files are structured identically under a given directory structure. I've used this to "walk" through the catalog, and pick only one dataset per "level". This will make sure that every model is tested, without going through every variable, member and time step for a given model. This single dataset can be randomized to increase coverage over time.

Define a few test bounding boxes for each domain (CORDEX, global), making sure to cover corner cases (the poles, longitude 0 and 180). The DRS should be sufficient to infer what the domain is for each dataset.

Use pytest.mark.parametrize to apply tests to a programmatically defined list of datasets.
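The bounding-box idea above could be sketched roughly as follows. This is only an illustration: the box coordinates, the `TEST_BBOXES` and `infer_domain` names, the (west, south, east, north) ordering, and the rule "first DRS facet is the project" are all assumptions, not anything defined in roocs.

```python
# Hypothetical test areas as (west, south, east, north) in degrees.
TEST_BBOXES = {
    "global": {
        "north_pole": (-10.0, 80.0, 10.0, 90.0),
        "south_pole": (-10.0, -90.0, 10.0, -80.0),
        "greenwich": (-5.0, 40.0, 5.0, 50.0),    # straddles longitude 0
        "dateline": (175.0, -5.0, -175.0, 5.0),  # straddles longitude 180
    },
    "cordex": {
        # Assumes a European regional domain; other domains need their own box.
        "interior": (5.0, 45.0, 10.0, 50.0),
    },
}


def infer_domain(dataset_id):
    """Guess the domain from the first DRS facet (assumed to be the project)."""
    project = dataset_id.split(".")[0].lower()
    return "cordex" if project == "cordex" else "global"


def bboxes_for(dataset_id):
    """Return the named test boxes that apply to this dataset."""
    return TEST_BBOXES[infer_domain(dataset_id)]
```

Keeping the boxes in one dictionary means a newly discovered corner case is added once and is then applied to every dataset in that domain.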

agstephens commented 1 year ago

Hi @huard: thanks for your response. It's a very good point that using parametrize is a better approach than building millions of tests from a template. I think that I was confusing "needing to test lots of datasets" with "needing lots of tests". My reasoning was that we might find corner cases where we want specific tests. However, the actual code (i.e. not the tests) should be where we fix how those corner cases are handled, and then the tests themselves just run. I will adjust the suggested plan below, simplified to:

Plan for extensive unit testing of ESGF data in "roocs" - with subset

Need tests per:

project
node/site
that can just run as unit tests if flagged (for each relevant site)

Need tests that cover the functionality exposed, and deal with corner cases, e.g.:

get dimensions
assign a small bounding box inside dims
- Define a few test bounding boxes for each domain (CORDEX, global), making sure to cover corner cases (the poles, longitude 0 and 180). The DRS should be sufficient to infer what the domain is for each dataset.
assign subsets in other dimensions (time/level)
run subset
check valid array with min and max being different
check each dimension is in range of the subset specified
check output array is not all missing_values
any other assertions that are useful
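As a rough sketch of the output checks listed above, the function below works on plain Python lists so that it has no dependencies. The function name, the `1e20` default fill value and the flat-list representation are assumptions made for illustration; a real test would more likely inspect the xarray output of the subset call.

```python
import math


def check_subset_output(values, coords, requested, missing_value=1e20):
    """Return a list of failure messages (empty when all checks pass).

    values:    flat list of data values from the subset output
    coords:    mapping of dimension name -> list of coordinate values
    requested: mapping of dimension name -> (low, high) that was asked for
    """
    problems = []

    # Ignore fill values and NaNs when looking at the data range.
    valid = [v for v in values if v != missing_value and not math.isnan(v)]
    if not valid:
        problems.append("output is all missing values")
    elif min(valid) == max(valid):
        problems.append("output min and max are identical")

    # Every returned coordinate should sit inside the requested range.
    for name, (low, high) in requested.items():
        axis = coords[name]
        if min(axis) < low or max(axis) > high:
            problems.append(f"{name} outside requested range")

    return problems
```

Returning messages instead of asserting directly lets a single test report every failed check for a dataset at once, for example with `assert not problems, problems`.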

Use pytest.mark.parametrize to handle lists/dictionaries of inputs that cover the myriad datasets we want to test:

the data structure that we use for the test inputs can grow and grow
it will run differently at different sites based on what data they have
make sure that every model is tested without going through every variable, member and time step for a given model.
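A minimal sketch of how the parametrized, per-site test could look. The record layout, the placeholder dataset IDs, the `DATA_POOL_SITE` environment variable and the `records_for_site` helper are all hypothetical; only `pytest.mark.parametrize` itself is real pytest API.

```python
import os

import pytest

# Hypothetical records; the dataset IDs are placeholders, not real holdings.
DATA_POOL_TESTS_DB = [
    {"dataset_id": "PROJECT.placeholder.dataset-a", "site": "ceda"},
    {"dataset_id": "PROJECT.placeholder.dataset-b", "site": "dkrz"},
]

# Assumed convention: the current site is named by an environment variable.
CURRENT_SITE = os.environ.get("DATA_POOL_SITE", "ceda")


def records_for_site(records, site):
    """Keep only the records whose data is held at the given site."""
    return [record for record in records if record["site"] == site]


@pytest.mark.parametrize(
    "record",
    records_for_site(DATA_POOL_TESTS_DB, CURRENT_SITE),
    ids=lambda record: record["dataset_id"],
)
def test_subset_in_data_pools(record):
    # The real test would run the subset and check its output here.
    assert record["site"] == CURRENT_SITE
```

Because the list is filtered before it reaches `parametrize`, the same test module produces a different set of test cases at each site, and the list can grow without any new test functions being written.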

agstephens commented 1 year ago

Some thoughts about how we tackle this problem:

[x] Build a single test function called test_subset_in_data_pools in the module tests/test_data_pools.py
- [x] Define a simple variable, data_pool_tests_db which is a list with one dataset ID in it.
- [x] Pick a CMIP6 dataset at CEDA and make it the only record in data_pool_tests_db (a list - to start with)
- [x] Use @pytest.mark.parametrize("record", data_pool_tests_db)
- [x] And: def test_subset_in_data_pools(record):
- [x] get dimensions
- [x] assign a small bounding box inside dims
- [x] Define a few test bounding boxes for each domain (CORDEX, global), making sure to cover corner cases (the poles, longitude 0 and 180). The DRS should be sufficient to infer what the domain is for each dataset.
- [x] assign subsets in other dimensions (time/level)
- [x] run subset - based on these example tests (invoking the Python library interface to daops...subset()):
- https://github.com/roocs/daops/blob/master/tests/test_operations/test_subset.py
- [x] check valid array with min and max being different
- [x] check each dimension is in range of the subset specified
- [x] check output array is not all missing_values
- [x] any other assertions that are useful
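For the "get dimensions, then assign a small bounding box inside dims" steps, one possible approach is to take a centred slice of each axis. This is a sketch only: `small_range_inside`, `small_bbox` and the `fraction` parameter are hypothetical names, and the coordinate lists stand in for whatever the dataset's dimensions actually provide.

```python
def small_range_inside(values, fraction=0.1):
    """Return (low, high) covering roughly `fraction` of the axis, centred."""
    low, high = min(values), max(values)
    centre = (low + high) / 2
    half_width = (high - low) * fraction / 2
    return (centre - half_width, centre + half_width)


def small_bbox(lons, lats, fraction=0.1):
    """Build a (west, south, east, north) box well inside the data extent."""
    west, east = small_range_inside(lons, fraction)
    south, north = small_range_inside(lats, fraction)
    return (west, south, east, north)
```

The same `small_range_inside` helper could be reused for numeric level axes. A centred box never touches the edges of the domain, so it complements, and does not replace, the fixed corner-case boxes.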

Later, we'll work out the following:

[x] Only deal with the database element when the test is well-formed
[x] Some thoughts about the db:
- we can split it into multiple DBs / files (per site) - because there is no interaction
- we could use sqlite for each DB, then combine them before each git commit into a CSV file for visibility.
- should it be an actual database, or a pickle, or CSV file?
- Fields it might contain:
- dataset_id(s)
- site - e.g. ceda, dkrz etc.
- domain - e.g. actual domain of dataset in time and space
- last_ran_subset_parameters - i.e. the time and space constraints applied in the last test
- result - of last test (success or fail)
- code_version(s) - of key libraries
- last_updated - when was it last run?
- Might want to be able to say:
- only run if not run before
- only run if last_updated older than 2 years
- re-run everything
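The three re-run options above could be expressed as a small policy function over one record. This is a sketch under assumptions: the `PoolRecord` class, the policy names (`"new_only"`, `"stale"`, `"all"`) and the 730-day threshold for "2 years" are illustrative, and only a few of the candidate fields are included.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PoolRecord:
    """One row of the hypothetical test-results store."""
    dataset_id: str
    site: str
    result: Optional[str] = None            # "success" or "fail"
    last_updated: Optional[datetime] = None  # None means never run


def should_run(record, policy, now, max_age=timedelta(days=730)):
    """Decide whether a dataset needs testing under the given policy."""
    if policy == "all":
        return True                          # re-run everything
    if record.last_updated is None:
        return True                          # never run before
    if policy == "new_only":
        return False                         # already run once
    if policy == "stale":
        return now - record.last_updated > max_age
    raise ValueError(f"unknown policy: {policy}")
```

Passing `now` in as an argument keeps the function deterministic, which makes the policy itself easy to unit test.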

agstephens commented 1 year ago

Discussing with @cehbrecht, how we might decide which dataset IDs to send into this test...

We might assume...:

we want to test a large coverage of the overall project simulations (which can be considered as a sparse hypercube of facet key/values)
e.g. we want to test some data from each model, for each frequency, for each variable
we should always test the latest version
there are some facets that we might be able to ignore when sampling to get a representative coverage, e.g.:
- If 3 institutions ran the EC-EARTH model: we can (hopefully) assume that we only need that model from one institution - so we do not need to sample across institutions.
- Most variables within a given ensemble should have a common structure - so we don't need to test them all

The process could be:

Get a list of all datasets - maybe cached as a .csv.gz file (or other compression) alongside the tests
Work out a subset of that list based on sampling across each combination of facets
Store the list of samples ready for testing
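The sampling step could look roughly like the sketch below: keep only the latest version of each dataset, group by the facets we care about, and pick one dataset per group. The facet order assumes CMIP6-style dot-separated IDs, and the function names and default `sample_facets` are hypothetical.

```python
import random
from collections import defaultdict

# Assumed CMIP6-style DRS facet order; other projects will differ.
FACETS = ["mip_era", "activity", "institution", "model", "experiment",
          "member", "table", "variable", "grid", "version"]


def latest_versions(dataset_ids):
    """Keep only the highest version string for each dataset."""
    latest = {}
    for ds_id in dataset_ids:
        stem, version = ds_id.rsplit(".", 1)
        if stem not in latest or version > latest[stem]:
            latest[stem] = version
    return [f"{stem}.{version}" for stem, version in sorted(latest.items())]


def sample_datasets(dataset_ids, sample_facets=("model", "table"), seed=None):
    """Pick one dataset per combination of the chosen facets."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ds_id in latest_versions(dataset_ids):
        facets = dict(zip(FACETS, ds_id.split(".")))
        key = tuple(facets[name] for name in sample_facets)
        groups[key].append(ds_id)
    # Sort keys and members so a given seed always gives the same sample.
    return [rng.choice(sorted(groups[key])) for key in sorted(groups)]
```

Leaving `institution` out of `sample_facets` implements the assumption that one institution per model is enough, and changing the seed between runs widens coverage over time.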

agstephens commented 12 months ago

@alaniwi here is the image that I shared today:

agstephens commented 11 months ago

Provide multi-site support as follows:

[x] Write the site name (e.g. "ceda" or "dkrz") into a version of the "csv.gz" file.
[x] Include a consolidate command/script to merge all results "csv.gz" files from multiple sites into a single file.
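For illustration, a consolidation step of this kind can be written with the standard library alone. This is a sketch and not the daops implementation: the `merge_logs` name and the assumption that each file is a gzipped CSV with a header row (including a `site` column) are mine.

```python
import csv
import gzip


def merge_logs(input_paths, output_path):
    """Merge several gzipped CSV result files into one; return the row count."""
    rows = []
    fieldnames = []
    for path in input_paths:
        with gzip.open(path, "rt", newline="") as handle:
            reader = csv.DictReader(handle)
            # Collect the union of columns, preserving first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)

    with gzip.open(output_path, "wt", newline="") as handle:
        # restval fills in columns that a given site's file did not have.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Since each row already carries its site name, the merge is a plain concatenation and no de-duplication across sites is attempted here.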

alaniwi commented 11 months ago

Code for the multi-site support is implemented, and a command-line script merge-test-logs is added -- in addition to the data-pools-checks command-line script that generates the logs in the first place.

https://github.com/roocs/daops/blob/52e32b1697f607eb93828dc509cde6ddd7ab6bad/setup.py#L83-L84

Currently this is in the test_data_pools_new branch. @cehbrecht I'll generate a new PR. With the exception of the above two added lines in setup.py (and a couple of gitignore lines), this only adds new files under a new subdirectory, so should hopefully be an easy merge.

alaniwi commented 11 months ago

@cehbrecht PR is at https://github.com/roocs/daops/pull/108 but one of the tests is failing. I'll take a look next week.


Plan for extensive unit testing of ESGF data #222
