roocs / rook

A Web Processing Service for roocs: remote operations on climate simulations.
https://rook-wps.readthedocs.io/en/latest/
Apache License 2.0

Plan for extensive unit testing of ESGF data #222

Open agstephens opened 1 year ago

agstephens commented 1 year ago

See updated (improved) text below...

agstephens commented 1 year ago

@alaniwi @cehbrecht @huard: here are my thoughts about building a more automated system for unit test building/running for ESGF datasets. Do you have any thoughts about how we can best do it?

huard commented 1 year ago

Not sure I understand the need to template the test builder, but I'm probably missing something.

A few ideas in no particular order...

One heuristic we can use here to reduce the test volume is to assume that all files are structured identically under a given directory structure. I've used this to "walk" through the catalog, and pick only one dataset per "level". This will make sure that every model is tested, without going through every variable, member and time step for a given model. This single dataset can be randomized to increase coverage over time.

Define a few test bounding boxes for each domain (CORDEX, global), making sure to cover corner cases (the poles, longitudes 0 and 180). The DRS should be sufficient to infer the domain of each dataset.
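As a rough sketch of the bounding-box idea, the mapping below keeps a small set of named test boxes per domain and infers the domain from the first DRS facet of a dataset ID. The box values, the `(west, south, east, north)` ordering, and the names `TEST_BBOXES`, `infer_domain` and `bboxes_for` are all assumptions for illustration, not part of roocs; a real version would probably need boxes per CORDEX region.

```python
# Hypothetical test bounding boxes as (west, south, east, north) in degrees.
# The values are illustrative assumptions, not taken from the roocs code base.
TEST_BBOXES = {
    "global": {
        "north_pole": (-180.0, 80.0, 180.0, 90.0),
        "south_pole": (-180.0, -90.0, 180.0, -80.0),
        "greenwich": (-10.0, 40.0, 10.0, 60.0),    # straddles longitude 0
        "dateline": (170.0, -10.0, -170.0, 10.0),  # straddles longitude 180
    },
    "cordex": {
        # Assumed boxes for a European domain; other regions need their own.
        "interior": (5.0, 45.0, 15.0, 55.0),
        "greenwich": (-5.0, 45.0, 5.0, 55.0),
    },
}


def infer_domain(dataset_id: str) -> str:
    """Guess the domain from the first DRS facet (the project) of a dataset ID."""
    project = dataset_id.split(".", 1)[0].lower()
    return "cordex" if project == "cordex" else "global"


def bboxes_for(dataset_id: str) -> dict:
    """Return the named test bounding boxes that apply to this dataset."""
    return TEST_BBOXES[infer_domain(dataset_id)]
```

Keeping the boxes in one dictionary means a new corner case is added in a single place and is picked up by every dataset in that domain.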

Use pytest.mark.parametrize to apply tests to a programmatically defined list of datasets.
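A minimal sketch of what that could look like is given here. The dataset IDs and time ranges are made-up placeholders, and `run_subset` is a hypothetical stand-in for whatever subset call is under test (it is not the rook or daops API). Stacking two `parametrize` decorators makes pytest run the test once for every combination of dataset and time range.

```python
import pytest

# Hypothetical sample; in practice this list would be generated
# programmatically, e.g. one dataset per model from the catalog.
DATASET_IDS = [
    "CMIP6.CMIP.EXAMPLE-INST.EXAMPLE-MODEL-A.historical.r1i1p1f1.Amon.tas.gn.v20200101",
    "CMIP6.CMIP.EXAMPLE-INST.EXAMPLE-MODEL-B.historical.r1i1p1f1.Amon.tas.gn.v20200101",
]

# Hypothetical time ranges in "start/end" form.
TIME_RANGES = ["1990-01-01/1990-12-31", "2000-01-01/2000-01-31"]


def run_subset(dataset_id, time):
    """Placeholder for the real subset operation; replace with the call under test."""
    return {"dataset_id": dataset_id, "time": time}


@pytest.mark.parametrize("time", TIME_RANGES)
@pytest.mark.parametrize("dataset_id", DATASET_IDS)
def test_subset(dataset_id, time):
    # One test function, run for each (dataset_id, time) combination.
    result = run_subset(dataset_id, time)
    assert result["dataset_id"] == dataset_id
    assert result["time"] == time
```

With two datasets and two time ranges, pytest collects four test cases from the single function, so the test count scales with the input lists and not with the amount of test code.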

agstephens commented 1 year ago

Hi @huard: thanks for your response. It's a very good point that using parametrize is a better approach than building millions of tests from a template. I think I was confusing "needing to test lots of datasets" with "needing lots of tests". My reasoning was that we might find corner cases where we want specific tests. However, the actual code (i.e. not the tests) is where we should fix how those corner cases are handled, and then the tests themselves just run. I will adjust the suggested plan below, simplified to:

Plan for extensive unit testing of ESGF data in "roocs" - with subset

Need tests per:

Need tests that cover the functionality exposed, and deal with corner cases, e.g.:

Use pytest.mark.parametrize to handle lists/dictionaries of inputs that cover the myriad datasets we want to test:

agstephens commented 1 year ago

Some thoughts about how we tackle this problem:

Later, we'll work out the following:

agstephens commented 1 year ago

Discussing with @cehbrecht, how we might decide which dataset IDs to send into this test...

We might assume...:

The process could be:

  1. Get a list of all datasets - maybe cached as a .csv.gz file (or other compression) alongside the tests
  2. Work out a subset of that list based on sampling across each combination of facets
  3. Store the list of samples ready for testing
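The three steps above could be sketched roughly as follows. This assumes the cached list is a gzipped CSV with one dataset ID per row and that IDs follow a CMIP6-style facet order; the facet names, the `group_by` default and all function names are illustrative assumptions, not an agreed design.

```python
import csv
import gzip
import random
from collections import defaultdict

# Assumed CMIP6-style facet order within a dataset ID; other projects differ.
FACETS = ["mip_era", "activity", "institution", "model", "experiment",
          "member", "table", "variable", "grid", "version"]


def read_dataset_ids(path):
    """Step 1: load the cached list of all dataset IDs from a .csv.gz file."""
    with gzip.open(path, "rt", newline="") as f:
        return [row[0] for row in csv.reader(f) if row]


def sample_by_facets(dataset_ids, group_by=("model", "table"), per_group=1, seed=0):
    """Step 2: pick `per_group` datasets for each combination of the chosen facets."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for ds_id in dataset_ids:
        facets = dict(zip(FACETS, ds_id.split(".")))
        key = tuple(facets.get(name, "") for name in group_by)
        groups[key].append(ds_id)
    sample = []
    for key in sorted(groups):
        members = sorted(groups[key])
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


def write_dataset_ids(path, dataset_ids):
    """Step 3: store the sampled IDs, one per row, ready for the tests to read."""
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        for ds_id in dataset_ids:
            writer.writerow([ds_id])
```

Changing the seed between runs would give the randomized coverage over time mentioned earlier in the thread, while a fixed seed makes a failing sample easy to reproduce.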

agstephens commented 12 months ago

@alaniwi here is the image that I shared today:

image

agstephens commented 11 months ago

Provide multi-site support as follows:

alaniwi commented 11 months ago

Code for the multi-site support is implemented, and a command-line script, merge-test-logs, has been added, in addition to the data-pools-checks command-line script that generates the logs in the first place.

https://github.com/roocs/daops/blob/52e32b1697f607eb93828dc509cde6ddd7ab6bad/setup.py#L83-L84

Currently this is in the test_data_pools_new branch. @cehbrecht I'll generate a new PR. With the exception of the above two added lines in setup.py (and a couple of gitignore lines), this only adds new files under a new subdirectory, so should hopefully be an easy merge.

alaniwi commented 11 months ago

@cehbrecht PR is at https://github.com/roocs/daops/pull/108 but one of the tests is failing. I'll take a look next week.