Feature/ Handle V3 sample specification

annakrystalli commented 1 month ago

This PR implements, and adds tests for, checks on the validity of sample submissions using the v3 schema sample spec. See the v3 sample validation spec for details.

Specific Sample validation tests implemented (#80)

[x] Validate that all value combinations are valid (as part of check_tbl_values())
[x] Validate that all required value combinations are submitted (as part of check_tbl_values_required())
[x] Validate that the correct number of samples per compound idx is submitted. Added through function check_tbl_spl_n().
[x] Validate that samples within a submission file contain the same combination of optional (non-compound task id) values across all samples. Added through function check_tbl_spl_non_compound_tid().
[x] Validate that samples conform to the sample dependence defined by the compound task id set configuration, i.e. all samples for a given compound idx contain the same unique combination of compound task id values. Added through function check_tbl_spl_compound_tid().
[x] Add the new validation checks to validate_model_data() in a backwards-compatible way (i.e. only deploy them if using a v3 config).
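
As a rough illustration of the backwards-compatible gating in the last item, the sketch below adds the sample checks only when the config's schema version is v3 or later. It is written in Python for brevity and is not the package's R code: `schema_major_version()`, `select_checks()` and the format of the version string are hypothetical, and treating later major versions like v3 is also an assumption. Only the check names come from this PR.

```python
import re


def schema_major_version(schema_version: str) -> int:
    """Extract the major version from a string containing e.g. 'v3.0.0'."""
    match = re.search(r"v(\d+)\.\d+\.\d+", schema_version)
    if match is None:
        raise ValueError(f"no version found in {schema_version!r}")
    return int(match.group(1))


def select_checks(schema_version: str) -> list:
    """Return the names of the checks to run for a given schema version."""
    # Checks that apply regardless of schema version.
    checks = ["check_tbl_values", "check_tbl_values_required"]
    # Sample-specific checks are only added for v3 (assumed: and later) configs,
    # so hubs on older schemas see no change in behaviour.
    if schema_major_version(schema_version) >= 3:
        checks += [
            "check_tbl_spl_compound_tid",
            "check_tbl_spl_non_compound_tid",
            "check_tbl_spl_n",
        ]
    return checks


v2_checks = select_checks("v2.0.0")
v3_checks = select_checks("v3.0.0")
```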

The key to the new functions check_tbl_spl_n(), check_tbl_spl_non_compound_tid() and check_tbl_spl_compound_tid() is a table of hashes computed on the model output data and joined to the output of the new hubData::expand_model_out_val_grid(include_sample_ids = TRUE), in which the output type id column for v3 samples effectively contains the compound_idx. The hashes are calculated on the relevant subsets of values of each sample and aggregated or counted at the level relevant to each check, i.e.:

check_tbl_spl_n(): Count unique output type id values per compound idx. The hash table provides a mapping between output type ids and compound idxs.
check_tbl_spl_compound_tid(): Ensure there is only a single unique hash of the combination of values across compound task id columns of all rows associated with samples for a given compound idx.
check_tbl_spl_non_compound_tid(): Ensure there is only a single unique hash of the combination of values across non-compound task id columns of all rows associated with samples for a modeling task.
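
The three checks above can be sketched roughly as follows. This is a minimal Python illustration of the hashing idea, not the package's R implementation. The row layout, the `grid` lookup (a stand-in for the join against the expanded grid), taking the compound idx from a sample's first row, and the function names without the `tbl_` prefix are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def _hash(combos):
    """Deterministic hash of a set of value combinations."""
    return hashlib.sha256(repr(sorted(combos)).encode()).hexdigest()


def spl_hash_tbl(rows, compound_tids, non_compound_tids, grid):
    """Build one entry per sample (output_type_id) with its compound idx and hashes."""
    by_sample = defaultdict(list)
    for row in rows:
        by_sample[row["output_type_id"]].append(row)
    tbl = []
    for sample_id, sample_rows in by_sample.items():
        comp = {tuple(r[c] for c in compound_tids) for r in sample_rows}
        non_comp = {tuple(r[c] for c in non_compound_tids) for r in sample_rows}
        # Assumption: the compound idx is looked up from the sample's first row.
        first = tuple(sample_rows[0][c] for c in compound_tids)
        tbl.append({
            "output_type_id": sample_id,
            "compound_idx": grid[first],
            "hash_comp_tid": _hash(comp),
            "hash_non_comp_tid": _hash(non_comp),
        })
    return tbl


def check_spl_compound_tid(tbl):
    """A single unique compound task id hash per compound idx."""
    hashes = defaultdict(set)
    for entry in tbl:
        hashes[entry["compound_idx"]].add(entry["hash_comp_tid"])
    return all(len(h) == 1 for h in hashes.values())


def check_spl_non_compound_tid(tbl):
    """A single unique non-compound task id hash across the modeling task."""
    return len({entry["hash_non_comp_tid"] for entry in tbl}) == 1


def check_spl_n(tbl, min_n, max_n):
    """Number of unique output type ids per compound idx within [min_n, max_n]."""
    ids = defaultdict(set)
    for entry in tbl:
        ids[entry["compound_idx"]].add(entry["output_type_id"])
    return all(min_n <= len(s) <= max_n for s in ids.values())


# Two samples per location, each covering horizons 1 and 2.
rows = [
    {"location": loc, "horizon": h, "output_type_id": sid, "value": 0.0}
    for sid, loc in [("s1", "US"), ("s2", "US"), ("s3", "CA"), ("s4", "CA")]
    for h in (1, 2)
]
grid = {("US",): 1, ("CA",): 2}  # hypothetical compound task id values -> compound idx
tbl = spl_hash_tbl(rows, ["location"], ["horizon"], grid)
```

In this toy setup, a sample whose rows span two locations ends up with a different compound task id hash from the other samples at the same compound idx, so `check_spl_compound_tid()` fails. A sample missing a horizon changes its non-compound hash, so `check_spl_non_compound_tid()` fails.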

These checks are performed separately for each round modeling task item, allowing for differences between compound task id sets between round modeling tasks.
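
To make the per-modeling-task point concrete, here is a small Python sketch in which each modeling task carries its own compound task id set, and the task id columns are split into compound and non-compound groups separately for each. The config structure shown is a simplified, hypothetical stand-in for a real hub config, and `split_task_ids()` is an illustrative helper, not a package function.

```python
def split_task_ids(task_ids, compound_taskid_set):
    """Partition task id columns into (compound, non-compound), keeping column order."""
    compound = [t for t in task_ids if t in compound_taskid_set]
    non_compound = [t for t in task_ids if t not in compound_taskid_set]
    return compound, non_compound


# Hypothetical round with two modeling tasks whose compound task id sets differ.
round_model_tasks = [
    {"task_ids": ["location", "horizon"],
     "compound_taskid_set": ["location"]},
    {"task_ids": ["location", "horizon", "age_group"],
     "compound_taskid_set": ["location", "age_group"]},
]

# Each modeling task gets its own split, so the sample checks can be run
# against the columns that apply to that task only.
plans = [
    split_task_ids(mt["task_ids"], mt["compound_taskid_set"])
    for mt in round_model_tasks
]
```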

Still to do:

[ ] The spl_hash_tbl() function, on which many of the new checks depend, can be time consuming with complex configs. I have attempted memoisation but encountered difficulties in testing, so this is still a work in progress. It shouldn't change the rest of the functionality.
[ ] Add more tests, especially varying the compound task id set.

github-actions[bot] commented 1 month ago

🚀 Deployed on https://66728738769adaca9558e175--hubvalidations-pr-previews.netlify.app

annakrystalli commented 3 weeks ago

I was also debating whether it makes sense to have somewhere in the documentation (function documentation and/or vignette) a warning saying that large files with a lot of samples might take time to validate. However, as we don't have a clear estimate of what "large" and "take time" mean, I am not sure how helpful it is.

While it would be useful, I agree that because "large" and "take time" are hard to define properly, I'm not sure how helpful such a warning would be. There are plans to improve the performance of validations, though, so as part of that work we might get a better sense of what would be useful in the documentation. I'll draft the performance issue today and make a note about including more on performance in the docs too.

annakrystalli commented 2 weeks ago

Thanks so much for your review and thorough testing, @LucieContamin! It's been really useful to work through. In response I've made a number of changes to the functionality / docs:

Firstly, to make things more streamlined, I've changed the sequence of execution of the sample checks and set the compound task id and non-compound task id checks to return errors and cause validation to return early. That way samples are only counted once we know we have well-formed samples (I know it's different to what you do in scenario hub but it makes more sense to me atm, happy to revisit and get more opinions in the next round of work on samples though!).
Next, I've reworked the check messages, the names of the objects returned when validation fails, and the information returned as part of each error. Hopefully the information returned is now much more useful and intuitive, and the docs now contain enough to explain what each check failure means and direct a team to fixing it.
I've also opened this issue to add functionality to validate coarser compound task id sets which atm hubValidations does not support (https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/issues/88). I have added questions to that issue to help me understand the functionality better, your input would be greatly appreciated!
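
The reordered execution described in the first point (task id checks first, return early on failure, count samples last) can be sketched as below. This is a Python illustration only: `run_sample_checks()` and the lambda stand-ins are hypothetical, and the relative order of the compound and non-compound checks is an assumption, since only their position before the sample count is stated.

```python
def run_sample_checks(checks):
    """Run (name, fn) checks in order and stop at the first failure."""
    results = []
    for name, fn in checks:
        passed = fn()
        results.append((name, passed))
        if not passed:
            # Return early: later checks assume the earlier ones passed.
            break
    return results


# Stand-in checks; the second one fails, so samples are never counted.
checks = [
    ("check_tbl_spl_compound_tid", lambda: True),
    ("check_tbl_spl_non_compound_tid", lambda: False),
    ("check_tbl_spl_n", lambda: True),
]
results = run_sample_checks(checks)
```

Here `results` holds only the first two checks, since the count is skipped once a task id check has failed.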

Let me know if these resolve your issues for the time being and feel free to open more issues if you think there's more that needs to be addressed.

hubverse-org / hubValidations

Feature/ Handle V3 sample specification #82
