Closed annakrystalli closed 2 weeks ago
I was also debating whether it makes sense to have somewhere in the documentation (function documentation and/or vignette) a warning saying that large files with a lot of samples might take time to validate. However, as we don't have a clear estimate of what counts as "large" or "take time", I am not sure how helpful it would be.
While it would be useful, I agree that, as "large" and "take time" are hard to define properly, I'm not sure just how useful it would be. There are plans to try and improve the performance of validations though, so as part of that work we might get a better sense of what would be useful in the documentation too. I'll draft the performance issue today and make a note about including more on performance in the docs.
Firstly thanks so much for your review and thorough testing @LucieContamin ! It's been really useful to work through. In response I've made a number of changes to the functionality / docs:
errors. Hopefully, the information returned is much more useful and intuitive and the information in the docs is now enough to help explain what each check failure means and direct a team to fixing it.
hubValidations does not support (https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/issues/88). I have added questions to that issue to help me understand the functionality better, your input would be greatly appreciated!

Let me know if these resolve your issues for the time being and feel free to open more issues if you think there's more that needs to be addressed.
This PR implements, and adds new tests for, checks on the validity of sample submissions using the v3 schema sample spec. See v3 sample validation spec for details.
Specific Sample validation tests implemented (#80)
check_tbl_values()
check_tbl_values_required()
check_tbl_spl_n()
check_tbl_spl_non_compound_tid()
check_tbl_spl_compound_tid()
The key to the new functions check_tbl_spl_n(), check_tbl_spl_non_compound_tid() and check_tbl_spl_compound_tid() is a table of hashes on model output data joined to the output of the new hubData::expand_model_out_val_grid(include_sample_ids = TRUE), where the output type id column for v3 samples effectively contains the compound_idx. The hashes are calculated on the relevant subsets of values of each sample and aggregated/counted at the relevant level for each check, i.e.:
check_tbl_spl_compound_tid(): Ensure there is only a single unique hash of the combination of values across compound task id columns of all rows associated with samples for a given compound idx.
check_tbl_spl_non_compound_tid(): Ensure there is only a single unique hash of the combination of values across non-compound task id columns of all rows associated with samples for a modeling task.

These checks are performed separately for each round modeling task item, allowing for differences between compound task id sets between round modeling tasks.
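
To make the idea behind the two checks more concrete, here is a rough base R sketch on toy data. It is not the package's implementation: the column roles (location as the compound task id, horizon as the non-compound task id), the helper names sample_keys() and n_combos(), and the use of a pasted string in place of a real hash are all assumptions made for illustration.

```r
# Toy model output with two samples (s1, s2).
# Assumption: "location" is the compound task id, "horizon" the non-compound one.
tbl <- data.frame(
  location = c("US", "US", "US", "US"),
  horizon = c(1, 2, 1, 2),
  output_type_id = c("s1", "s1", "s2", "s2"),
  value = c(10, 12, 11, 13),
  stringsAsFactors = FALSE
)

# Build one string per row from the chosen columns (hypothetical helper).
row_combos <- function(tbl, cols) {
  do.call(paste, c(unname(as.list(tbl[cols])), sep = "|"))
}

# Stand-in for a hash: collapse each sample's sorted unique combinations
# of the chosen columns into a single string key.
sample_keys <- function(tbl, cols) {
  vapply(
    split(row_combos(tbl, cols), tbl$output_type_id),
    function(x) paste(sort(unique(x)), collapse = ";"),
    character(1)
  )
}

# Number of distinct combinations of the chosen columns within each sample.
n_combos <- function(tbl, cols) {
  vapply(
    split(row_combos(tbl, cols), tbl$output_type_id),
    function(x) length(unique(x)),
    integer(1)
  )
}

# Compound check: every sample maps to exactly one compound combination.
compound_ok <- all(n_combos(tbl, "location") == 1L)

# Non-compound check: all samples cover the same non-compound combinations,
# so there is a single unique key across samples.
non_compound_ok <- length(unique(sample_keys(tbl, "horizon"))) == 1L

# A sample that spans two locations should fail the compound check.
tbl_bad <- tbl
tbl_bad$location[4] <- "CA"
compound_bad_ok <- all(n_combos(tbl_bad, "location") == 1L)
```

In this toy case compound_ok and non_compound_ok are TRUE, while compound_bad_ok is FALSE because sample s2 spans two locations. The real checks additionally work per round modeling task item, which this sketch ignores.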
Still to do:
spl_hash_tbl(), on which many of the new checks depend, can be time consuming with complex configs. I have attempted memoisation but have encountered difficulties in testing, so this is still a work in progress, but it shouldn't change the rest of the functionality.
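
For anyone unfamiliar with memoisation, the general pattern can be sketched in base R as below. This is only an illustration of the technique, not what the PR does: memoise_by_key(), slow_hash_tbl() and the round id string are hypothetical names, and the cache is keyed on a single character argument for simplicity. The memoise package offers a more general memoise() wrapper.

```r
# Wrap a function so that results are cached by a character key
# (hypothetical helper, not part of hubValidations).
memoise_by_key <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(key, ...) {
    if (!exists(key, envir = cache, inherits = FALSE)) {
      # First call for this key: compute and store the result.
      assign(key, fn(key, ...), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Stand-in for an expensive computation; counts how often it really runs.
calls <- 0
slow_hash_tbl <- function(round_id) {
  calls <<- calls + 1
  data.frame(round_id = round_id, hash = "placeholder")
}

cached_hash_tbl <- memoise_by_key(slow_hash_tbl)

first <- cached_hash_tbl("2024-01-01")   # computed
second <- cached_hash_tbl("2024-01-01")  # served from the cache
```

After both calls, calls is still 1, which is also why testing can get awkward: a cache that outlives a single test can mask changes to inputs that are not part of the key.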