hubverse-org / hubValidations

Testing framework for hubverse hub validations
https://hubverse-org.github.io/hubValidations/
Other
1 stars 4 forks source link

Document requirements for custom validation functions #121

Open zkamvar opened 4 days ago

zkamvar commented 4 days ago

I was testing out a potential custom validation function for https://github.com/reichlab/variant-nowcast-hub using the vignette in https://hubverse-org.github.io/hubValidations/articles/custom-functions.html#managing-dependencies-of-custom-sourced-functions

I am really glad to report that once I figured out the nuances of actually calling the new function, adding a custom function seems to be pretty plug-and-play!

Here were the steps (with missing pieces labelled):

  1. create a validations.yml file (NOTE: this can not be validations.yaml unless you want to manually modify your GitHub workflow)
  2. (implied) create a script in src/ that contains functions for validation, which contains the arguments specified in the default configuration (even if you don't use them)
  3. (missing) ensure that your function evaluates to TRUE or FALSE and then wrap the result in hubValidations::capture_check_cnd()
  4. test locally by running hubValidations::validate_submission() (missing: in the same directory of the hub)

I tested this with the simple hub and the functions I created were:

default:
    validate_model_data:
      hello:
        fn: "cromulent"
        source: "src/sum-to-one.R"
      sum1:
        fn: "sum_to_one"
        source: "src/sum-to-one.R"
sum_to_one <- function(tbl = NULL, file_path, ...) {
  res <- isTRUE(all.equal(sum(tbl$value), 1, tol = 1e-5))
  hubValidations::capture_check_cnd(res,
    file_path = file_path,
    msg_subject = "b",
    msg_attribute = "c"
  )
}

cromulent <- function(tbl = NULL, file_path, ...) {
  hubValidations::capture_check_cnd(!is.null(tbl),
    file_path = file_path,
    msg_subject = "b",
    msg_attribute = "c"
  )
}

I would be happy to tackle this and update the vignette.

annakrystalli commented 3 days ago

💯🚀

Yes! This is definitely a gap in the docs I've not had time to tackle! Would be great if you wanted to work on this.

I was envisioning an "Anatomy of a custom validation function" vignette that includes a discussion of how to use capture_check_cnd() and capture_check_info(). I also was thinking of building functionality for creating template custom validation functions (a bit like snippets / usethis::use_r()) to start users off down the right path. Something like use_custom_check()?

A few general suggestions:

zkamvar commented 3 days ago

THANK YOU! The naming conventions are definitely important to highlight!

Given that the impetus for this is https://github.com/reichlab/variant-nowcast-hub/issues/55, and it needs to be done relatively soon, I think we should restrict the guidance for the moment to only using packages that hubValidations depends on (e.g. dplyr). Dependency resolution on CI is a huge can of worms (especially if {renv} is involved).

Examples

I wonder if going through the anatomy of a function like check_tbl_value_col_ascending() would be a good exercise for an example? Looking at that gives me a good idea of how to construct the validations.

Where the functions live 🏠

I still feel src is not the right place for R functions, more for workflow scripts. These should be placed in the R directory which with a bit more tweaking (i.e. including a DESCRIPTION file) would allow for testing functions as well as potentially managing additional dependencies required for custom functions (see https://github.com/hubverse-org/hubValidations/issues/22). While folks are free to use whatever location they want, I would have a very strong preference for an R/ directory to be used by default in examples and as the default location for scripts created by use_custom_check().

I can get behind that! (For my own reference, the initial proposal for the R/ folder is in https://github.com/orgs/hubverse-org/discussions/25#discussioncomment-10167652)

Could you clarify one thing: When you want validations to be in the R/ folder, how are you thinking about the possibility of including a DESCRIPTION file and tests?

Are you thinking a structure like (A) where DESCRIPTION would live in the root of the hub:

ROOT/
├─R/
│ └─check-thing.R
├─DESCRIPTION
├─tests/
│ └─testthat/
│   └─test-check-thing.R
└─ hub-config/

or like (B) where DESCRIPTION lives inside the R/ directory:

ROOT/
├─R/
│ ├─R/
│ │ └─check-thing.R
│ ├─DESCRIPTION
│ └─tests/
│   └─testthat/
│     └─test-check-thing.R
└─ hub-config/

In either case, would you be opposed to having validation scripts live in src/validations/R/ by default (C)?

ROOT/
├─src/
│ └─validations/
│   ├─R/
│   │ └─check-thing.R
│   ├─DESCRIPTION
│   └─tests/
│     └─testthat/
│       └─test-check-thing.R
└─ hub-config/

This way, it isolates the validation code from the workflow scripts AND allows a package to be built around it (e.g. src/validations/DESCRIPTION and src/validations/tests/testthat/)

annakrystalli commented 2 days ago

Ooooh, interesting. I feel your suggestion of storing any functional R code neatly away in it's own directory is the neatest option.

Given the src directory is something we seem to be committed to as best practice, I think I'm going to vote for:

ROOT/
├─src/
│ └─validations/
│   ├─R/
│   │ └─check-thing.R
│   ├─DESCRIPTION
│   └─tests/
│     └─testthat/
│       └─test-check-thing.R
└─ hub-config/

as the default structure.

I think this would also provide some option for installing additional dependencies required by validation functions by adding them as imports to src/validations/DESCRIPTION using pak::local_install("src/validations") in the CI and if using renv via the package cellar. Obviously the renv approach is more of a mission but only affects anyone that does need additional packages AND is using renv. While it might be rare, it is something that eventually needs supporting.

In any case, I think the use of a DESCRIPTION file and pak::local_install("src/validations") as an initial recommendation does not seem to much to discuss (although I'll do it separately as part of resolving #22)

Also, random question which may also need to be touched upon here, what are your thoughts on #20 ?