hubverse-org / hubValidations

Testing framework for hubverse hub validations
https://hubverse-org.github.io/hubValidations/

Columns where values are dependent on the value of other columns cause problems in value combination validation. #38

Open annakrystalli opened 9 months ago

annakrystalli commented 9 months ago

The problem was uncovered while trying to validate https://github.com/annakrystalli/FluSight-forecast-hub/blob/test-pr/model-output/UMass-trends_ensemble/2023-10-14-UMass-trends_ensemble.csv

Our expand grid functions are not aware that the value of target_end_date depends on the values of reference_date and horizon. They treat the values of each task ID as independent and produce the following erroneous grid of valid value combinations, in which the values of target_end_date are clearly inconsistent with horizon.

#> # A tibble: 10 × 7
#>    reference_date target            horizon location target_end_date output_type
#>    <chr>          <chr>             <chr>   <chr>    <chr>           <chr>      
#>  1 2023-10-14     wk ahead inc flu… 1       US       2023-10-21      quantile   
#>  2 2023-10-14     wk ahead inc flu… 2       US       2023-10-21      quantile   
#>  3 2023-10-14     wk ahead inc flu… 3       US       2023-10-21      quantile   
#>  4 2023-10-14     wk ahead inc flu… 4       US       2023-10-21      quantile   
#>  5 2023-10-14     wk ahead inc flu… 1       01       2023-10-21      quantile   
#>  6 2023-10-14     wk ahead inc flu… 2       01       2023-10-21      quantile   
#>  7 2023-10-14     wk ahead inc flu… 3       01       2023-10-21      quantile   
#>  8 2023-10-14     wk ahead inc flu… 4       01       2023-10-21      quantile   
#>  9 2023-10-14     wk ahead inc flu… 1       02       2023-10-21      quantile   
#> 10 2023-10-14     wk ahead inc flu… 2       02       2023-10-21      quantile   
#> # ℹ 1 more variable: output_type_id <chr>

Created on 2023-09-29 with reprex v2.0.2
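
To illustrate the mechanism, here is a minimal base R sketch of what happens when task ID values are expanded independently. It does not use the hubUtils expand grid functions themselves. The weekly rule `target_end_date = reference_date + 7 * horizon` is an assumption inferred from the dates in this issue, and the object names are made up for the example.

```r
# Illustrative subset of task ID values, taken from the grid above.
reference_date <- as.Date("2023-10-14")
horizon        <- 1:4
location       <- c("US", "01", "02")

# ASSUMPTION: weekly horizons, so each horizon has exactly one valid date.
target_end_date <- reference_date + 7 * horizon

# Naive expansion: every column is treated as independent of the others.
naive_grid <- expand.grid(
  reference_date   = reference_date,
  horizon          = horizon,
  location         = location,
  target_end_date  = target_end_date,
  stringsAsFactors = FALSE
)

# Dependency-aware filter: keep only rows where target_end_date agrees
# with reference_date and horizon.
is_consistent <- naive_grid$target_end_date ==
  naive_grid$reference_date + 7 * naive_grid$horizon
consistent_grid <- naive_grid[is_consistent, ]

nrow(naive_grid)       # 4 horizons x 3 locations x 4 dates
nrow(consistent_grid)  # 4 horizons x 3 locations
```

Three quarters of the rows in the naive grid pair a horizon with a date that cannot occur for that horizon, which is the kind of row shown in the tibble above.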

This has 2 implications:

1) It can validate erroneous combinations of values (e.g. a row with reference_date: 2023-10-14, horizon: 2 and target_end_date: 2023-10-21). This can be mitigated by the additional optional test that checks the values of target_end_date with respect to horizon and reference_date.
2) MORE IMPORTANTLY: It can cause erroneous failures in required values checks. For example, here's a subset of the missing values the check for required values erroneously returns.

#> # A tibble: 10 × 7
#>    reference_date horizon target            location target_end_date output_type
#>    <date>           <int> <chr>             <chr>    <date>          <chr>      
#>  1 2023-10-14           1 wk ahead inc flu… 01       2023-10-28      quantile   
#>  2 2023-10-14           1 wk ahead inc flu… 01       2023-10-28      quantile   
#>  3 2023-10-14           1 wk ahead inc flu… 01       2023-10-28      quantile   
#>  4 2023-10-14           1 wk ahead inc flu… 01       2023-10-28      quantile   
#>  5 2023-10-14           1 wk ahead inc flu… 01       2023-10-28      quantile   
#>  6 2023-10-14           1 wk ahead inc flu… 01       2023-10-28      quantile   
#>  7 2023-10-14           1 wk ahead inc flu… 01       2023-10-28      quantile   
#>  8 2023-10-14           1 wk ahead inc flu… 01       2023-10-28      quantile   
#>  9 2023-10-14           1 wk ahead inc flu… 01       2023-10-28      quantile   
#> 10 2023-10-14           1 wk ahead inc flu… 01       2023-10-28      quantile   
#> # ℹ 1 more variable: output_type_id <chr>

Created on 2023-09-29 with reprex v2.0.2

What's happening is that some quantile values are being submitted for optional horizon 2 as well as required horizon 1. The horizon 2 rows have a different target_end_date value (2023-10-28, rather than 2023-10-21 for horizon 1). Note as well that all values of target_end_date are configured as optional. The check therefore detects that data has been supplied for optional target_end_date value 2023-10-28 with optional horizon value 2, but not for optional target_end_date value 2023-10-28 with required horizon value 1. It throws an error even though the combination of target_end_date 2023-10-28 and required horizon value 1 is invalid for a reference_date of 2023-10-14.
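
The false failure can be reproduced in a few lines of base R. This is a simplified sketch of the logic described above, not the actual hubValidations check. `find_missing()` is a hypothetical helper, and the weekly date rule is again an assumption.

```r
reference_date <- as.Date("2023-10-14")

# Submitted task ID combinations: required horizon 1 plus optional horizon 2.
submitted <- data.frame(
  horizon         = c(1L, 2L),
  target_end_date = reference_date + 7 * c(1L, 2L)
)

required_horizon <- 1L  # horizon 1 is required, horizon 2 is optional

# Independent-columns logic: pair every submitted (optional) target_end_date
# with every required horizon value.
expected <- expand.grid(
  horizon         = required_horizon,
  target_end_date = unique(submitted$target_end_date)
)

# Hypothetical helper: rows of `expected` that are absent from `submitted`.
find_missing <- function(expected, submitted) {
  key <- function(df) paste(df$horizon, df$target_end_date)
  expected[!key(expected) %in% key(submitted), , drop = FALSE]
}

# Reports horizon 1 with target_end_date 2023-10-28 as missing.
naive_missing <- find_missing(expected, submitted)

# Dependency-aware variant: drop expected rows that can never be valid.
# ASSUMPTION: target_end_date = reference_date + 7 * horizon.
valid <- expected$target_end_date == reference_date + 7 * expected$horizon
aware_missing <- find_missing(expected[valid, , drop = FALSE], submitted)
```

With the independent expansion, `naive_missing` contains the impossible combination and the check fails. Once impossible rows are removed from the expectation, `aware_missing` is empty.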

Overall, this is caused by the logic discussed in https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/issues/17 and encapsulated by the required-values-only model output submission template below. It arises from us not being able to define relationships between values in different columns.

> hubUtils::create_model_out_submit_tmpl(con,
+                                        round_id = "2023-10-14",
+                                        required_vals_only = TRUE,
+                                        complete_cases_only = FALSE) |> dput()
#> ! Columns "target", "location", and "target_end_date" whose values are all optional included as all `NA`
#>   columns.
#> ! Round contains more than one modeling task (2)
#> ℹ See Hub's tasks.json file or <hub_connection> attribute "config_tasks" for details of optional task
#>   ID/output_type/output_type ID value combinations.
#> # A tibble: 28 × 8
#>    reference_date target horizon location target_end_date output_type
#>    <date>         <chr>    <int> <chr>    <date>          <chr>      
#>  1 2023-10-14     <NA>        NA <NA>     NA              pmf        
#>  2 2023-10-14     <NA>        NA <NA>     NA              pmf        
#>  3 2023-10-14     <NA>        NA <NA>     NA              pmf        
#>  4 2023-10-14     <NA>        NA <NA>     NA              pmf        
#>  5 2023-10-14     <NA>        NA <NA>     NA              pmf        
#>  6 2023-10-14     <NA>         1 <NA>     NA              quantile   
#>  7 2023-10-14     <NA>         1 <NA>     NA              quantile   
#>  8 2023-10-14     <NA>         1 <NA>     NA              quantile   
#>  9 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 10 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 11 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 12 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 13 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 14 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 15 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 16 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 17 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 18 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 19 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 20 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 21 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 22 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 23 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 24 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 25 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 26 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 27 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> 28 2023-10-14     <NA>         1 <NA>     NA              quantile   
#> # ℹ 2 more variables: output_type_id <chr>, value <dbl>

Created on 2023-09-29 with reprex v2.0.2

This is a tricky issue and I'm not 100% sure how to proceed. The easiest way I can think of is to be able to ignore task IDs in certain situations like these. It will likely need some time to fix, though, as it likely requires work on complex functions across hubUtils and hubValidations. Keen to hear thoughts!
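
As a rough idea of what ignoring task IDs could look like, here is a base R sketch. `drop_task_ids()` and its `derived_task_ids` argument are hypothetical and do not exist in hubUtils or hubValidations. The sketch also assumes the weekly date rule seen in the examples above.

```r
# Hypothetical helper: remove derived task ID columns before a required
# values check, then de-duplicate the remaining combinations.
drop_task_ids <- function(df, derived_task_ids) {
  unique(df[, setdiff(names(df), derived_task_ids), drop = FALSE])
}

reference_date <- as.Date("2023-10-14")

# Same submission as in the issue: horizons 1 and 2 with their dates.
submitted <- data.frame(
  horizon         = c(1L, 2L),
  target_end_date = reference_date + 7 * c(1L, 2L)
)
required <- data.frame(horizon = 1L)

# Compare only on the task IDs that are not derived from other columns.
submitted_core   <- drop_task_ids(submitted, "target_end_date")
missing_required <- setdiff(required$horizon, submitted_core$horizon)

length(missing_required)  # no required horizon values are missing
```

The trade-off is that the ignored column is no longer covered by the required values check at all, so its consistency would have to rely on the separate optional target_end_date check.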