hubverse-org / hubValidations

Testing framework for hubverse hub validations
https://hubverse-org.github.io/hubValidations/

Help interpreting required value validation of rows with multiple optional values. #17

Open annakrystalli opened 1 year ago

annakrystalli commented 1 year ago

In working on validating required values, I've come upon a question that needs clarification.

I have the following file: https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/blob/mvp-1/inst/testhubs/simple/model-output/team1-goodmodel/2022-10-08-team1-goodmodel.csv

hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
tbl <- hubValidations::read_model_out_file(file_path, hub_path)
print(tbl, n = 50)
#> # A tibble: 47 × 7
#>    origin_date target          horizon location output_type output_type_id value
#>    <date>      <chr>             <int> <chr>    <chr>                <dbl> <int>
#>  1 2022-10-08  wk inc flu hosp       1 US       quantile             0.01    135
#>  2 2022-10-08  wk inc flu hosp       1 US       quantile             0.025   137
#>  3 2022-10-08  wk inc flu hosp       1 US       quantile             0.05    139
#>  4 2022-10-08  wk inc flu hosp       1 US       quantile             0.1     140
#>  5 2022-10-08  wk inc flu hosp       1 US       quantile             0.15    141
#>  6 2022-10-08  wk inc flu hosp       1 US       quantile             0.2     141
#>  7 2022-10-08  wk inc flu hosp       1 US       quantile             0.25    142
#>  8 2022-10-08  wk inc flu hosp       1 US       quantile             0.3     143
#>  9 2022-10-08  wk inc flu hosp       1 US       quantile             0.35    144
#> 10 2022-10-08  wk inc flu hosp       1 US       quantile             0.4     145
#> 11 2022-10-08  wk inc flu hosp       1 US       quantile             0.45    147
#> 12 2022-10-08  wk inc flu hosp       1 US       quantile             0.5     148
#> 13 2022-10-08  wk inc flu hosp       1 US       quantile             0.55    149
#> 14 2022-10-08  wk inc flu hosp       1 US       quantile             0.6     150
#> 15 2022-10-08  wk inc flu hosp       1 US       quantile             0.65    152
#> 16 2022-10-08  wk inc flu hosp       1 US       quantile             0.7     155
#> 17 2022-10-08  wk inc flu hosp       1 US       quantile             0.75    161
#> 18 2022-10-08  wk inc flu hosp       1 US       quantile             0.8     165
#> 19 2022-10-08  wk inc flu hosp       1 US       quantile             0.85    170
#> 20 2022-10-08  wk inc flu hosp       1 US       quantile             0.9     175
#> 21 2022-10-08  wk inc flu hosp       1 US       quantile             0.95    176
#> 22 2022-10-08  wk inc flu hosp       1 US       quantile             0.975   176
#> 23 2022-10-08  wk inc flu hosp       1 US       quantile             0.99    205
#> 24 2022-10-08  wk inc flu hosp       1 02       quantile             0.01    123
#> 25 2022-10-08  wk inc flu hosp       1 02       quantile             0.025   130
#> 26 2022-10-08  wk inc flu hosp       1 02       quantile             0.05    135
#> 27 2022-10-08  wk inc flu hosp       1 02       quantile             0.1     141
#> 28 2022-10-08  wk inc flu hosp       1 02       quantile             0.15    146
#> 29 2022-10-08  wk inc flu hosp       1 02       quantile             0.2     149
#> 30 2022-10-08  wk inc flu hosp       1 02       quantile             0.25    152
#> 31 2022-10-08  wk inc flu hosp       1 02       quantile             0.3     154
#> 32 2022-10-08  wk inc flu hosp       1 02       quantile             0.35    157
#> 33 2022-10-08  wk inc flu hosp       1 02       quantile             0.4     159
#> 34 2022-10-08  wk inc flu hosp       1 02       quantile             0.45    161
#> 35 2022-10-08  wk inc flu hosp       1 02       quantile             0.5     163
#> 36 2022-10-08  wk inc flu hosp       1 02       quantile             0.55    165
#> 37 2022-10-08  wk inc flu hosp       1 02       quantile             0.6     168
#> 38 2022-10-08  wk inc flu hosp       1 02       quantile             0.65    170
#> 39 2022-10-08  wk inc flu hosp       1 02       quantile             0.7     172
#> 40 2022-10-08  wk inc flu hosp       1 02       quantile             0.75    175
#> 41 2022-10-08  wk inc flu hosp       1 02       quantile             0.8     178
#> 42 2022-10-08  wk inc flu hosp       1 02       quantile             0.85    181
#> 43 2022-10-08  wk inc flu hosp       1 02       quantile             0.9     185
#> 44 2022-10-08  wk inc flu hosp       1 02       quantile             0.95    191
#> 45 2022-10-08  wk inc flu hosp       1 02       quantile             0.975   197
#> 46 2022-10-08  wk inc flu hosp       1 02       quantile             0.99    203
#> 47 2022-10-08  wk inc flu hosp       1 02       mean                NA       173

Created on 2023-08-02 with reprex v2.0.2

The config expands to a grid of required values that spans all columns, and the file contains every row of that grid.

library(hubUtils)
hub_path <- system.file("testhubs/simple", package = "hubValidations")
expand_model_out_val_grid(
    read_config(hub_path, "tasks"),
    round_id = "2022-10-08",
    required_vals_only = TRUE
) %>%
    print(n = 25)
#> # A tibble: 23 × 6
#>    origin_date target          horizon location output_type output_type_id
#>    <date>      <chr>             <int> <chr>    <chr>                <dbl>
#>  1 2022-10-08  wk inc flu hosp       1 US       quantile             0.01 
#>  2 2022-10-08  wk inc flu hosp       1 US       quantile             0.025
#>  3 2022-10-08  wk inc flu hosp       1 US       quantile             0.05 
#>  4 2022-10-08  wk inc flu hosp       1 US       quantile             0.1  
#>  5 2022-10-08  wk inc flu hosp       1 US       quantile             0.15 
#>  6 2022-10-08  wk inc flu hosp       1 US       quantile             0.2  
#>  7 2022-10-08  wk inc flu hosp       1 US       quantile             0.25 
#>  8 2022-10-08  wk inc flu hosp       1 US       quantile             0.3  
#>  9 2022-10-08  wk inc flu hosp       1 US       quantile             0.35 
#> 10 2022-10-08  wk inc flu hosp       1 US       quantile             0.4  
#> 11 2022-10-08  wk inc flu hosp       1 US       quantile             0.45 
#> 12 2022-10-08  wk inc flu hosp       1 US       quantile             0.5  
#> 13 2022-10-08  wk inc flu hosp       1 US       quantile             0.55 
#> 14 2022-10-08  wk inc flu hosp       1 US       quantile             0.6  
#> 15 2022-10-08  wk inc flu hosp       1 US       quantile             0.65 
#> 16 2022-10-08  wk inc flu hosp       1 US       quantile             0.7  
#> 17 2022-10-08  wk inc flu hosp       1 US       quantile             0.75 
#> 18 2022-10-08  wk inc flu hosp       1 US       quantile             0.8  
#> 19 2022-10-08  wk inc flu hosp       1 US       quantile             0.85 
#> 20 2022-10-08  wk inc flu hosp       1 US       quantile             0.9  
#> 21 2022-10-08  wk inc flu hosp       1 US       quantile             0.95 
#> 22 2022-10-08  wk inc flu hosp       1 US       quantile             0.975
#> 23 2022-10-08  wk inc flu hosp       1 US       quantile             0.99

Created on 2023-08-02 with reprex v2.0.2
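
One way to confirm that a submission covers a required grid like this is an anti-join style comparison. The following is a minimal base R sketch on made-up toy data; `missing_required()` is a hypothetical helper for illustration, not a hubValidations or hubUtils function.

```r
# Hypothetical helper: return the rows of `required` that are absent from `tbl`,
# comparing only the columns that `required` defines.
missing_required <- function(tbl, required) {
  cols <- names(required)
  # Build one text key per row so rows can be compared with %in%.
  key <- function(df) {
    do.call(paste, c(unname(lapply(df[cols], as.character)), sep = "\r"))
  }
  required[!key(required) %in% key(tbl), , drop = FALSE]
}

# Toy stand-in for a required grid (not the real hub config).
required <- expand.grid(
  location = "US",
  output_type = "quantile",
  output_type_id = c(0.25, 0.5, 0.75),
  stringsAsFactors = FALSE
)

# Toy submission that omits the 0.75 quantile for "US".
submission <- data.frame(
  location = c("US", "US", "02"),
  output_type = "quantile",
  output_type_id = c(0.25, 0.5, 0.5),
  value = c(142, 148, 163),
  stringsAsFactors = FALSE
)

# Reports the single required row ("US", quantile, 0.75) that is missing.
missing_required(submission, required)
```

An empty result would mean every required row is present, which is the situation described for the test file above.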

What I'm having trouble with is the validation of the single mean row: https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/blob/mvp-1/inst/testhubs/simple/model-output/team1-goodmodel/2022-10-08-team1-goodmodel.csv#L48

hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
tbl <- hubValidations::read_model_out_file(file_path, hub_path)
tail(tbl, 1)
#> # A tibble: 1 × 7
#>   origin_date target          horizon location output_type output_type_id value
#>   <date>      <chr>             <int> <chr>    <chr>                <dbl> <int>
#> 1 2022-10-08  wk inc flu hosp       1 02       mean                    NA   173

Created on 2023-08-02 with reprex v2.0.2

The mean output is optional: https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/blob/62ecf49fe4fb803163e43d387dae6b8567dd3c98/inst/testhubs/simple/hub-config/tasks.json#L79-L84

as is location "02": https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/blob/62ecf49fe4fb803163e43d387dae6b8567dd3c98/inst/testhubs/simple/hub-config/tasks.json#L20-L24

Currently, the mean row passes check_tbl_values_required() validation. It somehow feels like it shouldn't, but I'm not 100% sure how the config should be interpreted when a row combines multiple optional values.

My instinct tells me that if a mean is provided for "02" location, then it should also be provided for "US" which is required.

Having said that, I also considered a situation where we have both optional and required quantile type IDs, and a team submits all required quantile type IDs plus some optional ones for an optional location, but does not submit those optional ones for the required location "US". In this instance, failing the validation somehow feels too strict.

Any insight would be greatly appreciated @elray1 & @LucieContamin !

Relatedly, do we treat task IDs and output types separately when thinking about this issue?

elray1 commented 1 year ago

Good questions! I don't think we've ever been very clear about this. Jumping to your last question first: although we haven't said this before, after thinking this example through I think maybe we should treat task IDs and output types separately when thinking about this issue.

Re "example 0": 'if a mean is provided for "02" location, then it should also be provided for "US" which is required.':

On the other hand, I do think there are similar situations where this kind of "expanding requirement" should show up.

I think the basic difference between example 0 and examples 1/2/3 is whether or not we were thinking about an optional output_type_id as a "basis" for expanding out to required values of task id variables or output_type_ids. Example 0 implies that we do not want to do this: if an optional output_type_id value is provided for one location, we do not want to make it required for other locations. On the other hand, examples 1/2/3 imply that if an optional task id variable value is included, we do want to be sure we collect any combinations of that task id value with the required values for other task ids and/or output types.

So, here's a proposal for how we could set this up. Suppose there are one or more "group A" task id variables with at least some optional values, and one or more other "group B" task id variables or output types with some values or output_type_ids that are required. If a submission includes a row that has some combination of optional values for group A columns, it must also include all rows that have that same combination of values for the group A columns and all possible combinations of required values for the group B columns/output_type_ids.
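
To make the proposal concrete, here is a rough base R sketch of the check on toy data. Everything in it is an assumption for illustration: `missing_expanded_rows()`, its arguments, and the toy values are hypothetical and not part of hubValidations. It takes the group A combinations found in the submission, crosses them with every required group B combination, and reports expected rows that are absent.

```r
# Hypothetical helper implementing the proposed rule.
# tbl:        submission as a data frame
# group_a:    names of task id columns that have optional values
# required_b: data frame of every required combination of the group B columns
missing_expanded_rows <- function(tbl, group_a, required_b) {
  # Distinct group A combinations that actually appear in the submission.
  a_combos <- unique(tbl[, group_a, drop = FALSE])
  # Cartesian product: each submitted group A combination with each required
  # group B combination (merge() with by = NULL is a cross join).
  expected <- merge(a_combos, required_b, by = NULL)
  cols <- names(expected)
  key <- function(df) {
    do.call(paste, c(unname(lapply(df[cols], as.character)), sep = "\r"))
  }
  expected[!key(expected) %in% key(tbl), , drop = FALSE]
}

# Toy group B requirements: two required targets, two required quantile levels.
required_b <- expand.grid(
  target = c("inc death", "inc hosp"),
  output_type = "quantile",
  output_type_id = c(0.25, 0.75),
  stringsAsFactors = FALSE
)

submission <- rbind(
  # Required location "US": complete for both targets.
  expand.grid(location = "US", target = c("inc death", "inc hosp"),
              output_type = "quantile", output_type_id = c(0.25, 0.75),
              stringsAsFactors = FALSE),
  # Optional location "02": quantiles for one target only.
  expand.grid(location = "02", target = "inc hosp",
              output_type = "quantile", output_type_id = c(0.25, 0.75),
              stringsAsFactors = FALSE),
  # Optional mean for "02" only; "mean" is not in required_b.
  data.frame(location = "02", target = "inc hosp",
             output_type = "mean", output_type_id = NA_real_,
             stringsAsFactors = FALSE)
)

# Reports the two missing "02" x "inc death" quantile rows.
missing_expanded_rows(submission, group_a = "location", required_b = required_b)
```

Because the toy `required_b` contains no required mean, the lone mean row for "02" adds nothing to the expected grid, which is consistent with the reading of example 0: an optional output type supplied for one location does not become required for others.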

LucieContamin commented 1 year ago

I agree with @elray1 on "example 0": "it's OK if the submission includes a mean for location "02" but not "US"".

For the other examples: If I understand correctly, in this case we have in the tasks.json:

"location": {
  "required": ["US"],
  "optional": [
    "01",
    "02",
    ..]
},
"target": {
  "required": ["inc death", "inc hosp"],
  "optional": null
},
"horizon": {
  "required": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
  "optional": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
}

and for example:

"output_type": {
  "mean": {
    "output_type_id": {
      "required": null,
      "optional": ["NA"]
    },
    "value": {
      "type": "double",
      "minimum": 0
    }
  },
  "quantile": {
    "output_type_id": {
      "required": [0.01, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.975, 0.99],
      "optional": null
    },
    "value": {
      "type": "double",
      "minimum": 0
    }
  }
}

So, I agree that in this case, examples 1, 2 and 3 should fail (or at least return a warning for example 3). In this situation I read the tasks.json information as:

My understanding is also that the validation works per "model_tasks" group. So if you want different behavior, you have to split the round information into multiple "tasks_id". For example, in the same round:

Does that make sense and answer the question?

annakrystalli commented 1 year ago

Thank you both!! This has been really helpful...now on to try and come up with as elegant an implementation as possible 😜

Will get back to you if I have follow up questions!