Help interpreting required value validation of rows with multiple optional values.

annakrystalli commented 1 year ago

In working on validating required values, I've come upon a question that needs clarification.

I have the following file: https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/blob/mvp-1/inst/testhubs/simple/model-output/team1-goodmodel/2022-10-08-team1-goodmodel.csv

hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
tbl <- hubValidations::read_model_out_file(file_path, hub_path)
print(tbl, n = 50)
#> # A tibble: 47 × 7
#>    origin_date target          horizon location output_type output_type_id value
#>    <date>      <chr>             <int> <chr>    <chr>                <dbl> <int>
#>  1 2022-10-08  wk inc flu hosp       1 US       quantile             0.01    135
#>  2 2022-10-08  wk inc flu hosp       1 US       quantile             0.025   137
#>  3 2022-10-08  wk inc flu hosp       1 US       quantile             0.05    139
#>  4 2022-10-08  wk inc flu hosp       1 US       quantile             0.1     140
#>  5 2022-10-08  wk inc flu hosp       1 US       quantile             0.15    141
#>  6 2022-10-08  wk inc flu hosp       1 US       quantile             0.2     141
#>  7 2022-10-08  wk inc flu hosp       1 US       quantile             0.25    142
#>  8 2022-10-08  wk inc flu hosp       1 US       quantile             0.3     143
#>  9 2022-10-08  wk inc flu hosp       1 US       quantile             0.35    144
#> 10 2022-10-08  wk inc flu hosp       1 US       quantile             0.4     145
#> 11 2022-10-08  wk inc flu hosp       1 US       quantile             0.45    147
#> 12 2022-10-08  wk inc flu hosp       1 US       quantile             0.5     148
#> 13 2022-10-08  wk inc flu hosp       1 US       quantile             0.55    149
#> 14 2022-10-08  wk inc flu hosp       1 US       quantile             0.6     150
#> 15 2022-10-08  wk inc flu hosp       1 US       quantile             0.65    152
#> 16 2022-10-08  wk inc flu hosp       1 US       quantile             0.7     155
#> 17 2022-10-08  wk inc flu hosp       1 US       quantile             0.75    161
#> 18 2022-10-08  wk inc flu hosp       1 US       quantile             0.8     165
#> 19 2022-10-08  wk inc flu hosp       1 US       quantile             0.85    170
#> 20 2022-10-08  wk inc flu hosp       1 US       quantile             0.9     175
#> 21 2022-10-08  wk inc flu hosp       1 US       quantile             0.95    176
#> 22 2022-10-08  wk inc flu hosp       1 US       quantile             0.975   176
#> 23 2022-10-08  wk inc flu hosp       1 US       quantile             0.99    205
#> 24 2022-10-08  wk inc flu hosp       1 02       quantile             0.01    123
#> 25 2022-10-08  wk inc flu hosp       1 02       quantile             0.025   130
#> 26 2022-10-08  wk inc flu hosp       1 02       quantile             0.05    135
#> 27 2022-10-08  wk inc flu hosp       1 02       quantile             0.1     141
#> 28 2022-10-08  wk inc flu hosp       1 02       quantile             0.15    146
#> 29 2022-10-08  wk inc flu hosp       1 02       quantile             0.2     149
#> 30 2022-10-08  wk inc flu hosp       1 02       quantile             0.25    152
#> 31 2022-10-08  wk inc flu hosp       1 02       quantile             0.3     154
#> 32 2022-10-08  wk inc flu hosp       1 02       quantile             0.35    157
#> 33 2022-10-08  wk inc flu hosp       1 02       quantile             0.4     159
#> 34 2022-10-08  wk inc flu hosp       1 02       quantile             0.45    161
#> 35 2022-10-08  wk inc flu hosp       1 02       quantile             0.5     163
#> 36 2022-10-08  wk inc flu hosp       1 02       quantile             0.55    165
#> 37 2022-10-08  wk inc flu hosp       1 02       quantile             0.6     168
#> 38 2022-10-08  wk inc flu hosp       1 02       quantile             0.65    170
#> 39 2022-10-08  wk inc flu hosp       1 02       quantile             0.7     172
#> 40 2022-10-08  wk inc flu hosp       1 02       quantile             0.75    175
#> 41 2022-10-08  wk inc flu hosp       1 02       quantile             0.8     178
#> 42 2022-10-08  wk inc flu hosp       1 02       quantile             0.85    181
#> 43 2022-10-08  wk inc flu hosp       1 02       quantile             0.9     185
#> 44 2022-10-08  wk inc flu hosp       1 02       quantile             0.95    191
#> 45 2022-10-08  wk inc flu hosp       1 02       quantile             0.975   197
#> 46 2022-10-08  wk inc flu hosp       1 02       quantile             0.99    203
#> 47 2022-10-08  wk inc flu hosp       1 02       mean                NA       173

^{Created on 2023-08-02 with reprex v2.0.2}

The config produces an expanded grid of required values that spans all columns, and the file contains every row of that grid.

library(hubUtils)
hub_path <- system.file("testhubs/simple", package = "hubValidations")
expand_model_out_val_grid(
    read_config(hub_path, "tasks"),
    round_id = "2022-10-08",
    required_vals_only = TRUE
) %>%
    print(n = 25)
#> # A tibble: 23 × 6
#>    origin_date target          horizon location output_type output_type_id
#>    <date>      <chr>             <int> <chr>    <chr>                <dbl>
#>  1 2022-10-08  wk inc flu hosp       1 US       quantile             0.01 
#>  2 2022-10-08  wk inc flu hosp       1 US       quantile             0.025
#>  3 2022-10-08  wk inc flu hosp       1 US       quantile             0.05 
#>  4 2022-10-08  wk inc flu hosp       1 US       quantile             0.1  
#>  5 2022-10-08  wk inc flu hosp       1 US       quantile             0.15 
#>  6 2022-10-08  wk inc flu hosp       1 US       quantile             0.2  
#>  7 2022-10-08  wk inc flu hosp       1 US       quantile             0.25 
#>  8 2022-10-08  wk inc flu hosp       1 US       quantile             0.3  
#>  9 2022-10-08  wk inc flu hosp       1 US       quantile             0.35 
#> 10 2022-10-08  wk inc flu hosp       1 US       quantile             0.4  
#> 11 2022-10-08  wk inc flu hosp       1 US       quantile             0.45 
#> 12 2022-10-08  wk inc flu hosp       1 US       quantile             0.5  
#> 13 2022-10-08  wk inc flu hosp       1 US       quantile             0.55 
#> 14 2022-10-08  wk inc flu hosp       1 US       quantile             0.6  
#> 15 2022-10-08  wk inc flu hosp       1 US       quantile             0.65 
#> 16 2022-10-08  wk inc flu hosp       1 US       quantile             0.7  
#> 17 2022-10-08  wk inc flu hosp       1 US       quantile             0.75 
#> 18 2022-10-08  wk inc flu hosp       1 US       quantile             0.8  
#> 19 2022-10-08  wk inc flu hosp       1 US       quantile             0.85 
#> 20 2022-10-08  wk inc flu hosp       1 US       quantile             0.9  
#> 21 2022-10-08  wk inc flu hosp       1 US       quantile             0.95 
#> 22 2022-10-08  wk inc flu hosp       1 US       quantile             0.975
#> 23 2022-10-08  wk inc flu hosp       1 US       quantile             0.99

^{Created on 2023-08-02 with reprex v2.0.2}
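
As a rough illustration of what "contains the full grid" means here, the sketch below checks a submission against a required grid using base R only. The data are toy stand-ins (three quantile levels instead of 23), and `row_key()` is a hypothetical helper, not a hubValidations function.

```r
# Toy required grid, mirroring the column layout shown above (assumed values).
required <- expand.grid(
  location = "US",
  horizon = 1L,
  output_type = "quantile",
  output_type_id = c(0.25, 0.5, 0.75),
  stringsAsFactors = FALSE
)

# Toy submission: every required row plus one optional mean row for "02".
submission <- rbind(
  required,
  data.frame(location = "02", horizon = 1L, output_type = "mean",
             output_type_id = NA_real_, stringsAsFactors = FALSE)
)

# Hypothetical helper: collapse the chosen columns into one string key per row.
row_key <- function(df, cols) do.call(paste, c(unname(as.list(df[cols])), sep = "|"))

# Required rows with no matching row in the submission.
cols <- names(required)
missing_required <- required[!row_key(required, cols) %in% row_key(submission, cols), ]
nrow(missing_required)
```

A zero-row result means every required row is present. The optional mean row is simply ignored by this check, which is why it passes.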

What I'm having trouble with is the validation of the single mean row: https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/blob/mvp-1/inst/testhubs/simple/model-output/team1-goodmodel/2022-10-08-team1-goodmodel.csv#L48

hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
tbl <- hubValidations::read_model_out_file(file_path, hub_path)
tail(tbl, 1)
#> # A tibble: 1 × 7
#>   origin_date target          horizon location output_type output_type_id value
#>   <date>      <chr>             <int> <chr>    <chr>                <dbl> <int>
#> 1 2022-10-08  wk inc flu hosp       1 02       mean                    NA   173

^{Created on 2023-08-02 with reprex v2.0.2}

The mean output is optional: https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/blob/62ecf49fe4fb803163e43d387dae6b8567dd3c98/inst/testhubs/simple/hub-config/tasks.json#L79-L84

as is location "02": https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/blob/62ecf49fe4fb803163e43d387dae6b8567dd3c98/inst/testhubs/simple/hub-config/tasks.json#L20-L24

Currently, the mean row passes check_tbl_values_required() validation. It somehow feels like it shouldn't, but I'm not 100% sure how the config should be interpreted when a row combines multiple optional values.

My instinct tells me that if a mean is provided for "02" location, then it should also be provided for "US" which is required.

Having said that, I also considered a situation where we had both optional and required quantile output type IDs, and a team submitted some values for optional quantile output type IDs (as well as all required ones) for an optional location but not for the required location "US". In this instance, failing the validation somehow feels too strict.

Any insight would be greatly appreciated @elray1 & @LucieContamin !

Relatedly, do we treat task IDs and output types separately when thinking about this issue?

elray1 commented 1 year ago

Good questions! I don't think we've ever been very clear about this. Jumping to your last question first: although we haven't said this before, after thinking this example through I think maybe we should treat task IDs and output types separately when thinking about this issue.

Re "example 0": 'if a mean is provided for "02" location, then it should also be provided for "US" which is required.':

I think that with this hub's config file, the hub's intention is to say, "We require quantile forecasts for the US at horizon 1. Other locations/horizons and mean predictions are optional, but if you provide any forecast for a given location/horizon we want all quantiles. And, we want to be sure we collect at least horizon 1 for any provided locations and at least the US for any provided horizons." Or at least, I think this is the kind of requirement that hubs will want to be able to specify, so my vote is that we should support it. I want to end up with an answer that it's OK if the submission includes a mean for location "02" but not "US".

On the other hand, I do think there are similar situations where this kind of "expanding requirement" should show up.

example 1: if a submission includes a quantile prediction for location "02" at quantile level 0.500, but not at the other required quantile levels, this should fail validations because the hub wants to collect all quantiles.
example 2: similarly, if a submission includes a mean for location "02", but is missing any of the required quantile levels, this should fail validations because the hub wants to collect all quantiles.
example 3: if a submission includes a mean prediction for location "02" at horizon 2 but not horizon 1, this should fail validations because the hub requires horizon 1 (even though location "02" is optional and mean predictions are optional).

I think the basic difference between example 0 and examples 1/2/3 is whether or not we were thinking about an optional output_type_id as a "basis" for expanding out to required values of task id variables or output_type_ids. Example 0 implies that we do not want to do this: if an optional output_type_id value is provided for one location, we do not want to make it required for other locations. On the other hand, examples 1/2/3 imply that if an optional task id variable value is included, we do want to be sure we collect any combinations of that task id value with the required values for other task ids and/or output types.

So, here's a proposal for how we could set this up. Suppose there are one or more "group A" task id variables with at least some optional values, and one or more other "group B" task id variables or output types with some values or output_type_ids that are required. If a submission includes a row that has some combination of optional values for group A columns, it must also include all rows that have that same combination of values for the group A columns and all possible combinations of required values for the group B columns/output_type_ids.
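
Here is a minimal sketch of that proposal in base R, under stated assumptions: a toy config with three quantile levels, `location` and `horizon` as the only task ID columns, and hypothetical helper names (`find_missing_required()`, `quantiles_for()`, `mean_for()`) that are not part of hubValidations. For each submitted task ID combination, each column may take either the submitted value or any required value, and every resulting combination must carry all required output type IDs.

```r
# Assumed toy config: required task ID values and required output type IDs.
required_task_vals <- list(location = "US", horizon = 1L)
required_outputs <- data.frame(output_type = "quantile",
                               output_type_id = c(0.25, 0.5, 0.75),
                               stringsAsFactors = FALSE)
all_cols <- c(names(required_task_vals), names(required_outputs))

# One string key per row, used to compare expected and submitted rows.
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "|"))

find_missing_required <- function(tbl) {
  task_cols <- names(required_task_vals)
  submitted <- unique(tbl[task_cols])
  expected <- do.call(rbind, lapply(seq_len(nrow(submitted)), function(i) {
    # Each column ranges over the submitted value plus its required values.
    vals <- lapply(task_cols, function(col) {
      union(submitted[[col]][i], required_task_vals[[col]])
    })
    names(vals) <- task_cols
    tasks <- expand.grid(vals, stringsAsFactors = FALSE)
    merge(tasks, required_outputs, by = NULL)  # cross join with required outputs
  }))
  expected <- unique(expected)
  expected[!row_key(expected[all_cols]) %in% row_key(tbl[all_cols]), , drop = FALSE]
}

# Helpers to build toy submissions for examples 0 to 3.
quantiles_for <- function(location, horizon, levels = required_outputs$output_type_id) {
  data.frame(location = location, horizon = horizon, output_type = "quantile",
             output_type_id = levels, stringsAsFactors = FALSE)
}
mean_for <- function(location, horizon) {
  data.frame(location = location, horizon = horizon, output_type = "mean",
             output_type_id = NA_real_, stringsAsFactors = FALSE)
}

examples <- list(
  ex0 = rbind(quantiles_for("US", 1L), quantiles_for("02", 1L), mean_for("02", 1L)),
  ex1 = rbind(quantiles_for("US", 1L), quantiles_for("02", 1L, levels = 0.5)),
  ex2 = rbind(quantiles_for("US", 1L), mean_for("02", 1L)),
  ex3 = rbind(quantiles_for("US", 1L), quantiles_for("02", 2L), mean_for("02", 2L))
)

# Number of missing required rows per example; 0 means the submission passes.
n_missing <- sapply(examples, function(tbl) nrow(find_missing_required(tbl)))
n_missing
```

With these toy inputs, example 0 has no missing rows while examples 1, 2 and 3 each report some. In example 3 the sketch flags both location "02" at horizon 1 and "US" at horizon 2, following the "at least the US for any provided horizons" reading above. Optional output types such as the mean never add expectations of their own, which is the difference from example 0.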

LucieContamin commented 1 year ago

I agree with @elray1 on "example 0": "it's OK if the submission includes a mean for location "02" but not "US"".

For the other examples: if I understand correctly, in this case the tasks.json contains:

"location": {
    "required": ["US"],
    "optional": [
        "01",
        "02",
        ..]
},
"target": {
    "required": ["inc death", "inc hosp"],
    "optional": null
},
"horizon": {
    "required": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "optional": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
}

and for example:

"output_type": {
            "mean": {
              "output_type_id": {
                "required": null ,
                "optional": ["NA"]
              },
              "value" : {
                "type": "double",
                "minimum": 0
              }
            },
            "quantile" : {
              "output_type_id": {
                "required": [0.01, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.975, 0.99],
                "optional": null
              },
              "value": {
                "type": "double",
                "minimum": 0
              }
            }
          }

So I agree that, in this case, examples 1, 2 and 3 should fail (or at least return a warning for example 3). In this situation I read the tasks.json information as:

for any task ID combination, whether optional ("02") or required ("US"): all the required values should be present, and other optional values (here, the mean) may be added
if the optional "mean" output_type is provided: all the required task ID values should be provided (for example, all required targets, horizons, etc. associated with location "02")
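
To make the first reading concrete, here is a small sketch that groups a toy submission by task ID combination and reports which required quantile levels each group lacks. The data and the three-level `required_levels` vector are assumptions for illustration, standing in for the 23 levels in the config.

```r
required_levels <- c(0.25, 0.5, 0.75)  # toy stand-in for the 23 required levels

# Toy submission: complete quantiles for "US"; one quantile and a mean for "02".
submission <- data.frame(
  location = c("US", "US", "US", "02", "02"),
  horizon = 1L,
  output_type = c("quantile", "quantile", "quantile", "quantile", "mean"),
  output_type_id = c(0.25, 0.5, 0.75, 0.5, NA),
  stringsAsFactors = FALSE
)

# One group per task ID combination that appears in the submission.
groups <- split(submission, list(submission$location, submission$horizon), drop = TRUE)

# For each group, the required quantile levels that were not submitted.
missing_levels <- lapply(groups, function(g) {
  submitted <- g$output_type_id[g$output_type == "quantile"]
  setdiff(required_levels, submitted)
})

# Keep only the groups that are missing something.
missing_levels <- Filter(length, missing_levels)
missing_levels
```

Here only the "02" group is reported, since it supplies the 0.5 level and a mean but not the other required levels. Note that `setdiff()` compares doubles exactly, so a real implementation would probably need to guard against floating-point differences in the quantile levels.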

My understanding is also that the validation works by "model_tasks" group. So if you want different behavior, you have to split the round information into multiple "model_tasks" groups. For example, in the same round:

model_task 1: required sample output for incident death and hospitalization, for a 12-week horizon (longer horizons are optional) and for the US (other locations are optional)

"target": {
    "required": ["inc death", "inc hosp"],
    "optional": null
},
"horizon": {
    "required": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "optional": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
},
"location": {
    "required": ["US"],
    "optional": [
        "01",
        "02",
        ..]
}
...
"output_type": {
            "sample": {
                "output_type_id": {
                    "min_samples_per_task": 1,
                    "max_samples_per_task": 100,
                    "samples_joint_across": []
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            }
        }

model_task 2: optional quantile output for incident and cumulative death and hospitalization, for a 12-week horizon (longer horizons are optional) and for the US (other locations are optional). However, if one quantile is provided, all 23 quantiles should be provided for all horizons, targets, and associated locations.

"target": {
    "required": null,
    "optional": ["inc death", "inc hosp", "cum death", "cum hosp"]
},
"horizon": {
    "required": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "optional": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
},
"location": {
    "required": ["US"],
    "optional": [
        "01",
        "02",
        ..]
}
...
"output_type": {
    "quantile": {
        "output_type_id": {
            "required": [0.01, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.975, 0.99],
            "optional": null
        },
        "value": {
            "type": "double",
            "minimum": 0
        }
    }
}
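
A rough sketch of how validating by group could look, in base R. It assumes a heavily pared-down view of the two groups above (only output types and required targets), and it assumes rows can be routed to a group by their output_type alone. Neither assumption is taken from the hubValidations implementation, and all names are hypothetical.

```r
# Hypothetical, pared-down view of the two model_tasks groups described above.
model_tasks <- list(
  list(output_types = "sample",   required_targets = c("inc death", "inc hosp")),
  list(output_types = "quantile", required_targets = character(0))
)

# Toy submission: samples for both required targets, quantiles for one optional target.
submission <- data.frame(
  target = c("inc death", "inc hosp", "cum death"),
  output_type = c("sample", "sample", "quantile"),
  stringsAsFactors = FALSE
)

# Assumption: a row belongs to the first group whose output types include its output_type.
group_of <- vapply(submission$output_type, function(ot) {
  which(vapply(model_tasks, function(mt) ot %in% mt$output_types, logical(1)))[1]
}, integer(1))

# Check each group's required targets against only the rows routed to that group.
missing_targets <- lapply(seq_along(model_tasks), function(i) {
  setdiff(model_tasks[[i]]$required_targets, submission$target[group_of == i])
})
missing_targets
```

Because the quantile group has no required targets, submitting quantiles for "cum death" alone does not trigger a missing-target failure, while the sample group still insists on both incident targets.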

Does that make sense and answer the question?

annakrystalli commented 1 year ago

Thank you both!! This has been really helpful...now on to try and come up with as elegant an implementation as possible 😜

Will get back to you if I have follow up questions!
