Closed LucieContamin closed 7 months ago
So I feel a task id that has both required
and optional
set to null
should fail. Do you agree @LucieContamin @elray1 & @nickreich ?
Indeed I thought validate_config
was already checking that but looking into it it's not! So I will add that check if you agree.
I think we should allow to have the required and optional set to null
for the target that are not related to any time information.
For example, for FLU, we have a peak size target that is just the magnitude of the peak for the whole season. So the "horizon" or "target_date" column are empty for that target. However, the round contains other targets (for example incident hospitalization) that are weekly and require specific value in the "horizon" or "target_date" column.
Oh interesting. This feels like a bigger question because it has additional config as well as model output validation implications. I'm not 100% null is right either, it might be that one or the other should be NA
(as is for mean
etc). It feels like something to add to a dev meeting agenda
Here's an artificial/simple example with my understanding of the issue: We have a hub with two task ids:
target
, with two targets, "a"
and "b"
target_a_subtask
, which specifies some further refinement that's only relevant when target == "a"
The question is what to list for the optional and required values for the target_a_subtask
column for the rows with target == "b"
, where that value is not relevant. My take is that it would be good to specify something, so that all submission files do the same thing.
If we'd like submissions for target b to be required, we could maybe do something like:
...
"rounds": [
{
...
"model_tasks": [
{
"task_ids": {
"target": {
"required": ["a"],
"optional": null
},
"target_a_subtask": {
"required": ["req1", "req2"],
"optional": ["opt3", "opt4"]
}
}
},
{
"task_ids": {
"target": {
"required": ["b"],
"optional": null
},
"target_a_subtask": {
"required": [""],
"optional": null
}
}
},
]
...
Where I am not totally sure that works, but the idea is that we are requiring submissions for target b, with an empty string in the target_a_subtask
column for that target. If empty strings are not consistently interpretable, maybe we would need to have "required": ["NA"]
, and submission files would have to put "NA"
in their files.
Or if submission for target b is optional, either/both of the entries under that target group could be moved from "required"
to "optional"
.
If we instead list
...
"rounds": [
{
...
"model_tasks": [
{
"task_ids": {
"target": {
"required": ["a"],
"optional": null
},
"target_a_subtask": {
"required": ["req1", "req2"],
"optional": ["opt3", "opt4"]
}
}
},
{
"task_ids": {
"target": {
"required": ["b"],
"optional": null
},
"target_a_subtask": {
"required": null,
"optional": null
}
}
},
]
...
my question is what value are we expecting to see in the target_a_subtask
column for rows where target == "b"
? We could maybe have a convention that this is to be interpreted as equivalent to the above, with either a ""
or an "NA"
entry under "required"
, so that the requirement status for that target group as a whole is determined by whether or not ["b"]
shows up under "required"
or "optional"
?
This is a very instructive example @elray1
Firstly completely agree that config for values of target
== a
most definitely would need to be split into a separate modelling task from target
== b
.
For me the most natural and simplest approach for coding values of target_a_subtask
when target
== b
would be to use ["NA"]
in one of required
or optional
properties as it feels that's what should appear in the actual file being validated.
Testing on the example @kjsato reported in https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/issues/75 which was using null
in both required
and optional
specifications and causing a validation EXEC error when trying to validate valid value combinations, this seems to now work:
> validate_submission(hub_path=".",file_path="teamsam-modelple/2023-11-26-teamsam-modelple.parquet")
✔ sample: All hub config files are valid.
✔ 2023-11-26-teamsam-modelple.parquet: File exists at path model-output/teamsam-modelple/2023-11-26-teamsam-modelple.parquet.
✔ 2023-11-26-teamsam-modelple.parquet: File name "2023-11-26-teamsam-modelple.parquet" is valid.
✔ 2023-11-26-teamsam-modelple.parquet: File directory name matches `model_id` metadata in file name.
✔ 2023-11-26-teamsam-modelple.parquet: `round_id` is valid.
✔ 2023-11-26-teamsam-modelple.parquet: File is accepted hub format.
✔ 2023-11-26-teamsam-modelple.parquet: Metadata file exists at path model-metadata/teamsam-modelple.yaml.
✔ 2023-11-26-teamsam-modelple.parquet: File could be read successfully.
✔ 2023-11-26-teamsam-modelple.parquet: `round_id_col` name is valid.
✔ 2023-11-26-teamsam-modelple.parquet: `round_id` column "origin_date" contains a single, unique round ID value.
✔ 2023-11-26-teamsam-modelple.parquet: All `round_id_col` "origin_date" values match submission `round_id` from file name.
✔ 2023-11-26-teamsam-modelple.parquet: Column names are consistent with expected round task IDs and std column names.
✔ 2023-11-26-teamsam-modelple.parquet: Column data types match hub schema.
✔ 2023-11-26-teamsam-modelple.parquet: `tbl` contains valid values/value combinations.
✔ 2023-11-26-teamsam-modelple.parquet: All combinations of task ID column/`output_type`/`output_type_id` values are unique.
✔ 2023-11-26-teamsam-modelple.parquet: Required task ID/output type/output type ID combinations all present.
✔ 2023-11-26-teamsam-modelple.parquet: Values in column `value` all valid with respect to modeling task config.
✔ 2023-11-26-teamsam-modelple.parquet: Values in `value` column are non-decreasing as output_type_ids increase for all unique task ID value/output type
combinations of quantile or cdf output types.
ℹ 2023-11-26-teamsam-modelple.parquet: No pmf output types to check for sum of 1. Check skipped.
✔ 2023-11-26-teamsam-modelple.parquet: Submission time is within accepted submission window for round.
by producing the following expanded grid of valid values:
grid %>%
+ filter(is.na(horizon))
# A tibble: 10,800 × 7
origin_date target horizon location age_group output_type output_type_id
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2023-11-26 peak size hosp NA US 0-130 quantile 0.01
2 2023-11-26 peak size hosp NA 01 0-130 quantile 0.01
3 2023-11-26 peak size hosp NA 02 0-130 quantile 0.01
4 2023-11-26 peak size hosp NA 04 0-130 quantile 0.01
5 2023-11-26 peak size hosp NA 05 0-130 quantile 0.01
6 2023-11-26 peak size hosp NA 06 0-130 quantile 0.01
7 2023-11-26 peak size hosp NA 08 0-130 quantile 0.01
8 2023-11-26 peak size hosp NA 09 0-130 quantile 0.01
9 2023-11-26 peak size hosp NA 10 0-130 quantile 0.01
10 2023-11-26 peak size hosp NA 11 0-130 quantile 0.01
# ℹ 10,790 more rows
# ℹ Use `print(n = ...)` to see more rows
Here's the tasks.jaon
that works where ["NA"]
is used in a couple of horizon
specifications:
{
"schema_version": "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/v2.0.0/tasks-schema.json",
"rounds": [
{
"round_id_from_variable": true,
"round_id": "origin_date",
"model_tasks": [
{
"task_ids": {
"origin_date": {
"required": null,
"optional": [
"2023-11-12", "2023-11-19", "2023-11-26", "2023-12-03",
"2023-12-10", "2023-12-17", "2023-12-24", "2023-12-31",
"2024-01-07", "2024-01-14", "2024-01-21", "2024-01-28",
"2024-02-05", "2024-02-11", "2024-02-18", "2024-02-25",
"2024-03-03", "2024-03-10", "2024-03-17", "2024-03-24",
"2024-03-31", "2024-04-07", "2024-04-14", "2024-04-21",
"2024-04-28", "2024-05-05", "2024-05-12"
]
},
"target": {
"required": ["inc hosp"],
"optional": null
},
"horizon": {
"required": [1, 2, 3, 4],
"optional": [0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
},
"location": {
"required": null,
"optional": [
"US",
"01",
"02",
"04",
"05",
"06",
"08",
"09",
"10",
"11",
"12",
"13",
"15",
"16",
"17",
"18",
"19",
"20",
"21",
"22",
"23",
"24",
"25",
"26",
"27",
"28",
"29",
"30",
"31",
"32",
"33",
"34",
"35",
"36",
"37",
"38",
"39",
"40",
"41",
"42",
"44",
"45",
"46",
"47",
"48",
"49",
"50",
"51",
"53",
"54",
"55",
"56",
"72",
"78"
]
},
"age_group":{
"required":["0-130"],
"optional":["0-0.99","1-4","5-17","5-64","18-49","50-64","65-130"]
}
},
"output_type": {
"quantile":{
"output_type_id":{
"required": [
0.01,
0.025,
0.05,
0.1,
0.15,
0.2,
0.25,
0.3,
0.35,
0.4,
0.45,
0.5,
0.55,
0.6,
0.65,
0.7,
0.75,
0.8,
0.85,
0.9,
0.95,
0.975,
0.99
],
"optional":null
},
"value":{
"type":"double",
"minimum":0
}
}
},
"target_metadata": [
{
"target_id": "inc hosp",
"target_name": "Weekly incident RSV hospitalizations",
"target_units": "count",
"target_keys": {
"target": ["inc hosp"]
},
"target_type": "continuous",
"is_step_ahead": true,
"time_unit": "week"
}
]
},
{
"task_ids": {
"origin_date": {
"required": null,
"optional": [
"2023-11-12", "2023-11-19", "2023-11-26", "2023-12-03",
"2023-12-10", "2023-12-17", "2023-12-24", "2023-12-31",
"2024-01-07", "2024-01-14", "2024-01-21", "2024-01-28",
"2024-02-05", "2024-02-11", "2024-02-18", "2024-02-25",
"2024-03-03", "2024-03-10", "2024-03-17", "2024-03-24",
"2024-03-31", "2024-04-07", "2024-04-14", "2024-04-21",
"2024-04-28", "2024-05-05", "2024-05-12"
]
},
"target": {
"required": null,
"optional": ["inc hosp", "cum hosp"]
},
"horizon": {
"required": [1, 2, 3, 4],
"optional": [0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
},
"location": {
"required": null,
"optional": [
"US",
"01",
"02",
"04",
"05",
"06",
"08",
"09",
"10",
"11",
"12",
"13",
"15",
"16",
"17",
"18",
"19",
"20",
"21",
"22",
"23",
"24",
"25",
"26",
"27",
"28",
"29",
"30",
"31",
"32",
"33",
"34",
"35",
"36",
"37",
"38",
"39",
"40",
"41",
"42",
"44",
"45",
"46",
"47",
"48",
"49",
"50",
"51",
"53",
"54",
"55",
"56",
"72",
"78"
]
},
"age_group":{
"required":["0-130"],
"optional":["0-0.99","1-4","5-17","5-64","18-49","50-64","65-130"]
}
},
"output_type": {
"sample":{
"output_type_id":{
"required": null,
"optional":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
},
"value":{
"type":"double",
"minimum":0
}
}
},
"target_metadata": [
{
"target_id": "inc hosp",
"target_name": "Weekly incident RSV hospitalizations",
"target_units": "count",
"target_keys": {
"target": ["inc hosp"]
},
"target_type": "discrete",
"is_step_ahead": true,
"time_unit": "week"
},
{
"target_id": "cum hosp",
"target_name":"Weekly incident cumulative RSV hospitalizations",
"target_units":"count",
"target_keys":{
"target":["cum hosp"]
},
"target_type": "discrete",
"is_step_ahead": true,
"time_unit": "week"
}
]
},
{
"task_ids": {
"origin_date": {
"required": null,
"optional": [
"2023-11-12", "2023-11-19", "2023-11-26", "2023-12-03",
"2023-12-10", "2023-12-17", "2023-12-24", "2023-12-31",
"2024-01-07", "2024-01-14", "2024-01-21", "2024-01-28",
"2024-02-05", "2024-02-11", "2024-02-18", "2024-02-25",
"2024-03-03", "2024-03-10", "2024-03-17", "2024-03-24",
"2024-03-31", "2024-04-07", "2024-04-14", "2024-04-21",
"2024-04-28", "2024-05-05", "2024-05-12"
]
},
"target": {
"required": null,
"optional": ["peak size hosp"]
},
"horizon": {
"required": null,
"optional": ["NA"]
},
"location": {
"required": null,
"optional": [
"US",
"01",
"02",
"04",
"05",
"06",
"08",
"09",
"10",
"11",
"12",
"13",
"15",
"16",
"17",
"18",
"19",
"20",
"21",
"22",
"23",
"24",
"25",
"26",
"27",
"28",
"29",
"30",
"31",
"32",
"33",
"34",
"35",
"36",
"37",
"38",
"39",
"40",
"41",
"42",
"44",
"45",
"46",
"47",
"48",
"49",
"50",
"51",
"53",
"54",
"55",
"56",
"72",
"78"
]
},
"age_group":{
"required":["0-130"],
"optional":["0-0.99","1-4","5-17","5-64","18-49","50-64","65-130"]
}
},
"output_type": {
"quantile":{
"output_type_id":{
"required":[0.01,0.025,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,0.975,0.99],
"optional":[0,1]
},
"value":{
"type":"double",
"minimum":0
}
}
},
"target_metadata": [
{
"target_id": "peak size hosp",
"target_name": "Peak size of hospitalization",
"target_units": "count",
"target_keys": {
"target": ["peak size hosp"]
},
"target_type": "discrete",
"is_step_ahead": false
}
]
},
{
"task_ids": {
"origin_date": {
"required": null,
"optional": [
"2023-11-12", "2023-11-19", "2023-11-26", "2023-12-03",
"2023-12-10", "2023-12-17", "2023-12-24", "2023-12-31",
"2024-01-07", "2024-01-14", "2024-01-21", "2024-01-28",
"2024-02-05", "2024-02-11", "2024-02-18", "2024-02-25",
"2024-03-03", "2024-03-10", "2024-03-17", "2024-03-24",
"2024-03-31", "2024-04-07", "2024-04-14", "2024-04-21",
"2024-04-28", "2024-05-05", "2024-05-12"
]
},
"target": {
"required": null,
"optional": ["peak time hosp"]
},
"horizon": {
"required": null,
"optional": ["NA"]
},
"location": {
"required": null,
"optional": [
"US",
"01",
"02",
"04",
"05",
"06",
"08",
"09",
"10",
"11",
"12",
"13",
"15",
"16",
"17",
"18",
"19",
"20",
"21",
"22",
"23",
"24",
"25",
"26",
"27",
"28",
"29",
"30",
"31",
"32",
"33",
"34",
"35",
"36",
"37",
"38",
"39",
"40",
"41",
"42",
"44",
"45",
"46",
"47",
"48",
"49",
"50",
"51",
"53",
"54",
"55",
"56",
"72",
"78"
]
},
"age_group":{
"required":["0-130"],
"optional":["0-0.99","1-4","5-17","5-64","18-49","50-64","65-130"]
}
},
"output_type": {
"cdf":{
"output_type_id":{
"required":null,
"optional":[1]
},
"value":{
"type":"double",
"minimum":0,
"maximum":1
}
}
},
"target_metadata": [
{
"target_id": "peak time hosp",
"target_name": "Peak timing of hospitalization",
"target_units": "population",
"target_keys": {
"target": ["peak time hosp"]
},
"target_type": "discrete",
"is_step_ahead": true,
"time_unit": "week"
}
]
}
],
"submissions_due": {
"relative_to": "origin_date",
"start": -6,
"end": 100
}
}
]
}
I agree with having "NA" instead of both required and optional set to null
. However, I do wonder what will be the behavior of "NA" in other language like for example Python (for example this reticulate issue: https://github.com/rstudio/reticulate/issues/197) and in a very rare case, what will happen if we want "NA" as a value not the logical NA
meaning (the only case I can think of is Namibia location code, so we might not want to worry about that).
For the coding values of target_a_subtask
when target == b
and b is optional, I would use ["NA"] as required. For example, in the US SMH, we have optional target but if you provide them, it's required to provide some age group or quantile, but some other are optional (see example below).
So for me, if a target is optional all the other tasks id are read as:
null
: at least one value of the listSo, I will put "NA"
in required, because we require NA
if the target is provided and don't accept any other value in the column, does that make sense? and is it not correct?
"task_ids": {
"origin_date": {
"required": ["2023-09-03"],
"optional": null
},
"scenario_id": {
"required": ["A-2023-08-14", "B-2023-08-14", "C-2023-08-14", ...],
"optional": null
},
"location": {
"required": ["US"],
"optional": null
},
"target": {
"required": null,
"optional": ["inc death", "cum death"]
},
"horizon": {
"required": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...],
"optional": null
},
"age_group": {
"required": ["0-130"],
"optional": ["0-4", "5-17", "18-49", ....]
}
}
....
"task_ids": {
"origin_date": {
"required": ["2023-09-03"],
"optional": null
},
"scenario_id": {
"required": ["A-2023-08-14", "B-2023-08-14", "C-2023-08-14", ...],
"optional": null
},
"location": {
"required": null,
"optional": ["US", "01", "02", "04", "05", "06", "08", "09", ...]
},
"target": {
"required": null,
"optional": ["peak size hosp"]
},
"horizon": {
"required": ["NA"],
"optional": null
},
"age_group": {
"required": ["0-130"],
"optional": null
}
}
Hey @LucieContamin !
Re the concerns regarding NA
s as config task ID values.
If the "NA"
which needs to be treated as character is part of a character array with other values, it will correctly be read in as "NA"
from JSON. It only gets converted to NA
when is is the only value in an array as is effectively unboxed when read in.
jsonlite::fromJSON('{"required": ["NA"]}')
#> $required
#> [1] NA
jsonlite::fromJSON('{"required": ["NA", "MA"]}')
#> $required
#> [1] "NA" "MA"
Created on 2024-03-04 with reprex v2.0.2
I have not found a quick solution to the situation where an "NA"
character code is the only value in the array so this could indeed be problematic in this rare situation. Potentially there might be workarounds we could develop if necessary though.
Regarding the NA
s and python, that's a very interesting topic indeed but I don't think it's specific to this issue as we already use NA
s in our median and mean output type ID specifications. This may well be a bridge we have to cross more generally though when python versions of our tools are developed.
Finally re: NA
s in required
vs optional
, what you describes indeed makes sense to me. I kinda feels it might be a good idea to have a whole page in our hubDocs about this topic!
I'm now actually changing my mind about "NA"
because it would cause problems when validating the config of non-character task IDs and we don't want all task IDs to also possibly be character to accomodate the potential for "NA"
as that sort of makes the whole data type checking redundant/weaker.
Instead what I propose is that we indeed use null
for both required
and optional
and handle that internally. I.e. where the expand grid function is failing in https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/issues/75 , if both required
and optional
are NULL when the config is read in, it would convert the value of required
to NA
.
I feel in the end this is the cleanest way and get's round both the "NA"
as a task ID value and configs in python issue too.
It sounds good. I think it makes sense.
Great. If I made a start on implementing it, would you help me with documenting it?
sure!
@annakrystalli thanks for informing, I would appreciate it if you would update it!
I have a tasks.json that contains the information for a particular optional target:
and the tasks.json passes the validation without issues:
For additional information, the target has no horizon information (no date).
However when I try to reproduce the json with the
hubUtils::create_task_id()
function, I have an error:Created on 2023-11-02 with reprex v2.0.2
The function documentation indicates that both cannot NULL so I am not sure on which is incorrect?