Closed. elray1 closed this issue 3 months ago.
It has also been noted that within a hub, models might have different capacities for what dependence structure they are able to capture, so we should potentially allow for a per-model specification of this.
This might also differ across rounds.
Quick Q @elray1 , this isn't something you want to think about for v2.0.0 right?
yeah, i think we're too far from knowing exactly what we want to do here.
It may be simplest to leave this to model submission files rather than trying to encode it in a separate per-model or per-round configuration file. We could say that within one model output submission file (i.e., within each submission round), any rows with `output_type = "sample"` and the same `output_type_id` (i.e., the same sample index) are assumed to come from a single draw from a common joint distribution.
For example, consider a forecast hub with task id variables `target` (values hosp, death), `location` (MA, CA), and `horizon` (1, 2):
A model that estimates a single joint distribution for all horizons, both locations, and both targets might submit something like the following to represent a collection of 3 samples from that joint distribution:
| model_id | target | location | horizon | output_type | output_type_id | value |
|----------|--------|----------|---------|-------------|----------------|-------|
| all_joint | hosp | MA | 1 | sample | 1 | 3 |
| all_joint | hosp | MA | 1 | sample | 2 | 4 |
| all_joint | hosp | MA | 1 | sample | 3 | 5 |
| all_joint | hosp | MA | 2 | sample | 1 | 3 |
| all_joint | hosp | MA | 2 | sample | 2 | 4 |
| all_joint | hosp | MA | 2 | sample | 3 | 5 |
| all_joint | hosp | CA | 1 | sample | 1 | 3 |
| all_joint | hosp | CA | 1 | sample | 2 | 4 |
| all_joint | hosp | CA | 1 | sample | 3 | 5 |
| all_joint | hosp | CA | 2 | sample | 1 | 3 |
| all_joint | hosp | CA | 2 | sample | 2 | 4 |
| all_joint | hosp | CA | 2 | sample | 3 | 5 |
| all_joint | death | MA | 1 | sample | 1 | 3 |
| all_joint | death | MA | 1 | sample | 2 | 4 |
| all_joint | death | MA | 1 | sample | 3 | 5 |
| all_joint | death | MA | 2 | sample | 1 | 3 |
| all_joint | death | MA | 2 | sample | 2 | 4 |
| all_joint | death | MA | 2 | sample | 3 | 5 |
| all_joint | death | CA | 1 | sample | 1 | 3 |
| all_joint | death | CA | 1 | sample | 2 | 4 |
| all_joint | death | CA | 1 | sample | 3 | 5 |
| all_joint | death | CA | 2 | sample | 1 | 3 |
| all_joint | death | CA | 2 | sample | 2 | 4 |
| all_joint | death | CA | 2 | sample | 3 | 5 |
In this submission, all of the `value`s in rows with an `output_type_id` of 1 should be regarded as a vector that is a single draw from the joint predictive distribution across target, location, and horizon.
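To illustrate how a downstream user would consume this indexing, here is a small sketch in Python/pandas (illustrative only, not hubverse tooling) that reconstructs the joint draws from rows like those above: each distinct `output_type_id` yields one draw, a vector across all task id combinations.

```python
import pandas as pd

# Toy version of the all_joint submission above: 2 targets x 2 locations x
# 2 horizons, with 3 samples indexed 1-3 shared across every task.
rows = [
    (target, location, horizon, idx, value)
    for target in ["hosp", "death"]
    for location in ["MA", "CA"]
    for horizon in [1, 2]
    for idx, value in [(1, 3), (2, 4), (3, 5)]
]
df = pd.DataFrame(
    rows, columns=["target", "location", "horizon", "output_type_id", "value"]
)

# One joint draw per sample index: a vector across target/location/horizon.
draws = {
    idx: group.set_index(["target", "location", "horizon"])["value"]
    for idx, group in df.groupby("output_type_id")
}

print(len(draws))     # 3 draws
print(len(draws[1]))  # each draw has 2*2*2 = 8 components
```

Under this convention, sample index 1 in every row belongs to the same underlying realization, which is exactly what `draws[1]` collects.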
On the other hand, a model that obtains separate fits for each target and location, but produces sample trajectories across the horizons, would have a submission with distinct `output_type_id`s for the different target/location combinations that are shared across the different horizons:
| model_id | target | location | horizon | output_type | output_type_id | value |
|----------|--------|----------|---------|-------------|----------------|-------|
| horizon_joint | hosp | MA | 1 | sample | 1 | 3 |
| horizon_joint | hosp | MA | 1 | sample | 2 | 4 |
| horizon_joint | hosp | MA | 1 | sample | 3 | 5 |
| horizon_joint | hosp | MA | 2 | sample | 1 | 3 |
| horizon_joint | hosp | MA | 2 | sample | 2 | 4 |
| horizon_joint | hosp | MA | 2 | sample | 3 | 5 |
| horizon_joint | hosp | CA | 1 | sample | 4 | 3 |
| horizon_joint | hosp | CA | 1 | sample | 5 | 4 |
| horizon_joint | hosp | CA | 1 | sample | 6 | 5 |
| horizon_joint | hosp | CA | 2 | sample | 4 | 3 |
| horizon_joint | hosp | CA | 2 | sample | 5 | 4 |
| horizon_joint | hosp | CA | 2 | sample | 6 | 5 |
| horizon_joint | death | MA | 1 | sample | 7 | 3 |
| horizon_joint | death | MA | 1 | sample | 8 | 4 |
| horizon_joint | death | MA | 1 | sample | 9 | 5 |
| horizon_joint | death | MA | 2 | sample | 7 | 3 |
| horizon_joint | death | MA | 2 | sample | 8 | 4 |
| horizon_joint | death | MA | 2 | sample | 9 | 5 |
| horizon_joint | death | CA | 1 | sample | 10 | 3 |
| horizon_joint | death | CA | 1 | sample | 11 | 4 |
| horizon_joint | death | CA | 1 | sample | 12 | 5 |
| horizon_joint | death | CA | 2 | sample | 10 | 3 |
| horizon_joint | death | CA | 2 | sample | 11 | 4 |
| horizon_joint | death | CA | 2 | sample | 12 | 5 |
In this submission, the two `value`s in rows with an `output_type_id` of 1 should be regarded as a vector that is a single draw from the joint distribution across horizons 1 and 2, specific to the hosp target in location MA.
Finally, a model that obtains separate marginal distributions for each target, location, and horizon would use a distinct `output_type_id` in every row of its submission to indicate that the samples are all different draws from different distributions:
| model_id | target | location | horizon | output_type | output_type_id | value |
|----------|--------|----------|---------|-------------|----------------|-------|
| all_separate | hosp | MA | 1 | sample | 1 | 3 |
| all_separate | hosp | MA | 1 | sample | 2 | 4 |
| all_separate | hosp | MA | 1 | sample | 3 | 5 |
| all_separate | hosp | MA | 2 | sample | 4 | 3 |
| all_separate | hosp | MA | 2 | sample | 5 | 4 |
| all_separate | hosp | MA | 2 | sample | 6 | 5 |
| all_separate | hosp | CA | 1 | sample | 7 | 3 |
| all_separate | hosp | CA | 1 | sample | 8 | 4 |
| all_separate | hosp | CA | 1 | sample | 9 | 5 |
| all_separate | hosp | CA | 2 | sample | 10 | 3 |
| all_separate | hosp | CA | 2 | sample | 11 | 4 |
| all_separate | hosp | CA | 2 | sample | 12 | 5 |
| all_separate | death | MA | 1 | sample | 13 | 3 |
| all_separate | death | MA | 1 | sample | 14 | 4 |
| all_separate | death | MA | 1 | sample | 15 | 5 |
| all_separate | death | MA | 2 | sample | 16 | 3 |
| all_separate | death | MA | 2 | sample | 17 | 4 |
| all_separate | death | MA | 2 | sample | 18 | 5 |
| all_separate | death | CA | 1 | sample | 19 | 3 |
| all_separate | death | CA | 1 | sample | 20 | 4 |
| all_separate | death | CA | 1 | sample | 21 | 5 |
| all_separate | death | CA | 2 | sample | 22 | 3 |
| all_separate | death | CA | 2 | sample | 23 | 4 |
| all_separate | death | CA | 2 | sample | 24 | 5 |
The advantage of this proposal is that it has a clean and interpretable data structure. A disadvantage is that it seems prone to user error by submitting teams, and there is essentially no way to validate that a submission's `output_type_id` values (i.e., sample indexes) accurately capture the dependence structure represented by the model.
If we go with this proposal, I think there would be no changes to make to the schema here, and so this would essentially become an issue for hubDocs.
I like the proposal. seems conceptually sound and I like that it doesn't add anything new to the schemas.
After thinking about this a little more, I think maybe we actually should consider making a modification to the tasks schema and implement validations slightly differently for sample output types than for the other output types.
As a motivating example, consider a hub that wants to collect up to 100 samples for a single target in each combination of 56 locations and 4 horizons. Let's also say that this hub does not want to force any particular dependence structure on the submitting models. I think this is a pretty minimal example of what this might look like for a flusight forecast hub or a covid hub reboot.
I'll first describe what a naive approach to setting this up might look like within our existing setup, note a couple of problems I see with it, and then describe my suggested approach.
Working within our current setup for `tasks.json`, the hub needs to specify a list of all possible/accepted sample indices as `optional` values of the `output_type_id` for this target. As we saw in the third example above, a model that produces separate marginal distributions at each combination of location and horizon (such as `UMass-gbq`) would need a different sample index for each of the `56*4*100 = 22400` samples it submits. So the hub needs to list a vector of integers from 1 to 22400 in its `tasks.json` file:
```json
...
"model_tasks": [{
  "task_ids": {
    "origin_date": {
      "required": null,
      "optional": ["2022-11-28", "2022-12-05", "2022-12-12"]
    },
    "target": {
      "required": ["inc covid hosp"],
      "optional": null
    },
    "horizon": {
      "required": null,
      "optional": [1, 2, 3, 4]
    },
    "location": {
      "required": null,
      "optional": ["US", "01", ..., "78"]
    }
  },
  "output_type": {
    "sample": {
      "output_type_id": {
        "required": null,
        "optional": [1, 2, 3, 4, 5, ..., 22399, 22400]
      },
      "value": {
        "type": "integer",
        "minimum": 0
      }
    }
  },
  ...
```
There are a couple of problems here:

1. It makes an `expand_grid` across optional values of task id variables and `output_type_id`s infeasible: that expand grid now has `56*4*22400 = 5,017,600` rows.
2. Those rows, which in a formal sense are the optionally acceptable rows in a submission, don't really meaningfully capture what might go into a valid submission.

The best way around this that I have thought of so far is to introduce a specification of expected `output_type_id`s that is specialized to the `sample` output type: instead of listing required and optional values for the `output_type_id`, we specify how many samples are expected for each modeling task. I'm not exactly sure of the best way to set this up, but a couple of rough ideas are in the subsections below. Note that this would require special validation logic for sample output types that is different from the validation logic for other output types. In particular, I still don't think we would want to use explicit `expand_grid` kind of logic.
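As a rough illustration of the count-based alternative, a validator could group a submission by its task id columns and compare the number of distinct sample indices against the configured bounds, with no `expand_grid` involved. This is a sketch in Python/pandas; `check_sample_counts` and its parameters are illustrative, not part of any hubverse package.

```python
import pandas as pd

def check_sample_counts(df, task_id_cols, min_samples, max_samples):
    """Return the task-id combinations whose distinct sample count is out of bounds."""
    counts = df.groupby(task_id_cols)["output_type_id"].nunique()
    return counts[(counts < min_samples) | (counts > max_samples)]

# Two tasks (MA/1 and CA/1), 3 samples each; require between 2 and 100 per task.
df = pd.DataFrame({
    "location": ["MA"] * 3 + ["CA"] * 3,
    "horizon": [1, 1, 1, 1, 1, 1],
    "output_type_id": [1, 2, 3, 4, 5, 6],
    "value": [3, 4, 5, 3, 4, 5],
})
bad = check_sample_counts(df, ["location", "horizon"], 2, 100)
print(bad.empty)  # True: both tasks have 3 samples, within [2, 100]
```

Note that the check never enumerates acceptable rows; it only counts distinct indices per task, so the 5-million-row grid from the naive approach never materializes.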
**`"required"` and `"optional"` entries similar to existing specifications**

Here's what this might look like for a hub that wanted to accept up to 100 samples:
```json
...
"output_type": {
  "sample": {
    "output_type_id": {
      "required": null,
      "optional": {"n_samples_per_task": 100}
    },
    "value": {
      "type": "integer",
      "minimum": 0
    }
  }
},
...
```
Maybe if the hub wanted to require submission of at least 100 samples per task and up to 1000 samples per task, they would have the following?
```json
...
"output_type": {
  "sample": {
    "output_type_id": {
      "required": {"n_samples_per_task": 100},
      "optional": {"n_samples_per_task": 1000}
    },
    "value": {
      "type": "integer",
      "minimum": 0
    }
  }
},
...
```
This kind of fits with our current setup, but it doesn't feel very clear to me.
Here's an idea for what this might look like for a hub accepting up to 100 samples:
```json
...
"output_type": {
  "sample": {
    "output_type_id": {
      "min_samples_per_task": 0,
      "max_samples_per_task": 100
    },
    "value": {
      "type": "integer",
      "minimum": 0
    }
  }
},
...
```
The idea here is that by setting `"min_samples_per_task": 0`, we make submission of samples optional.
And for a hub accepting between 100 and 1000 samples per task:
```json
...
"output_type": {
  "sample": {
    "output_type_id": {
      "min_samples_per_task": 100,
      "max_samples_per_task": 1000
    },
    "value": {
      "type": "integer",
      "minimum": 0
    }
  }
},
...
```
If a hub wanted to require submission of exactly 100 samples, they could set both min and max to 100.
Attempting to summarize the main points from in-person discussion:

- For the other output types, `value`s are summaries of a marginal distribution per task. Having a mix of different estimated distributions in the same output type seems potentially problematic.
- Using the `output_type_id` to specify dependence structure is potentially confusing, and we think it is likely that some participating teams may not understand or correctly implement the sample indexing that matches the dependence structure actually captured by their model. We would like to be able to validate that the sample indexing in a submission is correct.

At a high level, there are two ideas here, depending on whether or not dependence structure is enforced at the hub level:

- The hub could specify the expected dependence structure in the `tasks.json` config file on a per-round basis. It might be up to hub administrators whether or not they want to do this. For example, a hub could set `'samples_joint_across': ['horizon']` to say that samples should be in the form of "sample trajectories" from a joint distribution across horizons, or `'samples_joint_across': ['horizon', 'location']` to say that samples should be from a single joint distribution for all horizons and locations. The values in this vector would need to be task id variable names from that round. It's not immediately clear to me whether this would be specified at the round level, or would need to be specified within each "task id block" (i.e., each entry in the task ids array) defining a group of tasks with the same sets of possible values of all task id variables. On one hand, we know that within each round, the same task id variables are present. On the other hand, samples might be accepted model outputs for only some of the task id blocks. But it would seem very confusing to allow different dependence structures for different sets of targets...
- Alternatively, the dependence structure that a model's samples capture could be specified in that model's config file.

With either of the above, in principle we should then be able to validate that the sample indices in a model output submission file correctly match up with the joint dependence structure that is either specified by the hub or by the model's config file.
The idea here is to introduce different output types that are specific to the different levels of dependence captured in samples:

- An output type for samples from marginal distributions, e.g. `marginal_sample`, `sample_marginal`, or just `sample`. Note that here, the value in the `output_type_id` column would essentially be ignorable, and we might as well set it to `NA`.
- A single additional output type, e.g. `joint_sample`, used for any sample that is not from a marginal distribution. We would probably still need the specifications in the metadata config files described under the previous theme to further specify what kind of dependence structure is captured.
- More specific output types such as `trajectory_sample` or `sample_trajectory`. This would avoid the need to specify dependence structures in metadata files, but has the disadvantage that we would need to introduce a new output type for each new kind of dependence structure that a hub wants to collect.

Just a small comment that I don't think we need to force the sample id to be numeric, as we don't expect to perform any numeric operations on it -- it may in many cases be numeric for convenience, but I think it should be allowed to be any string that links rows in the table.
follow-up questions/thoughts about the data type for sample ids:
Trying to get a little more concrete about what it would look like to specify dependence structures in config files, to support discussion/decision-making about whether or not we want to go that route. From the comments above, the goal is to allow specifying this at the hub level and/or at the model level.
I'll illustrate with an example hub that wants to collect up to 100 samples for two targets (hospitalizations, deaths) in each combination of 56 locations and 4 horizons. An excerpt of this hub's `tasks.json` config file might look like this:
```json
...
"rounds": [{
  "round_id_from_variable": true,
  "round_id": "origin_date",
  "model_tasks": [{
    "task_ids": {
      "origin_date": {
        "required": null,
        "optional": ["2022-11-28", "2022-12-05", "2022-12-12"]
      },
      "target": {
        "required": ["hosps", "deaths"],
        "optional": null
      },
      "horizon": {
        "required": null,
        "optional": [1, 2, 3, 4]
      },
      "location": {
        "required": null,
        "optional": ["US", "01", ..., "78"]
      }
    },
    "output_type": {
      "sample": {
        "output_type_id": {
          "min_samples_per_task": 1,
          "max_samples_per_task": 100
        },
        "value": {
          "type": "integer",
          "minimum": 0
        }
      }
    }
  }]
}],
...
```
The proposal is to introduce a new field as an attribute of the round object, at the same level as `"round_id"` and `"model_tasks"`, that says what the expected dependence structure is for sample submissions in that round. This field has the form of an array of task id variable names, specifying that models should produce samples from a single joint distribution across all values of those task ids. For instance, to specify that samples are expected to be "trajectories" across forecast horizons, the config would look like:
```json
"rounds": [{
  "round_id_from_variable": true,
  "round_id": "origin_date",
  "model_tasks": [{
    ...
  }],
  "samples_joint_across": ["horizon"]
}],
...
```
To illustrate a less common use case -- if samples are expected to be from a joint distribution across all horizons and locations (e.g. to support aggregation from the state level to the regional level), the configuration would look like:
```json
"rounds": [{
  "round_id_from_variable": true,
  "round_id": "origin_date",
  "model_tasks": [{
    ...
  }],
  "samples_joint_across": ["horizon", "location"]
}],
```
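To make the intended semantics concrete, here is a rough sketch (Python/pandas; `check_dependence` is a hypothetical helper, not hubverse tooling) of the kind of check a `samples_joint_across` setting would enable: a sample index must not span different values of any task id variable outside the joint-across set.

```python
import pandas as pd

def check_dependence(df, task_id_cols, joint_across):
    """Return sample indices whose rows vary on a non-joint-across task id."""
    non_joint = [c for c in task_id_cols if c not in joint_across]
    # For each sample index, count distinct values of each non-joint task id.
    per_index = df.groupby("output_type_id")[non_joint].nunique()
    # Offending indices: a single "draw" spans more than one independent unit.
    return per_index[(per_index > 1).any(axis=1)].index.tolist()

task_ids = ["location", "horizon"]

# Valid for samples_joint_across = ["horizon"]: each index stays in one location.
ok = pd.DataFrame({
    "location": ["MA", "MA", "CA", "CA"],
    "horizon": [1, 2, 1, 2],
    "output_type_id": [1, 1, 2, 2],
})
# Invalid: sample index 1 is reused across locations.
bad = pd.DataFrame({
    "location": ["MA", "CA"],
    "horizon": [1, 1],
    "output_type_id": [1, 1],
})
print(check_dependence(ok, task_ids, ["horizon"]))   # []
print(check_dependence(bad, task_ids, ["horizon"]))  # [1]
```

The same grouping also works in reverse: if an index never spans multiple horizons even though `horizon` is listed in `samples_joint_across`, the hub could flag that as a warning.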
A few notes:

- If a hub wants samples to be independent across all tasks, maybe they could set `"samples_joint_across": []`? And if they don't want to enforce any particular dependence structure for samples, they could set `"samples_joint_across": null`?
- Hubs that don't collect samples would not need a `samples_joint_across` field in their round config. If a hub does collect samples in a given round, should this field be required, or do we just infer a missing `samples_joint_across` field to be `null`?
- This setup does not allow a hub to request, e.g., trajectories (`"samples_joint_across": ["horizon"]`) for the `hosps` target but independent samples (`"samples_joint_across": []`) for the `deaths` target. I think that this is not a major limitation, and it could potentially be confusing to allow for more flexibility.

According to the proposal above, if a hub wants to allow models to submit samples without specifying the desired dependence structure, it could specify `"samples_joint_across": null`. In that case, we might like to have a way to validate that the sample indices in a model output submission accurately represent the dependence structure that the model claims to capture. To enable that validation, we would need to collect some metadata from the model saying what joint distribution(s) its samples come from. One possibility is that we could collect this in the model's metadata file.
Here's an example of what this might look like (not sure I'm getting my YAML set up right). `team1-modela.yml`:
```yaml
model_id: team1-modela
model_name: Example model
samples_joint_across: horizon, location
```
...or if this is round-specific:
```yaml
model_id: team1-modela
model_name: Example model
samples_joint_across:
  round1: horizon
  round2: horizon
  round3: horizon, location
```
I don't love the idea of asking teams to update their model metadata file each round with this kind of round-specific information -- we've previously said that any round-specific information would be in their free-form abstract. But I'm not sure there's a way around this if we want to be able to validate correctness of the sample indices in model output submissions in cases where the hub does not specify what dependence structure to use.
Suggestion to include this as an attribute of the `sample` output type:

```json
"output_type": {
  "sample": {
    "output_type_id": {
      "min_samples_per_task": 1,
      "max_samples_per_task": 100,
      "samples_joint_across": ["horizon"]
    },
    "value": {
      "type": "integer",
      "minimum": 0
    }
  }
}
```
Summing up discussion and decisions from today's call:

- Sample ids may be either integer or character; `hubUtils::connect_hub` has an argument to control the data type for this column anyway (though for purposes of validating submissions we'd still want to be clear about types).
- We will keep a single sample `output_type`, `"sample"`. We will not introduce other output types like `"sample_trajectory"`.
- We will add `samples_joint_across` as a property of `output_type_id`. We will also add support for models to specify sample dependence structure in their model metadata YAML file. Note that models should only be providing this metadata if the dependence structure is not specified by the hub.
I have a small question about that part: how would you enter the information for a model without sample dependence structure (each id is unique in every row)? Should we use `NA` or `NULL`?
I mean something like:
```yaml
model_id: team1-modela
model_name: Example model
samples_joint_across:
  round1: horizon
  round2: horizon, location
  round3: NA
```
It seems like it may be possible to use `[]` to specify an empty array? https://stackoverflow.com/questions/5110313/how-do-i-create-an-empty-array-in-yaml
Thanks! That makes sense and would match the json file format.
We have been discussing implementing this sample framework for an upcoming SMH round, and a few questions/comments have arisen through these discussions that I think might be useful to consider. Here's my attempt to summarize:
Under the current framework, teams must provide a distinct `output_type_id` for each independent sample. However, from the perspective of the teams, this may be more complicated to implement. Imagine a team makes predictions for each location and target independently. Then, under our current framework, the output should look like this (following the example above from @elray1):
| model_id | target | location | horizon | output_type | output_type_id | value |
|----------|--------|----------|---------|-------------|----------------|-------|
| horizon_joint | hosp | MA | 1 | sample | 1 | 3 |
| horizon_joint | hosp | MA | 1 | sample | 2 | 4 |
| horizon_joint | hosp | MA | 1 | sample | 3 | 5 |
| horizon_joint | hosp | MA | 2 | sample | 1 | 3 |
| horizon_joint | hosp | MA | 2 | sample | 2 | 4 |
| horizon_joint | hosp | MA | 2 | sample | 3 | 5 |
| horizon_joint | hosp | CA | 1 | sample | 4 | 3 |
| horizon_joint | hosp | CA | 1 | sample | 5 | 4 |
| horizon_joint | hosp | CA | 1 | sample | 6 | 5 |
| horizon_joint | hosp | CA | 2 | sample | 4 | 3 |
| horizon_joint | hosp | CA | 2 | sample | 5 | 4 |
| horizon_joint | hosp | CA | 2 | sample | 6 | 5 |
| horizon_joint | death | MA | 1 | sample | 7 | 3 |
| horizon_joint | death | MA | 1 | sample | 8 | 4 |
| horizon_joint | death | MA | 1 | sample | 9 | 5 |
| horizon_joint | death | MA | 2 | sample | 7 | 3 |
| horizon_joint | death | MA | 2 | sample | 8 | 4 |
| horizon_joint | death | MA | 2 | sample | 9 | 5 |
| horizon_joint | death | CA | 1 | sample | 10 | 3 |
| horizon_joint | death | CA | 1 | sample | 11 | 4 |
| horizon_joint | death | CA | 1 | sample | 12 | 5 |
| horizon_joint | death | CA | 2 | sample | 10 | 3 |
| horizon_joint | death | CA | 2 | sample | 11 | 4 |
| horizon_joint | death | CA | 2 | sample | 12 | 5 |
Presumably, when a team generates these samples, they have samples 1:n for each unique target/location and would have to perform a second step to transform `output_type_id` to the unique values we are asking for (e.g., in the above example, samples 1, 2, 3 for CA hosp would become 4, 5, 6). This seems to (1) put more burden on teams, and (2) create more possibilities for error.
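For illustration, the re-indexing step described above (whether done by teams or as post-processing) might look like this sketch in Python/pandas; `reindex_samples` is a hypothetical helper, not part of hubverse tooling.

```python
import pandas as pd

def reindex_samples(df, task_id_cols, joint_across):
    """Make sample indices distinct across independent units.

    Rows that differ on any task id NOT listed in `joint_across` come from
    independent distributions, so their sample indices must not collide.
    """
    unit_cols = [c for c in task_id_cols if c not in joint_across]
    # Combine the independent-unit key with the original per-unit index,
    # then factorize to get distinct integers starting at 1.
    key = (df[unit_cols].astype(str).agg("|".join, axis=1)
           + "|" + df["output_type_id"].astype(str))
    out = df.copy()
    out["output_type_id"] = pd.factorize(key)[0] + 1
    return out

# hosp/MA and hosp/CA each use local sample ids 1-3, joint across horizon.
df = pd.DataFrame({
    "target": ["hosp"] * 6,
    "location": ["MA", "MA", "MA", "CA", "CA", "CA"],
    "horizon": [1, 1, 1, 1, 1, 1],
    "output_type_id": [1, 2, 3, 1, 2, 3],
    "value": [3, 4, 5, 3, 4, 5],
})
out = reindex_samples(df, ["target", "location", "horizon"], ["horizon"])
print(sorted(out["output_type_id"]))  # [1, 2, 3, 4, 5, 6]
```

Because `horizon` is excluded from the key, a trajectory's rows at different horizons keep a shared index, while MA and CA samples are pushed onto disjoint index ranges.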
Possible alternative: since we now have information about dependence structure in the metadata, can we perform this transformation step in post-processing? In this instance, all teams, regardless of model dependence structure, would have the same number of samples (i.e., all submissions would look like the example below), and the assumption would be that samples are independent across `task_id` variables that are not included in the `samples_joint_across` metadata field. I see downsides with this approach too (e.g., the information is not included in the submission directly), but it might be worth discussing further.
| model_id | target | location | horizon | output_type | output_type_id | value |
|----------|--------|----------|---------|-------------|----------------|-------|
| all_structures | hosp | MA | 1 | sample | 1 | 3 |
| all_structures | hosp | MA | 1 | sample | 2 | 4 |
| all_structures | hosp | MA | 1 | sample | 3 | 5 |
| all_structures | hosp | MA | 2 | sample | 1 | 3 |
| all_structures | hosp | MA | 2 | sample | 2 | 4 |
| all_structures | hosp | MA | 2 | sample | 3 | 5 |
| all_structures | hosp | CA | 1 | sample | 1 | 3 |
| all_structures | hosp | CA | 1 | sample | 2 | 4 |
| all_structures | hosp | CA | 1 | sample | 3 | 5 |
| all_structures | hosp | CA | 2 | sample | 1 | 3 |
| all_structures | hosp | CA | 2 | sample | 2 | 4 |
| all_structures | hosp | CA | 2 | sample | 3 | 5 |
| all_structures | death | MA | 1 | sample | 1 | 3 |
| all_structures | death | MA | 1 | sample | 2 | 4 |
| all_structures | death | MA | 1 | sample | 3 | 5 |
| all_structures | death | MA | 2 | sample | 1 | 3 |
| all_structures | death | MA | 2 | sample | 2 | 4 |
| all_structures | death | MA | 2 | sample | 3 | 5 |
| all_structures | death | CA | 1 | sample | 1 | 3 |
| all_structures | death | CA | 1 | sample | 2 | 4 |
| all_structures | death | CA | 1 | sample | 3 | 5 |
| all_structures | death | CA | 2 | sample | 1 | 3 |
| all_structures | death | CA | 2 | sample | 2 | 4 |
| all_structures | death | CA | 2 | sample | 3 | 5 |
A separate question concerns models with multiple sources of variation in their samples, e.g., samples generated under different parameter sets (perhaps recorded in an additional `parameter_set` column). It is currently not clear what the `output_type_id` column is describing in that case (i.e., is `parameter_set` the correct column to add, or should it be `stochastic_run`?).
| model_id | target | location | horizon | output_type | parameter_set | output_type_id | value |
|----------|--------|----------|---------|-------------|---------------|----------------|-------|
| multiple_variation | hosp | MA | 1 | sample | 1 | 1 | 3 |
| multiple_variation | hosp | MA | 1 | sample | 1 | 2 | 4 |
| multiple_variation | hosp | MA | 1 | sample | 1 | 3 | 5 |
| multiple_variation | hosp | MA | 1 | sample | 2 | 1 | 3 |
| multiple_variation | hosp | MA | 1 | sample | 2 | 2 | 4 |
| multiple_variation | hosp | MA | 1 | sample | 2 | 3 | 5 |
This comment by @eahowerton is close to what I would have given as an answer to @elray1's question above:
> I am having a hard time imagining a situation where it's really important, or even desirable, to allow to store some information in the contents of a string sample id?
Given that we don't need it to be a number, it seems to me that allowing more flexibility here is potentially of benefit to the hubs. E.g., in the example above, samples could be given any labels that teams use for draws from a joint distribution; a team could have
| model_id | target | location | horizon | output_type | output_type_id | value |
|----------|--------|----------|---------|-------------|----------------|-------|
| horizon_joint | hosp | MA | 1 | sample | MA_1 | 3 |
| horizon_joint | hosp | MA | 1 | sample | MA_2 | 4 |
| horizon_joint | hosp | MA | 1 | sample | MA_3 | 5 |
| horizon_joint | hosp | CA | 1 | sample | CA_1 | 3 |
| horizon_joint | hosp | CA | 1 | sample | CA_2 | 4 |
| horizon_joint | hosp | CA | 1 | sample | CA_3 | 5 |
if these are independent by location or
| model_id | target | location | horizon | output_type | output_type_id | value |
|----------|--------|----------|---------|-------------|----------------|-------|
| horizon_joint | hosp | MA | 1 | sample | 1 | 3 |
| horizon_joint | hosp | MA | 1 | sample | 2 | 4 |
| horizon_joint | hosp | MA | 1 | sample | 3 | 5 |
| horizon_joint | hosp | CA | 1 | sample | 1 | 3 |
| horizon_joint | hosp | CA | 1 | sample | 2 | 4 |
| horizon_joint | hosp | CA | 1 | sample | 3 | 5 |
if they are joint by location. Or they could use the `output_type_id` value to indicate where samples are from different parameter sets (`p1`, `p2`, ...) or stochasticity (`s1`, `s2`, ...). A common `output_type_id` would then always indicate that samples come from a joint distribution, i.e., it combines multiple rows into a unit of observation -- this could also be used in a check when validating against the config file.

If we're being very explicit in the config file as suggested above, then this may not be necessary. But I imagine we may not be able to cater for all possible use cases, so it's a question of how prescriptive we want to be here vs. allowing some flexibility -- and whether we're worried about creating confusion if common `output_type_id` values sometimes indicate draws from a joint distribution and sometimes not, depending on the config file.
Good questions and ideas. Here are a few thoughts:

- On validation: if sample indices don't match the declared dependence structure, a submission could be rejected with a message along the lines of: "sample indices in the `output_type_id` column of your submission were shared across locations. Please update your submission to use different sample indices for different locations."
- With string ids like `MA_1` and `CA_1`, it is immediately visible that the `output_type_id`s are different in different locations. (I'd note that from here, it's not hard to get to distinct integers, e.g. in R via `as.integer(factor(...))` -- but I do see that this discards some information; see the next point.)
- Parameter set and stochastic replicate information could be encoded in the `output_type_id`. In Seb's notation, the `output_type_id` might then have values like `"p1_s1"`, `"p1_s2"`, `"p2_s3"`; or, if these are independent across locations, `"MA_p1_s1"` and `"CA_p1_s1"`. If we want this information to be usable, I think we would need to validate that these `output_type_id` values are consistently and correctly formatted across models at submission time, so that the hub can, if it wants, do some string parsing to extract the parameter set and stochastic replicate indices. So maybe the hub config would have to provide a regex that's used to validate the sample indices upon submission? If the hub doesn't care about storing information in the `output_type_id` column, it could just specify a regex that matches any string up to some maximum length.
- Alternatively, this information could go in separate `parameter_set` and `stochastic_replicate` columns, which determine something like an `output_type_id` in combination. The hub config would have to specify something like which columns put together specify an `output_type_id`, and what valid combinations of values across those columns look like. This could be kind of like what we've set up for target keys.
- Or should hubs simply allow free-form string `output_type_ids` and specify a regex saying how they want information to be represented in those strings?

Summing up the main points from discussion about this today:
- We will keep the single `output_type_id` column, with the goal that it may have either integer or character values for the sample output type. We will not provide regex validation for information stored in strings, and we will not handle validating other columns with this information. The validations provided by default will be minimal checks that sample indices are consistent with the dependence structure specified by the hub or the model's metadata file:
  1. The same sample index (value of `output_type_id`) should not appear in rows with different values for any task id variable that is not listed in a `samples_joint_across` field in the config or metadata file. (Or, applying De Morgan's law: all sample indices should be distinct across rows with different values for any task id variable that is not listed in a `samples_joint_across` field.) Error if this condition is not met.
  2. The same sample index should appear in rows with different values of task id variables that are listed in a `samples_joint_across` field. I'm not sure if we want an error or maybe just a warning/informational message if this condition is not met. To-do: figure out how careful to be in this check (e.g., if `samples_joint_across` includes `"horizon"`, maybe we should expect any sample index that appears for one horizon to also appear at all other horizons, for fixed values of the non-joint-across task id variables).
- Models may include extra columns such as `parameter_set` and `stochastic_replicate`. Hubverse default tooling would not validate the contents of these columns, but a hub could add a custom validation step if desired. That said, we might need to get something into the schema config for the hub that says "these extra columns are allowed/should not result in an error".

Is `parameter_set` just a task id? This might depend on how a hub is thinking about `parameter_set`, but my opinion is that in general, the parameter set is not really a task id variable so much as a piece of metadata about how a sample was generated. Some considerations related to this:

- If a hub treated `parameter_set` as a task id variable, with 100 possible values for the parameter set, we could end up collecting:
Maybe an exception to these points is if a hub assigns some meaning to each parameter set that is agreed upon by all participating models. This would turn a parameter set into something more like a scenario, which does feel like a task id.
In this comment above, @elray1 said that the decision was made to
> add samples_joint_across as a property of output_type_id. we will also add support for models to specify sample dependence structure in their model metadata YAML file. note that models should only be providing this metadata if the dependence structure is not specified by a hub.
However, I am concerned that this does not leave room to specify that a model submission has a different dependence structure than what the hub specifies. My proposal (in discussion with @bsweger) is that the hub specification in the tasks-config file serves as the "default" dependency specification, and that a model could indicate that it has a different specification by using the model-metadata specification proposed here.
One question is whether hub admins should have the ability to allow such an "override" of default specifications. Our feeling is that for now we should keep it simple and NOT include an additional hub-specific config field about this, but that hub admins could use criteria about how samples are generated as eligibility criteria for inclusion in an ensemble. E.g. "to be eligible for inclusion in an ensemble, a model's submission in a given round must at least be joint across horizons, and could optionally be joint across locations." or something like that.
Seems reasonable in general, except that I think it would be quite reasonable for a hub to:
I could see myself being in camps 1 and 2 in future hub administrative efforts, so i would like to support people like future me
To add some examples, currently on US SMH, we are implementing a new format to track dependence or what we call "pairing" or "grouping" information.
We have a wiki with the submission file format documentation, here
We store the expected minimal level of dependence expected by the hub in the tasks.json, for example:
```json
"output_type": {
  "sample": {
    "output_type_id": {
      "min_samples_per_task": 100,
      "max_samples_per_task": 100,
      "samples_joint_across": ["horizon", "age_group"]
    },
    "value": {
      "type": "double",
      "minimum": 0
    }
  }
},
```
During submission, the validation automatically checks the minimal level of dependence, but it does not register whether a team adds another level: for example, if their submission is joint across `horizon`, `age_group`, and `location`, the submission will be accepted; but if it's joint only across `horizon`, it will not be accepted.
We are currently not asking teams to provide their "grouping" information in the metadata; it's relatively easy to extract from the submission file. However, I can see that some hubs might want to have that information provided by the teams, especially with a lot of teams participating and weekly submissions.
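As an illustration of why that extraction is easy, the dependence structure a submission actually encodes can be read off by checking which task id variables vary within each sample index. A sketch in Python/pandas (`infer_joint_across` is a hypothetical helper, not SMH or hubverse code):

```python
import pandas as pd

def infer_joint_across(df, task_id_cols):
    """Infer which task ids a submission's samples are joint across.

    A task id is "joint across" if at least one sample index spans
    more than one of its values.
    """
    spans = df.groupby("output_type_id")[task_id_cols].nunique()
    return [c for c in task_id_cols if (spans[c] > 1).any()]

# Trajectories: each sample index spans both horizons within one location.
df = pd.DataFrame({
    "location": ["MA", "MA", "CA", "CA"],
    "horizon": [1, 2, 1, 2],
    "output_type_id": [1, 1, 2, 2],
})
print(infer_joint_across(df, ["location", "horizon"]))  # ['horizon']
```

A hub could compare this inferred set against its configured minimum to implement the "at least this joint" acceptance rule described above.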
In response to @elray1 's comment:
> I could see myself being in camps 1 and 2 in future hub administrative efforts, so i would like to support people like future me
I agree that a hub might want to do this, but wouldn't a simpler development approach (at least for now) be to have hubs do this kind of validation on their own? E.g., from @LucieContamin's comment, it sounds like SMH is adding some custom validations (which could potentially be folded into future central hubverse efforts).
So the process/documentation for hubverse tooling would remain, in the near term, that hubs specify the expected dependence structure via `samples_joint_across`.
Probably worth a weigh-in from Anna, but I think the validation step in item 3 on this list might be among the easier things to implement in this whole set up, and not particularly more challenging or involved than item 2. E.g., I could imagine addressing 2 and 3 simultaneously by introducing a pair of booleans along the lines of `tasks[[round_id]][[task_group_index]].sample_dependence_subsets_allowed` and `tasks[[round_id]][[task_group_index]].sample_dependence_supersets_allowed`, which if set to `false` trigger checks that the model's specified dependence structure in a given round is not a strict sub/superset of the hub's specified dependence structure, respectively. If it's that easy, I don't really see a reason to hold off on support for this, particularly given that we have an existing hub where this is the desired behavior. Of course we can/should split this into a separate issue/to-do when the time comes and prioritize it as appropriate; it just doesn't seem particularly more cumbersome or challenging to address to me. If I'm wrong about this, happy to defer or delay.
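One possible reading of those two booleans, sketched below; the field names `sample_dependence_subsets_allowed`/`sample_dependence_supersets_allowed` come from the comment above and are proposals, not current hubverse config options:

```python
# Hypothetical sketch: reject a model's dependence structure only when it is
# a *strict* sub/superset of the hub's and the corresponding flag is false.
def check_model_dependence(model_set, hub_set,
                           subsets_allowed=True, supersets_allowed=True):
    model_set, hub_set = set(model_set), set(hub_set)
    if not subsets_allowed and model_set < hub_set:
        return False  # strict subset of the hub's dependence structure
    if not supersets_allowed and model_set > hub_set:
        return False  # strict superset of the hub's dependence structure
    return True

assert check_model_dependence(["horizon"], ["horizon", "location"])
assert not check_model_dependence(["horizon"], ["horizon", "location"],
                                  subsets_allowed=False)
assert not check_model_dependence(["horizon", "location", "age_group"],
                                  ["horizon", "location"],
                                  supersets_allowed=False)
```

Note that an exactly-matching set passes regardless of the flags, since only strict containment is penalized.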
That is a fair point. I'm also concerned about conceptual complexity for new/existing/future users in addition to the complexity of implementation and/or maintenance and support of new metadata fields. But I agree that a compelling case to build it is if a current hub could use it and needs it now.
concern about conceptual clarity is very legitimate.
my take, though, is that implementation, maintenance, and user onboarding will all be smoother if we decide what we want once and do it up front. so i would vote to either:
Just adding another minor reflection from the SMH experience. Communicating and implementing the plan to collect dependence structure information was a real challenge. Defining and standardizing "dependence" across different modeling frameworks was not as straightforward as it originally seemed.
Overall, I'd say it has turned out to be a fruitful endeavor, and I can see clear reasons to support it more generally. And perhaps, the more we support/invest in this, the more standard it will become. But at the same time, I agree with @nickreich's concern about introducing additional conceptual complexity and tackling the associated communication challenges, especially when it's not clear how many hubs would use this functionality.
That is a reasonable point, especially since the way SMH implemented it (if I understand correctly, based on the link Lucie provided above), the dependence ended up being specified in additional non-hubverse-compliant columns and in ways that talked about "grouping by" rather than "joint across" which are related but kind of complementary concepts...
I'm thinking it might help also if we get specific about what we're trying to do here and consider creating more/separate fields specific to those purposes. It seems like we have thought of 2 kinds of validations we might like to do on model output submissions:
Additionally, Nick has brought up a third thing that could be nice -- we might like to:
Thinking about 1 and 3, I wonder if we could capture the ideas more clearly with `tasks.json` fields like:

- `samples_joint_across_default`
- `samples_joint_across_minimal`

Note, I'm omitting a field like `samples_joint_across_maximal` because it seems like in general a hub would not want to reject samples that met their minimal dependence needs and also captured dependence across other things like locations or age groups.
For example, a hub that prefers modelers to submit trajectories, joint across `target_date`, but will take any samples they can get, might specify:

- `samples_joint_across_default: ["target_date"]`
- `samples_joint_across_minimal: []` (or `null`, or whatever convention is appropriate)

Then our validation process could involve checks like this:

1. Did the modeler provide `samples_joint_across` in their model metadata file? If so, verify that what they put there contains everything in `samples_joint_across_minimal`.
2. Determine the `samples_joint_across` specification given by the model metadata file if provided there, or the hub/round/target default if nothing was provided by the modeler. Check that the sample indexing in the submission file is consistent with that.

Noting that the above are not actually the checks that were done by SMH. My understanding is that they:
If we just went that way, which is simpler, we would not bother with allowing models to put stuff about `samples_joint_across` in their model metadata files, and we would not need `samples_joint_across_default`. In the SMH example, all that would have been needed was a setting like `samples_joint_across_minimal: ["horizon", "age_group"]`.
Then a hub like the one Nick mentioned, that would really like samples to be joint across `target_date` but will take what they can get, would just specify `samples_joint_across_minimal: []` in their hub config files and would write some human-readable note to their contributing teams about how they really hope teams will submit trajectories.
What we would lose with this approach is the ability to check that an output submission with sample indexing implying dependence is captured across dates, locations, and age groups was set up correctly by the contributing team.
After spending some time iterating with @bsweger on this for thinking about setting up a variant hub, and in discussion with @nikosbosse about related ideas in scoringutils, I wonder if maybe introducing different terminology/concepts might be useful. In scoringutils, they have the concept of a "forecast unit", which is equivalent to the hubverse idea of a unique combination of "task id variables". That is, the rows that share one unique set of task id variables define a single "forecast unit". And that forecast unit can have a single observed value for its target data.
I find the concept of "joint across" to be jargony, and actually a non-trivial statistical concept (i.e., well beyond intro stats). So I've been trying to think about ways that we could represent and document these ideas in ways that both speak to a statistically literate modeling audience and a data-literate but "stats-naive" hub admin/dev audience.
As a concrete example, here is a table showing 3 separate forecast units, where the task id variables are "origin_date", "horizon", and "location". There are 9 rows, as for each of the 3 forecast units we have three samples. (I've left the data values as just "-" for now.)
origin_date | horizon | location | output_type | output_type_id | value |
---|---|---|---|---|---|
2024-03-15 | -1 | MA | sample | 1 | - |
2024-03-15 | 0 | MA | sample | 2 | - |
2024-03-15 | 1 | MA | sample | 3 | - |
2024-03-15 | -1 | MA | sample | 4 | - |
2024-03-15 | 0 | MA | sample | 5 | - |
2024-03-15 | 1 | MA | sample | 6 | - |
2024-03-15 | -1 | MA | sample | 7 | - |
2024-03-15 | 0 | MA | sample | 8 | - |
2024-03-15 | 1 | MA | sample | 9 | - |
In any sample data, you could never have two rows of data with the same output_type_id that share the same set of values for all task-id variables. In the table above, each of the samples is independently drawn, meaning that the samples are "joint across" nothing, or `[]`.
I think a fundamental idea at the core of this discussion about samples is that we are changing the forecast unit. That is to say, that now we are saying that a forecast unit could be comprised of multiple unique combinations of task-id variables. For example, in another version of the table above, we might say that for a given location and origin_date, one sample corresponds to a grouping of three unique task-ids that share the same location and origin-date but not the same horizon. So in this case, there are only three "forecast units" or independent samples, not nine as there were above.
origin_date | horizon | location | output_type | output_type_id | value |
---|---|---|---|---|---|
2024-03-15 | -1 | MA | sample | 1 | - |
2024-03-15 | 0 | MA | sample | 1 | - |
2024-03-15 | 1 | MA | sample | 1 | - |
2024-03-15 | -1 | MA | sample | 2 | - |
2024-03-15 | 0 | MA | sample | 2 | - |
2024-03-15 | 1 | MA | sample | 2 | - |
2024-03-15 | -1 | MA | sample | 3 | - |
2024-03-15 | 0 | MA | sample | 3 | - |
2024-03-15 | 1 | MA | sample | 3 | - |
In this case, the samples are "joint across horizon". Note that this does not violate the above rule: no two rows with the same `output_type_id` share exactly the same values for task-id variables. Instead, the `output_type_id` values group the rows into sets of task-id variables that are "connected" by the model, or that the model is "joint across".
Here is another example with two locations where data are joint across horizon:
origin_date | horizon | location | output_type | output_type_id | value |
---|---|---|---|---|---|
2024-03-15 | -1 | MA | sample | 1 | - |
2024-03-15 | 0 | MA | sample | 1 | - |
2024-03-15 | 1 | MA | sample | 1 | - |
2024-03-15 | -1 | MA | sample | 2 | - |
2024-03-15 | 0 | MA | sample | 2 | - |
2024-03-15 | 1 | MA | sample | 2 | - |
2024-03-15 | -1 | TX | sample | 3 | - |
2024-03-15 | 0 | TX | sample | 3 | - |
2024-03-15 | 1 | TX | sample | 3 | - |
2024-03-15 | -1 | TX | sample | 4 | - |
2024-03-15 | 0 | TX | sample | 4 | - |
2024-03-15 | 1 | TX | sample | 4 | - |
And another where they are joint across horizon AND location. This means that when `output_type_id == 1`, all six of those rows are taken from the same model realization, where three steps ahead in both MA and TX share some information within the model:
origin_date | horizon | location | output_type | output_type_id | value |
---|---|---|---|---|---|
2024-03-15 | -1 | MA | sample | 1 | - |
2024-03-15 | 0 | MA | sample | 1 | - |
2024-03-15 | 1 | MA | sample | 1 | - |
2024-03-15 | -1 | MA | sample | 2 | - |
2024-03-15 | 0 | MA | sample | 2 | - |
2024-03-15 | 1 | MA | sample | 2 | - |
2024-03-15 | -1 | TX | sample | 1 | - |
2024-03-15 | 0 | TX | sample | 1 | - |
2024-03-15 | 1 | TX | sample | 1 | - |
2024-03-15 | -1 | TX | sample | 2 | - |
2024-03-15 | 0 | TX | sample | 2 | - |
2024-03-15 | 1 | TX | sample | 2 | - |
Maybe another way to frame this in terms of validation, or how to reverse engineer the above datasets, is that when something is specified as "joint across variables x and y", that means that for all rows that share an output_type_id, those rows are expected to have multiple unique values of x and y (maybe all unique values of x and y present in any "sample" row of the submitted data?). I've actually been struggling to come up with a specific set of validations for these data.
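One candidate formalization, as a sketch under the stated assumption that "joint across x and y" means the non-joint task id variables are constant within each sample index and the joint variables' combinations are never duplicated (function and variable names are illustrative, not hubverse API):

```python
# Sketch of one possible validation, assuming "joint across" means:
# within each output_type_id, non-joint task id variables are constant and
# the joint variables' combinations are all distinct.
from itertools import groupby

def check_joint_structure(rows, task_id_vars, joint_across):
    fixed = [v for v in task_id_vars if v not in joint_across]
    key = lambda r: r["output_type_id"]
    for _, grp in groupby(sorted(rows, key=key), key=key):
        grp = list(grp)
        for v in fixed:                      # must not vary within one sample
            if len({r[v] for r in grp}) > 1:
                return False
        combos = [tuple(r[v] for v in joint_across) for r in grp]
        if len(combos) != len(set(combos)):  # no duplicated task-id rows
            return False
    return True

# Nick's second table: 3 samples, each spanning horizons -1, 0, 1 for MA.
rows = [{"origin_date": "2024-03-15", "horizon": h, "location": "MA",
         "output_type_id": s} for s in (1, 2, 3) for h in (-1, 0, 1)]
assert check_joint_structure(rows, ["origin_date", "horizon", "location"],
                             ["horizon"])
assert not check_joint_structure(rows, ["origin_date", "horizon", "location"],
                                 [])
```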
So to return to nomenclature, I wonder if `sample_grouping` might be a clearer term than `samples_joint_across`, where the values of the accompanying array would stay the same? E.g., `sample_grouping` (or `sample_task_grouping`?) could be a field in `tasks.json` > `rounds` > `model_tasks` > `output_type` > `sample` > `output_type_id`, and `min_`/`max_samples_per_task_group` would replace the current `min_`/`max_samples_per_task`. So the `tasks.json` might look like this:
"output_type":{
"sample":{
"output_type_id":{
"min_samples_per_task_group": 100,
"max_samples_per_task_group": 100,
"sample_grouping_minimal": ["horizon"],
"sample_grouping_default": ["horizon", "location"]
},
"value":{
"type":"double",
"minimum":0
}
}
}
The above configuration would trigger the following check on any submitted files that are assumed to have the default setting: for all rows sharing the same `output_type_id`, the combination of values in the columns specified by `sample_grouping` must be unique (i.e. not duplicated) in those rows.

Hopefully this long treatise doesn't muddy the waters further.
I'm also wondering if this concept of "task grouping" might be related to other concepts in hubverse, like for compositional data, where we might want a "variant" task-id column to refer to a unique grouping of rows whose values we expect to sum to 1.
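The checks implied by the configuration above could be sketched roughly as follows (a hypothetical illustration; `sample_grouping_default` and the `*_per_task_group` fields are the proposed names from this thread, not current hubverse options):

```python
# Hypothetical sketch of the two checks implied by the proposed config:
# (1) within each output_type_id, grouping-column combos must be unique;
# (2) each task group has between min and max distinct sample indices.
from collections import defaultdict

def validate_sample_grouping(rows, task_id_vars, cfg):
    grouping = cfg["sample_grouping_default"]
    by_sample = defaultdict(list)
    for r in rows:
        by_sample[r["output_type_id"]].append(tuple(r[v] for v in grouping))
    if any(len(c) != len(set(c)) for c in by_sample.values()):
        return False  # duplicated grouping-column combo within a sample
    other = [v for v in task_id_vars if v not in grouping]
    ids_per_group = defaultdict(set)
    for r in rows:
        ids_per_group[tuple(r[v] for v in other)].add(r["output_type_id"])
    return all(cfg["min_samples_per_task_group"] <= len(s)
               <= cfg["max_samples_per_task_group"]
               for s in ids_per_group.values())

cfg = {"min_samples_per_task_group": 2, "max_samples_per_task_group": 2,
       "sample_grouping_default": ["horizon", "location"]}
# Joint across horizon AND location: sample indices 1 and 2 reused in MA and TX.
rows = [{"origin_date": "2024-03-15", "horizon": h, "location": loc,
         "output_type_id": s} for loc in ("MA", "TX") for s in (1, 2)
        for h in (-1, 0, 1)]
assert validate_sample_grouping(rows, ["origin_date", "horizon", "location"], cfg)
```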
@nick Thanks for calling out the "data-literate/stats-naive" category of hubverse user here.
In that vein, a follow-up question. When looking at your original example table (the first table in this comment, with 9 unique `output_type_id`s), how would someone know that it represents 3 samples? It seems like the sample number is a key component of understanding how `output_type_id` uniqueness works, but it's implicit rather than explicit.
Is that a correct characterization, or am I missing something obvious?
I agree with Nick's criticisms of the `joint_across` terminology.
When I started writing this comment I was on board with a "task grouping" terminology. But then I tried to use it to answer Becky's question just above, and realized that it actually impacts grouping (in order to split a df of model outputs into groups corresponding to a modeled observational unit) in exactly the opposite way that you might expect from that name -- it describes the variables that you would remove from a `group_by` operation in order to divide rows up into the observational units. In Nick's first example, with a separate `output_type_id` for each row, if you do `df |> group_by(origin_date, horizon, location, output_type, output_type_id)`, each group in the result will contain one sample for one observational unit. In Nick's second example, "one sample corresponds to a grouping of three unique task-ids that share the same location and origin-date but not the same horizon." That means that if we want a separate group for each sample of an observational unit, we should not group by the horizon: `df |> group_by(origin_date, location, output_type, output_type_id)`. Note that this is actually exactly what Nick's text described, "a grouping of three unique task-ids that share the same location and origin-date", i.e., group by location and origin date. So `"sample_grouping_minimal": ["horizon"]` might be a confusing term to use in this setting, because `"horizon"` is "the thing we have to leave out in order for what's left over to describe an observational unit"...
Unfortunately, I can't think of a better name immediately, though I will keep thinking about it... I do think it's reasonable to think of those samples as "going in one group" together; this just interacts oddly with the grouping operation you'd want to do with a data frame of model outputs... Maybe the problem is that the word "group" is too flexible -- group what, for what purpose?
r.e. "I'm also wondering if this concept of "task grouping" might be related to other concepts in hubverse, like for compositional data, where we might want a "variant" task-id column to refer to a unique grouping of rows whose values we expect to sum to 1." -- this makes sense, but also seems potentially complex. e.g., what if I collect trajectories for variant proportions? Then I have task groupings by variant (sum-to-1) and also by variant+target date (trajectories tracking variants). Is it more helpful than not to keep track of these things? Maybe. Do we need to keep track of those things? If so, are there any other alternatives for how we might do so?
(Becky, I think a more complete/direct response to your question might be that in the first example, if you group by all of the task id variables `origin_date`, `horizon` and `location`, there are 3 rows in each group.)
How would we know the number of samples in the last example of that comment?
Is it still 3?
Agree with @elray1 that the verb "group" is complicated here and may be too loaded to use.
Throwing some additional miscellaneous thoughts in here about appropriate terms:

- `sample_taskid_set`, with `min`/`max_samples_per_taskid_set`, `sample_taskid_set_minimal` and `sample_taskid_set_default` as the configuration fields?

To @bsweger 's question:
How would we know the number of samples in the last example...?

The number of samples could always be determined by the number of times that each unique combination of task-id variables appears in the submission. So for the last example it would actually just be 2 samples, since each task-id set (e.g., one set is `origin_date == "2024-03-15" & horizon == -1 & location == "MA"`) appears exactly twice in the provided data.
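That counting rule can be sketched directly (an illustrative stdlib-only sketch, with names that are assumptions rather than hubverse API):

```python
# Sketch of the counting rule described above: the number of samples is the
# number of times each unique task-id combination repeats in the submission.
from collections import Counter

def samples_per_task_id_set(rows, task_id_vars):
    counts = Counter(tuple(r[v] for v in task_id_vars) for r in rows)
    return set(counts.values())  # a valid file should yield a single number

# The "joint across horizon AND location" example: ids 1 and 2 reused in
# MA and TX, so each task-id set appears exactly twice -> 2 samples.
rows = [{"origin_date": "2024-03-15", "horizon": h, "location": loc,
         "output_type_id": s} for loc in ("MA", "TX") for s in (1, 2)
        for h in (-1, 0, 1)]
assert samples_per_task_id_set(rows, ["origin_date", "horizon", "location"]) == {2}
```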
I am a bit unclear on the meaning behind `samples_joint_across_default` and `samples_joint_across_minimal`. As I understand it, `default` is the expected format but the hub will accept other levels, down to `samples_joint_across_minimal` at minimum, is that correct? Then, in this case, why don't we just use one level of specification for the dependence?
For the terminologies, I agree too but I don't have any other ideas. I like the "set" idea.
r.e. the meaning behind `samples_joint_across_default`, I think this field is mainly/only useful in a hub that:
Chiming in with a +1 for `set`. That word is well understood by data practitioners beyond the realm of statistics.

The reason I keep asking about the number of samples... wouldn't you need to know that in order to reason about the contents of `output_type_id`?
sample_num | origin_date | horizon | location | output_type | output_type_id | value |
---|---|---|---|---|---|---|
1 | 2024-03-15 | -1 | MA | sample | 1 | - |
1 | 2024-03-15 | 0 | MA | sample | 1 | - |
1 | 2024-03-15 | 1 | MA | sample | 1 | - |
2 | 2024-03-15 | -1 | MA | sample | 2 | - |
2 | 2024-03-15 | 0 | MA | sample | 2 | - |
2 | 2024-03-15 | 1 | MA | sample | 2 | - |
1 | 2024-03-15 | -1 | TX | sample | 3 | - |
1 | 2024-03-15 | 0 | TX | sample | 3 | - |
1 | 2024-03-15 | 1 | TX | sample | 3 | - |
2 | 2024-03-15 | -1 | TX | sample | 4 | - |
2 | 2024-03-15 | 0 | TX | sample | 4 | - |
2 | 2024-03-15 | 1 | TX | sample | 4 | - |
If you add the sample number to @nickreich's "two locations where data are joint across horizon" example, you can see the info through the lens of relational database theory:

- forecast unit: `{'origin_date', 'horizon', 'location'}`
- compound forecast unit: `{'origin_date', 'location'}`

i.e., `joint_across` can be inferred as the set difference of the forecast unit columns and the compound forecast unit columns (`horizon` in this case). I think we can then say that `output_type_id` has a functional dependency on `sample_num` + all columns in the compound forecast unit.
I assume that we're not asking hub participants to submit a `sample_num` as part of their model-outputs, but is it correct to assume that it's something we'd need to derive as part of the validation process? I can't reason about these examples without adding it.
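Under the functional-dependency framing above, such a sample number could indeed be derived during validation rather than submitted. A hypothetical sketch (column names follow the examples in this thread; the ranking convention is an assumption):

```python
# Hypothetical sketch: derive a sample_num by ranking the distinct
# output_type_id values within each compound forecast unit.
from collections import defaultdict

def derive_sample_num(rows, compound_unit_vars):
    ids = defaultdict(set)
    for r in rows:
        ids[tuple(r[v] for v in compound_unit_vars)].add(r["output_type_id"])
    rank = {u: {i: n for n, i in enumerate(sorted(s), start=1)}
            for u, s in ids.items()}
    return [rank[tuple(r[v] for v in compound_unit_vars)][r["output_type_id"]]
            for r in rows]

# Becky's table: MA carries ids 1-2, TX carries ids 3-4; both map to 1, 2.
rows = [{"origin_date": "2024-03-15", "horizon": h, "location": loc,
         "output_type_id": s}
        for loc, first in (("MA", 1), ("TX", 3))
        for s in (first, first + 1) for h in (-1, 0, 1)]
assert derive_sample_num(rows, ["origin_date", "location"]) == \
    [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2]
```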
Thanks @elray1 for the additional information, sorry to insist but just to be sure I understand. The idea here is to use:

- `samples_joint_across_default`/`sample_grouping_default`: the "preferred" dependence structure, but others are accepted. If a team submits projections without "joint across/grouping" information in their metadata file, then this structure is expected and will be used for validation.
- `samples_joint_across_minimal`/`sample_grouping_minimal`: the minimal dependence structure accepted. If a hub accepts any level of dependence, it will be set to `null` or `[]` (whatever we decide).

So, in the case where both (default and minimal) are set to the same value, it will be expected that the submissions have at least the default "preferred" dependence structure. Is that correct?
In that case, how do we represent and validate a complex dependence structure? For example, one where the samples are grouped by `horizon` and a "sub-group" of `location` (group 1: MA, TX; group 2: CA, FL):

origin_date | horizon | location | output_type | output_type_id | value |
---|---|---|---|---|---|
2024-03-15 | -1 | MA | sample | 1 | - |
2024-03-15 | 0 | MA | sample | 1 | - |
2024-03-15 | -1 | MA | sample | 2 | - |
2024-03-15 | 0 | MA | sample | 2 | - |
2024-03-15 | -1 | TX | sample | 1 | - |
2024-03-15 | 0 | TX | sample | 1 | - |
2024-03-15 | -1 | TX | sample | 2 | - |
2024-03-15 | 0 | TX | sample | 2 | - |
2024-03-15 | -1 | CA | sample | 3 | - |
2024-03-15 | 0 | CA | sample | 3 | - |
2024-03-15 | -1 | CA | sample | 4 | - |
2024-03-15 | 0 | CA | sample | 4 | - |
2024-03-15 | -1 | FL | sample | 3 | - |
2024-03-15 | 0 | FL | sample | 3 | - |
2024-03-15 | -1 | FL | sample | 4 | - |
2024-03-15 | 0 | FL | sample | 4 | - |
Sorry again if I am missing anything.
@bsweger , I don't think we want to add a column for the `sample_num`. I don't know if it's already integrated in the validation or not, but in the SMH validation, we test the number of samples by calculating the number of repetitions of each unique set of projection units/task id columns. We force all the unique "sets" to have the same number of samples. For example, if a team provides 100 samples, each task id set is repeated 100 times. Does that answer your question?
r.e. Lucie's questions, my answers would be:
So, in the case where both (default and minimal) are set to the same value, then it will be expected that the submissions have at least the default "preferred" dependence structure. Is that correct?
Yes
does that mean we allow teams to have different "levels" of information in their metadata? And just a note: I think we should store that information per round too (in the metadata).
If I understand the question, yes - depending on whether they match the default or not, different teams might have different stuff that they need to put into their metadata files. This is a bit messy.
Agreed this information is per-round (and potentially even per-target-group). Is it reasonable to expect modelers to get this metadata into their model metadata files?? I'm not sure.
and does that mean that we are going to validate the team's dependence structure against either the default or what the team provides, and also that it contains the minimal dependence structure, not just that it contains the minimal expected dependence
Yes
last question, I am not sure we need/want to include this but just in case: how do we represent/validate a complex dependence structure? For example, we can imagine the following example, where the samples are grouped by horizon and a "sub-group" of location (group 1: MA, TX; group 2: CA, FL):
I see two options: accept the submission if the hub's `samples_joint_across` (whatever we end up calling it) is `"horizon"`, but not if it is `["horizon", "location"]`, because those samples are not joint across all locations. But probably a good idea to issue a warning or message in this kind of setting...

I'm not sure which of those two options I prefer; they both seem wrong.
I think another very reasonable direction to go could be more like "Idea 2" in my comment above, more like my understanding of what SMH is doing: forget about the "default" value for this, and forget about allowing modelers to submit per-round or per-target-group metadata about their dependence structure. Validate only that the dependence structure implied by their sample indices captures at least as much as the minimal requirements of the hub. This loses the ability to really carefully check that the modelers are doing indexing in a way that reflects what they want. But capturing the metadata required to enable that validation may be out of reach anyways, in terms of the complexity of what we'd be asking modelers to provide. And maybe instead of that formal validation we could just provide messages in the validation output describing the dependence structure we've identified based on the submitted sample indices.
I like this option ("Idea 2"). I had the same idea when we implemented it in SMH, to have the validation return the dependency structure information. I have not had the time to implement it yet.
However, as you say, Idea 2 has the limitation of not being able to really check the modelers' indexing or to force a dependence structure on all the submissions.
So maybe a hybrid of 1 and 2 might help:

- `samples_joint_across`/`sample_grouping`: the dependence structure accepted (e.g. `horizon`), and whether they do or do not accept additional "grouping"

I also like the idea of having the information about the dependence structure stored somewhere, because it's not easy to find that information just by looking at the file. So I think it's a good idea to store the information in the metadata: either at hub level, if we want only the minimal dependence structure information or only one accepted structure, or at team level if we want more information. If the team provides it, then we can use it for validation, I guess.
It does not solve the complex dependence structure issue; I prefer option 1 with a warning over option 2, but maybe other hubs will prefer to throw an error.
For anyone following this issue, I am working on writing up a complete proposal that tries to address the issues brought up here on this incredibly long thread...
I'm going to close this, as superseded by #70 .
There is some related discussion here: https://hubdocs.readthedocs.io/en/latest/format/model-output.html#formats-of-model-output
Reproducing the relevant part:
Should we provide a way for hubs to specify any desired dependence structure for sample outputs in their metadata?