hubverse-org / schemas

JSON schemas for modeling hubs

Creative Commons Zero v1.0 Universal

4 stars 2 forks source link

consider formalizing dependence structure for sample outputs #48

Closed elray1 closed 3 months ago

elray1 commented 1 year ago

There is some related discussion here: https://hubdocs.readthedocs.io/en/latest/format/model-output.html#formats-of-model-output

Reproducing the relevant part:

We emphasize that the mean, median, quantile, cdf, and pmf representations all summarize the marginal predictive distribution for a single combination of model task id variables. On the other hand, the sample representation may capture dependence across combinations of multiple model task id variables by recording samples from a joint predictive distribution. For example, suppose that the model task id variables are “forecast date”, “location” and “horizon”. A predictive mean will summarize the predictive distribution for a single combination of forecast date, location and horizon. On the other hand, there are several options for the distribution from which a sample might be drawn, capturing dependence across different levels of the task id variables, including:

the joint predictive distribution across all locations and horizons within each forecast date

the joint predictive distribution across all horizons within each forecast date and location

the joint predictive distribution across all locations within each forecast date and horizon

the marginal predictive distribution for each combination of forecast date, location, and horizon

Hubs should specify the collection of task id variables for which samples are expected to capture dependence; e.g., the first option listed above might specify that samples should be drawn from distributions that are “joint across” locations and horizons.

Should we provide a way for hubs to specify any desired dependence structure for sample outputs in their metadata?

elray1 commented 1 year ago

It has also been noted that within a hub, models might have different capacities for what dependence structure they are able to capture, so we should potentially allow for a per-model specification of this.

This might also differ across rounds.

annakrystalli commented 12 months ago

Quick Q @elray1 , this isn't something you want to think about for v2.0.0 right?

elray1 commented 12 months ago

yeah, i think we're too far from knowing exactly what we want to do here.

elray1 commented 12 months ago

It may be simplest to leave this to model submission files rather than trying to encode this in a separate per-model or per-round configuration file. We could say that within one model output submission file, i.e. per each submission round, any rows with output_type = "sample" and the same output_type_id, i.e. the same sample index, are assumed to be from a single draw from a common joint distribution.

For example, consider a forecast hub with the following task ids:

target: hosp or death
location: two locations, MA and CA
horizon: 1, 2 weeks

A model that estimates a single joint distribution for all horizons, both locations, and both targets might submit something like the following to represent a collection of 3 samples from that joint distribution:

model_id   target  location  horizon  output_type  output_type_id  value
all_joint  hosp    MA        1        sample       1               3
all_joint  hosp    MA        1        sample       2               4
all_joint  hosp    MA        1        sample       3               5
all_joint  hosp    MA        2        sample       1               3
all_joint  hosp    MA        2        sample       2               4
all_joint  hosp    MA        2        sample       3               5
all_joint  hosp    CA        1        sample       1               3
all_joint  hosp    CA        1        sample       2               4
all_joint  hosp    CA        1        sample       3               5
all_joint  hosp    CA        2        sample       1               3
all_joint  hosp    CA        2        sample       2               4
all_joint  hosp    CA        2        sample       3               5
all_joint  death   MA        1        sample       1               3
all_joint  death   MA        1        sample       2               4
all_joint  death   MA        1        sample       3               5
all_joint  death   MA        2        sample       1               3
all_joint  death   MA        2        sample       2               4
all_joint  death   MA        2        sample       3               5
all_joint  death   CA        1        sample       1               3
all_joint  death   CA        1        sample       2               4
all_joint  death   CA        1        sample       3               5
all_joint  death   CA        2        sample       1               3
all_joint  death   CA        2        sample       2               4
all_joint  death   CA        2        sample       3               5

In this submission, all of the values in rows with the output_type_id value of 1 should be regarded as being a vector that is a single draw from the joint predictive distribution across target, location, and horizon.

On the other hand, a model that obtains separate fits for each target and location, but obtains sample trajectories across the horizons would have a submission with distinct output_type_ids for the different target/location combinations, but that are shared across the different horizons:

model_id       target  location  horizon  output_type  output_type_id  value
horizon_joint  hosp    MA        1        sample       1               3
horizon_joint  hosp    MA        1        sample       2               4
horizon_joint  hosp    MA        1        sample       3               5
horizon_joint  hosp    MA        2        sample       1               3
horizon_joint  hosp    MA        2        sample       2               4
horizon_joint  hosp    MA        2        sample       3               5
horizon_joint  hosp    CA        1        sample       4               3
horizon_joint  hosp    CA        1        sample       5               4
horizon_joint  hosp    CA        1        sample       6               5
horizon_joint  hosp    CA        2        sample       4               3
horizon_joint  hosp    CA        2        sample       5               4
horizon_joint  hosp    CA        2        sample       6               5
horizon_joint  death   MA        1        sample       7               3
horizon_joint  death   MA        1        sample       8               4
horizon_joint  death   MA        1        sample       9               5
horizon_joint  death   MA        2        sample       7               3
horizon_joint  death   MA        2        sample       8               4
horizon_joint  death   MA        2        sample       9               5
horizon_joint  death   CA        1        sample       10              3
horizon_joint  death   CA        1        sample       11              4
horizon_joint  death   CA        1        sample       12              5
horizon_joint  death   CA        2        sample       10              3
horizon_joint  death   CA        2        sample       11              4
horizon_joint  death   CA        2        sample       12              5

In this submission, the two values in rows with the output_type_id value of 1 should be regarded as being a vector that is a single draw from the joint distribution across horizons 1 and 2 that is specific to the hosp target for location MA.

Finally, a model that obtains separate marginal distributions for each target, location, and horizon would use a distinct output_type_id in every row of its submission to indicate that the samples are all different draws from different distributions.

model_id      target  location  horizon  output_type  output_type_id  value
all_separate  hosp    MA        1        sample       1               3
all_separate  hosp    MA        1        sample       2               4
all_separate  hosp    MA        1        sample       3               5
all_separate  hosp    MA        2        sample       4               3
all_separate  hosp    MA        2        sample       5               4
all_separate  hosp    MA        2        sample       6               5
all_separate  hosp    CA        1        sample       7               3
all_separate  hosp    CA        1        sample       8               4
all_separate  hosp    CA        1        sample       9               5
all_separate  hosp    CA        2        sample       10              3
all_separate  hosp    CA        2        sample       11              4
all_separate  hosp    CA        2        sample       12              5
all_separate  death   MA        1        sample       13              3
all_separate  death   MA        1        sample       14              4
all_separate  death   MA        1        sample       15              5
all_separate  death   MA        2        sample       16              3
all_separate  death   MA        2        sample       17              4
all_separate  death   MA        2        sample       18              5
all_separate  death   CA        1        sample       19              3
all_separate  death   CA        1        sample       20              4
all_separate  death   CA        1        sample       21              5
all_separate  death   CA        2        sample       22              3
all_separate  death   CA        2        sample       23              4
all_separate  death   CA        2        sample       24              5

The advantage of this proposal is that it has a clean and interpretable data structure. A disadvantage is that it seems like it might be prone to user errors by submitting teams, and there is essentially no way to validate that submissions have the output_type_id values, i.e. sample indexes, that accurately capture the dependence structure that is represented by their model.

If we go with this proposal, I think there would be no changes to make to the schema here, and so this would essentially become an issue for hubDocs.

nickreich commented 12 months ago

I like the proposal. seems conceptually sound and I like that it doesn't add anything new to the schemas.

elray1 commented 12 months ago

After thinking about this a little more, I think maybe we actually should consider making a modification to the tasks schema and implement validations slightly differently for sample output types than for the other output types.

As a motivating example, consider a hub that wants to collect up to 100 samples for a single target in each combination of 56 locations and 4 horizons. Let's also say that this hub does not want to force any particular dependence structure on the submitting models. I think this is a pretty minimal example of what this might look like for a flusight forecast hub or a covid hub reboot.

I'll first describe what a naive approach to setting this up might look like working with our existing set up, note that I think there a couple problems, and then describe my suggested approach.

1. naive approach, existing tasks.json ideas

Working within our current set up for tasks.json, the hub needs to specify a list of all possible/accepted sample indices as optional values for the output_type_id for this target. As we saw in the third example above, a model that produces separate marginal distributions at each combination of location and horizon (such as UMass-gbq) would need to have a different sample index for each of the 56*4*100 = 22400 samples it submits. So the hub needs to list a vector of integers from 1 to 22400 in its tasks.json file:

...
            "model_tasks": [{
                "task_ids": {
                    "origin_date": {
                        "required": null,
                        "optional": ["2022-11-28", "2022-12-05", "2022-12-12"]
                    },
                    "target": {
                        "required": ["inc covid hosp"],
                        "optional": null
                    },
                    "horizon": {
                        "required": null,
                        "optional": [1, 2, 3, 4]
                    },
                    "location": {
                        "required": null,
                        "optional": ["US", "01", ..., "78"]
                    }
                },
                "output_type": {
                    "sample": {
                        "output_type_id": {
                            "required": null,
                            "optional": [1, 2, 3, 4, 5, ..., 22399, 22400]
                        },
                        "value": {
                            "type": "integer",
                            "minimum": 0
                        }
                    }
                },
...

There are a couple of problems here:

We really don't want to make a hub write out a vector of 22400 optional values for the sample index. This could maybe be addressed by allowing for alternate representations of ranges of integers as suggested in "idea 1" in issue #16.
More seriously, we have lost the information that this hub wanted to accept up to 100 samples per task. This hub now accepts up to 22400 samples per task if they come from a single joint distribution across all locations and horizons (and fewer samples per task if the model only captures dependence across locations or horizons but not both, or estimates marginal distributions only)!
I think this pretty much makes an approach to validation by explicitly computing an expand_grid across optional values of task id variables and output_type_ids infeasible. That expand grid now has 56*4*22400=5017600 rows, which in a formal sense are the optionally acceptable rows in a submission, but don't really meaningfully capture what might go into a valid submission as noted in point 2.

2. alternative proposal

The best way around this that I have thought of so far is to introduce some specifications of expected output_type_ids that are specialized for the sample output_type, where instead of listing required and optional values for the output_type_id, we specify how many samples are expected for each modeling task. I'm not exactly sure of the best way to set this up, but a couple of rough ideas are in subsections below. Note that this would require special validation logic for sample output types that is different from the validation logic for other output types. In particular, I still don't think we would want to use explicit expand_grid kind of logic.

2a. using `"required"` and `"optional"` entries similar to existing specifications

Here's what this might look like for a hub that wanted to accept up to 100 samples:

...
                "output_type": {
                    "sample": {
                        "output_type_id": {
                            "required": null,
                            "optional": {n_samples_per_task: 100}
                        },
                        "value": {
                            "type": "integer",
                            "minimum": 0
                        }
                    }
                },
...

Maybe if the hub wanted to require submission of at least 100 samples per task and up to 1000 samples per task, they would have the following?

...
                "output_type": {
                    "sample": {
                        "output_type_id": {
                            "required": {n_samples_per_task: 100},
                            "optional": {n_samples_per_task: 1000}
                        },
                        "value": {
                            "type": "integer",
                            "minimum": 0
                        }
                    }
                },
...

this kind of fits with our current set up but it doesn't feel very clear to me.

2b. more directly listing a min and max number of samples

Here's an idea for what this might look like for a hub accepting up to 100 samples:

...
                "output_type": {
                    "sample": {
                        "output_type_id": {
                            "min_samples_per_task": 0,
                            "max_samples_per_task": 100
                        },
                        "value": {
                            "type": "integer",
                            "minimum": 0
                        }
                    }
                },
...

The idea here is that by setting "min_samples_per_task": 0, we make submission of samples optional.

And for a hub accepting between 100 and 1000 samples per task:

...
                "output_type": {
                    "sample": {
                        "output_type_id": {
                            "min_samples_per_task": 100,
                            "max_samples_per_task": 1000
                        },
                        "value": {
                            "type": "integer",
                            "minimum": 0
                        }
                    }
                },
...

If a hub wanted to require submission of exactly 100 samples, they could set both min and max to 100.

elray1 commented 11 months ago

Attempting to summarize the main points from in-person discussion:

Challenges

Do we really want to allow model output submissions with samples that capture different dependence structures?
1. Note that for every other output type (quantiles, pmfs, means), a hub is very clear that values are summaries of a marginal distribution per task. Having a mix of different estimated distributions in the same output type seems potentially problematic.
2. On the other hand, in previous discussions some hubs have said that they would like to allow for sample submissions even from models that only have the capability to estimate marginal distributions. These hubs want to be able to deal with samples, but also have only a limited number of contributing teams and don't want submission requirements to eliminate some teams. For example, it may be useful to collect samples from teams that are only capable of estimating marginal distributions even if the ultimate goal is to get to sample trajectories -- the hub could take on the task of doing some postprocessing of samples to try to get to sample trajectories.
The use of the sample index recorded in output_type_id to specify dependence structure is potentially confusing, and we think it is likely that some participating teams may not understand or correctly implement the sample indexing that matches the dependence structure that is actually captured by their model. We would like to be able to validate that sample indexing in submission is correct.
Do we really need sample indexes to be integers? More generally, they could be strings.
- ELR's response: personally, imposing a requirement that sample indices are integers doesn't seem too onerous, but I agree that this does not seem strictly necessary.

Ideas for solutions

Theme 1: specifying dependence structures in config files

High level, two ideas here depending on whether or not dependence structure is enforced at the hub level.

If a hub does want to enforce a required dependence structure, we could set up a mechanism for a hub to specify that, perhaps in its tasks.json config file on a per-round basis. It might be up to hub administrators whether or not they want to do this.
- Some work needs to be done to specify exactly how to set this up, but maybe something like 'samples_joint_across': ['horizon'] to say that samples should be in the form of "sample trajectories" from a joint distribution across horizons, or 'samples_joint_across': ['horizon', 'location'] to say that samples should be from a single joint distribution for all horizons and locations. The values in this vector would need to be task id variable names from that round. It's not immediately clear to me whether this would be specified at the round level, or would need to be specified within each "task id block" (i.e., each entry in the task ids array) defining a group of tasks with the same sets of possible values of all task id variables? On one hand, we know that within each round, the same task id variables are present. On the other hand, samples might be accepted model outputs only for some of the task id blocks. But it would seem to be very confusing to potentially allow for different dependence structures for different sets of targets...
Similarly, if a hub wants to allow models to use different dependence structures (again potentially with different specifications on a per-round basis), we could allow a model to specify this in their model metadata file.

With either of the above, in principle we should then be able to validate that the sample indices in a model output submission file correctly match up with the joint dependence structure that is either specified by the hub or the model's config file.

Theme 2: distinguishing between types of dependence captured by samples using different output types

The idea here is to introduce different output types that are specific to different levels of dependence that are captured in samples.

A first output type would be for samples that are from a marginal distribution per combination of task id values. We might call this marginal_sample, sample_marginal, or just sample. Note that here, the value in the output_type_id column would essentially be ignorable, and we might as well set it to NA.
A second output type or possibly multiple different output types would be for samples that are from a joint distribution across multiple values of one or more task id variables. Two possible visions here:
1. A single output type called something like joint_sample, that is used for any sample that is not from a marginal distribution. We would probably still need the stuff in the metadata config files described under the previous theme to further specify what kind of dependence structure is captured.
2. Output types that are specific to the kind of dependence structure a hub wants to capture. For example, we believe that the most common use case for samples is sample trajectories that can be thought of as draws from a joint distribution across step-ahead forecast horizons. We might introduce a new output type for this kind of sample, called trajectory_sample or sample_trajectory. This would avoid the need to specify dependence structures in metadata files, but has the disadvantage that we would need to introduce new output types for each new type of dependence structure that a hub wants to collect.

Other notes

I don't remember hearing any objections to the general idea of using sample indices to record which rows in a submission file "go together" in a single draw from a joint distribution.
It seems like there was generally support for the idea of specifying a minimum and maximum number of allowed samples per task, as in option 2b in my previous comment.

sbfnk commented 11 months ago

Just a small comment that I don't think we need to force the sample id to be numeric as we don't expect to perform any numeric operation on them - it may be in many cases be numeric for convenience but I think should be allowed to be any string that links rows in the table.

elray1 commented 11 months ago

follow-up questions/thoughts about the data type for sample ids:

I agree with Seb's point that we don't expect to perform numeric operations on sample ids
I think that in theory, potential advantages of using integers may include reduced storage size or faster operations like search, filter, group_by, and checking the number of unique ids -- though in our situation any differences by data type are likely to be small? additionally, not clear to me whether those gains would be there at all, since other values in this column may often be strings, so even if the fundamental data type for sample ids was integer, we might often still have to parse from strings?
I am having a hard time imagining a situation where it's really important, or even desirable, to allow to store some information in the contents of a string sample id?
overall, I don't see a very compelling case one way or another. i think I would vote to do whatever is easier

elray1 commented 11 months ago

Trying to get a little more concrete about what it would look like to specify dependence structures in config files, to support discussion/decision-making about whether or not we want to go that route. From comments above, goals are to allow to specify this:

on a per-round basis
either at the hub level for all submissions, or per model/round submission, up to the hub what they want to allow

I'll illustrate with an example hub that wants to collect up to 100 samples for two targets (hospitalizations, deaths) in each combination of 56 locations and 4 horizons. An excerpt of this hub's tasks.json config file might look like this:

...
    "rounds": [{
        "round_id_from_variable": true,
        "round_id": "origin_date",
        "model_tasks": [{
            "task_ids": {
                "origin_date": {
                    "required": null,
                    "optional": ["2022-11-28", "2022-12-05", "2022-12-12"]
                },
                "target": {
                    "required": ["hosps", "deaths"],
                    "optional": null
                },
                "horizon": {
                    "required": null,
                    "optional": [1, 2, 3, 4]
                },
                "location": {
                    "required": null,
                    "optional": ["US", "01", ..., "78"]
                }
            },
            "output_type": {
                "sample": {
                    "output_type_id": {
                        "min_samples_per_task": 1,
                        "max_samples_per_task": 100
                    },
                    "value": {
                        "type": "integer",
                        "minimum": 0
                    }
                }
            }
        }]
    }],
...

hub-level dependence specification

The proposal is to introduce a new field that is an attribute of the round object, at the same level as "round_id" and "model_tasks", that says what the expected dependence structure is for sample submissions for that round. This has the form of an array of names of task id variables, specifying that models should produce samples from a single joint distribution across all values of those task ids. For instance, to specify that samples are expected to be "trajectories" across forecast horizons, the config would look like:

    "rounds": [{
        "round_id_from_variable": true,
        "round_id": "origin_date",
        "model_tasks": [{
            ...
        }],
        "samples_joint_across": ["horizon"]
    }],
...

To illustrate a less common use case -- if samples are expected to be from a joint distribution across all horizons and locations (e.g. to support aggregation from the state level to the regional level), the configuration would look like:

    "rounds": [{
        "round_id_from_variable": true,
        "round_id": "origin_date",
        "model_tasks": [{
            ...
        }],
        "samples_joint_across": ["horizon", "location"]
    }],

A few notes:

I'm open to names other than samples_joint_across.
Maybe if a hub wants to specify that samples should be regarded as draws from independent marginal distributions, they should specify "samples_joint_across": []? And if they don't want to enforce any particular dependence structure for samples, they could set "samples_joint_across": null?
If a hub does not collect samples, they should not have to specify a samples_joint_across field in their round config. If a hub does collect samples in a given round, should this field be required, or we just infer a missing samples_joint_across field to be null?
These settings could be specified on a per-round basis, to allow for different expectations for joint modeling in different rounds.
The decision to encode this at the level of a round means that within a single round, the same dependence structure has to be used for all levels of the task id variables. For example, within a single round, a hub can't ask for sample trajectories ("samples_joint_across": ["horizon"]) for the hosps target but independent samples ("samples_joint_across": []) for the deaths target. I think that this is not a major limitation and it could potentially be confusing to allow for more flexiblity.

model-submission level dependence specification

According to the proposal above, if a hub wants to allow models to submit samples without specifying the desired dependence structure, they could specify "samples_joint_across": null. In that case, we might like to have a way to validate that the sample indices in a model output submission accurately represent the dependence structure that a model claims to capture. To enable that validation, we would need to collect some metadata from the model saying what joint distribution(s) its samples come from. One possibility is that we could collect this in the model's metadata file.

Here's an example of what this might look like (not sure i'm getting my yaml set up right):

team1-modela.yml:

model_id: team1-modela
model_name: Example model
samples_joint_across: horizon, location

...or if this is round-specific:

model_id: team1-modela
model_name: Example model
samples_joint_across:
    round1: horizon
    round2: horizon
    round3: horizon, location

I don't love the idea of asking teams to update their model metadata file each round with this kind of round-specific information -- we've previously said that any round-specific information would be in their free-form abstract. But I'm not sure there's a way around this if we want to be able to validate correctness of the sample indices in model output submissions in cases where the hub does not specify what dependence structure to use.

elray1 commented 11 months ago

suggestion to include this as an attribute of the sample

            "output_type": {
                "sample": {
                    "output_type_id": {
                        "min_samples_per_task": 1,
                        "max_samples_per_task": 100
                        "samples_joint_across": ["horizon"],
                    },
                    "value": {
                        "type": "integer",
                        "minimum": 0
                    }
                }
            }

elray1 commented 11 months ago

Summing up discussion and decisions from today's call:

r.e. data type:
- Hubs might want some flexibility about the data type here. Note that other output types like quantile or pmf may have numeric or string output types. It might be convenient for a hub to be able to specify that the sample index is numeric, so that if they collect quantiles and samples, after reading in model outputs, the output_type_id column is all numeric. On the other hand, a string data type is a reasonable option too.
- Therefore: eventual goal is to allow hubs to specify whether they want to allow int or string. To support this, we would need to add a “type” property and a “max_length” property to output_type_id.
- Proposal: support only integer sample ids to start, as it is simpler, but eventually add support for strings as well. See separate issue for this at #61.
- Note: hubUtils::connect_hub has an argument to control data type for this column anyways (though for purposes of validating submissions we’d still want to be clear about types)
r.e. the more general problem of encoding sample dependence structure:
- we will include only one output_type, "sample". We will not introduce other output types like "sample_trajectory".
- we will basically use the proposal in this comment, refined in this comment to add samples_joint_across as a property of output_type_id. we will also add support for models to specify sample dependence structure in their model metadata YAML file. note that models should only be providing this metadata if the dependence structure is not specified by a hub.
- we will therefore allow for the possibility that a hub might want different dependence structures for different task id variables. we could consider validating that the same dependence structure is specified in all task id groups within each round as a check, since we think hubs will most often want that. but maybe this should generate a warning or message rather than an error, in case a hub does actually want this for some reason?

LucieContamin commented 11 months ago

we will also add support for models to specify sample dependence structure in their model metadata YAML file. note that models should only be providing this metadata if the dependence structure is not specified by a hub.

I have a small question about that part, how would you enter the information for a model without sample dependence structure (each id is unique in every row), should we use NA or NULL? I mean something like:

model_id: team1-modela
model_name: Example model
samples_joint_across:
    round1: horizon
    round2: horizon, location
    round3: NA

elray1 commented 11 months ago

it seems like it may be possible to use [] to specify an empty array? https://stackoverflow.com/questions/5110313/how-do-i-create-an-empty-array-in-yaml

LucieContamin commented 11 months ago

Thanks! That makes sense and would match the json file format.

eahowerton commented 10 months ago

We have been discussing implementing this sample framework for an upcoming SMH round, and a few questions/comments have arisen through these discussions that I think might be useful to consider. Here's my attempt to summarize:

If I understand correctly, we are proposing a unique output_type_id for each independent sample. However, from the perspective of the teams, this may be more complicated to implement. Imagine a team makes predictions for each location and target independently. Then, under our current framework, the output should look like this (following the example above from @elray1).

model_id       target  location  horizon  output_type  output_type_id  value
horizon_joint  hosp    MA        1        sample       1               3
horizon_joint  hosp    MA        1        sample       2               4
horizon_joint  hosp    MA        1        sample       3               5
horizon_joint  hosp    MA        2        sample       1               3
horizon_joint  hosp    MA        2        sample       2               4
horizon_joint  hosp    MA        2        sample       3               5
horizon_joint  hosp    CA        1        sample       4               3
horizon_joint  hosp    CA        1        sample       5               4
horizon_joint  hosp    CA        1        sample       6               5
horizon_joint  hosp    CA        2        sample       4               3
horizon_joint  hosp    CA        2        sample       5               4
horizon_joint  hosp    CA        2        sample       6               5
horizon_joint  death   MA        1        sample       7               3
horizon_joint  death   MA        1        sample       8               4
horizon_joint  death   MA        1        sample       9               5
horizon_joint  death   MA        2        sample       7               3
horizon_joint  death   MA        2        sample       8               4
horizon_joint  death   MA        2        sample       9               5
horizon_joint  death   CA        1        sample       10              3
horizon_joint  death   CA        1        sample       11              4
horizon_joint  death   CA        1        sample       12              5
horizon_joint  death   CA        2        sample       10              3
horizon_joint  death   CA        2        sample       11              4
horizon_joint  death   CA        2        sample       12              5

Presumably, when a team generates these samples, they have samples 1:n for each unique target/location and would have to perform a second step to transform output_type_id to the unique values we are asking for (e.g., in the above example, samples 1,2,3 for CA hosp would become 4,5,6). This seems to (1) put more burden on teams, and (2) create more possibilities for error.

Possible alternative: Since we now have information about dependence structure in the metadata, can we perform this transformation step in post-processing? In this instance, all teams, regardless of model dependence structure, would have the same number of samples (i.e., all submissions would look like the example below), and the assumption would be that samples are independent when from task_id variables that are not included in samples_joint_across metadata field. I see downfalls with this approach too (e.g., information is not included in submission directly), but it might be worth discussing further.

model_id       target  location  horizon  output_type  output_type_id  value
all_structures  hosp    MA        1        sample       1               3
all_structures  hosp    MA        1        sample       2               4
all_structures  hosp    MA        1        sample       3               5
all_structures  hosp    MA        2        sample       1               3
all_structures  hosp    MA        2        sample       2               4
all_structures  hosp    MA        2        sample       3               5
all_structures  hosp    CA        1        sample       1               3
all_structures  hosp    CA        1        sample       2               4
all_structures  hosp    CA        1        sample       3               5
all_structures  hosp    CA        2        sample       1               3
all_structures  hosp    CA        2        sample       2               4
all_structures  hosp    CA        2        sample       3               5
all_structures  death   MA        1        sample       1               3
all_structures  death   MA        1        sample       2               4
all_structures  death   MA        1        sample       3               5
all_structures  death   MA        2        sample       1               3
all_structures  death   MA        2        sample       2               4
all_structures  death   MA        2        sample       3               5
all_structures  death   CA        1        sample       1              3
all_structures  death   CA        1        sample       2              4
all_structures  death   CA        1        sample       3              5
all_structures  death   CA        2        sample       1              3
all_structures  death   CA        2        sample       2              4
all_structures  death   CA        2        sample       3              5

It seems that it will be very important to have a clear definition of what we mean by "sample" and what conditions should be met to consider something a "draw from a joint predictive distribution". We discussed three potential types of samples: a. samples that differ because of parameters/initial conditions only (imagine a deterministic model) b. samples that differ because of stochasticity only (image a stochastic model with noise included but each draw has the same parameters, initial conditions, etc.) c. samples that differ because of both parameters/initial conditions and stochasticity Are there cases of (c) where we would want to know which samples were from matched parameters/initial conditions but differed stochastically? A hub that wanted to do this could add an additional column to differentiate the two (see example below with added parameter_set column), but it is currently not clear what the output_type_id column is describing (i.e., is parameter_set the correct column to add, or should it be stochastic_run?).
```
model_id            target  location  horizon  output_type  parameter_set output_type_id  value
multiple_variation  hosp    MA        1        sample       1               1             3
multiple_variation  hosp    MA        1        sample       1               2             4
multiple_variation  hosp    MA        1        sample       1               3             5
multiple_variation  hosp    MA        1        sample       2               1             3
multiple_variation  hosp    MA        1        sample       2               2             4
multiple_variation  hosp    MA        1        sample       2               3             5
```

sbfnk commented 10 months ago

This comment by @eahowerton is close to what I would have given as an answer to @elray1's question above:

I am having a hard time imagining a situation where it's really important, or even desirable, to allow to store some information in the contents of a string sample id?

Given that we don't need it to be a number it seems to me that allowing more flexibility here potentially is of benefit to the hubs. E.g. in the example above samples could be called anything that teams use to draws from a joint distribution, e.g. a team could have

model_id       target  location  horizon  output_type  output_type_id  value
horizon_joint  hosp    MA        1        sample       MA_1               3
horizon_joint  hosp    MA        1        sample       MA_2               4
horizon_joint  hosp    MA        1        sample       MA_3               5
horizon_joint  hosp    CA        1        sample       CA_1               3
horizon_joint  hosp    CA        1        sample       CA_2               4
horizon_joint  hosp    CA        1        sample       CA_3               5

if these are independent by location or

model_id       target  location  horizon  output_type  output_type_id  value
horizon_joint  hosp    MA        1        sample       1               3
horizon_joint  hosp    MA        1        sample       2               4
horizon_joint  hosp    MA        1        sample       3               5
horizon_joint  hosp    CA        1        sample       1               3
horizon_joint  hosp    CA        1        sample       2               4
horizon_joint  hosp    CA        1        sample       3               5

if they are joint by location, or they could use the output_type_id value to indicate where samples are from different parameter sets (p1, p2, ...) or stochasticity (s1, s2, ...). A common output_type_id would then always indicate that samples come from a joint distribution, i.e. combine multiple rows into a unit of observation - this could also be used in a check when validating against the config file.

If we're being very explicit in the config file as suggested above then this may not be necessary but I imagine we may not be able to cater for all possible use cases so it's a question of how prescriptive we want to be here vs. allowing some flexibility - and whether we're worried about creating confusion if common output_type_id values sometimes indicate draws from a joint distribution and sometimes not, depending on the config file.

elray1 commented 10 months ago

Good questions and ideas. Here are a few thoughts:

I think that we should have it as a goal to keep the data files self-contained, so that you don't need to refer to config files or metadata files to interpret the model outputs correctly. I agree that our tools could do some post-processing at the time we load data files, but not everyone may use our tools.
We anticipate that teams may struggle to get sample indexing right.
- One thing we can do about this is just to try to have helpful messaging outputs from validations on sample submission, e.g.: "{The hub config file}/{Your model metadata file} indicates that samples should be independent across different locations, but some of the sample indices in the output_type_id column of your submission were shared across locations. Please update your submission to use different sample indices for different locations."
- Overall, I'm not convinced that getting integer sample indices right would be a huge burden on contributing modelers, especially if we can set up helpful validations. But I do see the point that it's some extra work and an opportunity for mistakes.
- Seb's suggestion of just using strings also provides another way around this, e.g. by concatenating the location with the sample index to ensure that output_type_ids are different in different locations. (I'd note that from here, it's not hard to get to distinct integers e.g. in R via as.integer(factor(...)) -- but I do see that this discards some information -- see next point)
It sounds like there is a general desire to allow for the possibility of collecting/storing information about how samples were generated. A use case is the one in Emily's second point, where we might want to track the combination of parameters/initial conditions and stochastic replicate that was used to generate a particular sample.
- I see two ideas for what this might look like:
  1. Encode this information in a string that's used for the output_type_id. In Seb's notation, the output_type_id might then have values like "p1_s1", "p1_s2", "p2_s3". Or if these are independent across locations, "MA_p1_s1" and "CA_p1_s1". If we want this information to be usable, I think we would need to validate that these output_type_id values were consistently and correctly formatted across models at submission time, so that if the hub wants to they can do some string parsing to extract the parameter set and stochastic replicate indices. So maybe the hub config would have to provide regex that's used to validate the sample indices upon submission? Where if the hub doesn't care about storing information in the output_type_id column, they could just specify a regex that matches any string up to some maximum length.
  2. Store this information in multiple columns. For example, maybe the hub would collect parameter_set and stochastic_replicate columns, which determine something like an output_type_id in combination. The hub config would have to specify something like what columns put together specify an output_type_id, and what valid combinations of values across those columns look like. This could be kind of like what we've set up for target keys.
- I think both of these are possible in theory and both have limitations, but the second seems like a much more substantive break from things we've laid out so far. So I guess I'd be inclined to go with option a here, allowing a hub to accept string output_type_ids and specify a regex saying how they want information to be represented in those strings?

elray1 commented 10 months ago

Summing up the main points from discussion about this today:

Overall take-aways about validations and information storage

we think that the built-in hubverse structures and validations should be relatively minimal. We will keep an output_type_id column, with a goal that this may have either integer or character values for the sample output type. We will not provide a regex validation for information in strings, and we will not handle validating other columns with this information. The validations that will be provided by default will be minimal checks that. sample indices are consistent with the dependence structure specified by the hub or the model's metadata file:
1. after subsetting to rows with sample for the output type, within each group of rows defined by a combination of task id variables, the number of unique sample indices is between the minimum and maximum number of samples per task as specified in the hub's tasks config file. Error if this condition is not met.
2. after subsetting to rows with sample for the output type, the same sample index (i.e. value for output_type_id) should not appear in rows with different values for any task id variable that is not listed in a samples_joint_across field in the config or metadata file. (or, applying De Morgan's law: all sample indices should be distinct in rows with different values for any task id variable that is not listed in a samples_joint_across field). Error if this condition is not met.
3. after subsetting to rows with sample for the output type, we expect to see some duplicated sample indices in rows with different values for task id variables that are listed in a samples_joint_across field. I'm not sure if we want an error or maybe just a warning/informational message if this condition is not met? To-do to figure out how careful to be in this check (e.g. if samples_joint_across includes "horizon", maybe we should expect any sample indices that appear for one horizon to also appear at all other horizons, for fixed values of the non-joint-across task id variables)?
If a hub wants to use strings for sample indices and wants to ensure that those strings contain information that can be reliably extracted, they will need some kind of regex validation. We will not provide that, but the hub could write a custom validation step function.
We might like to allow a hub to specify additional columns that would be included in a submission file, but are not task ids or one of our other standard columns. Examples include parameter_set and stochastic_replicate. Hubverse default tooling would not validate contents of these columns, but a hub could add a custom validation step if desired. That said, we might need to get something into the schema config for the hub that says "these extra columns are allowed/should not result in an error".

Misc other note: is `parameter_set` just a task id?

This might depend on how a hub is thinking about parameter_set, but my opinion is that in general, the parameter set is not really a task id variable so much as a piece of metadata about how a sample was generated. Some considerations related to this:

If a hub wants to collect 100 trajectory "samples" for each combination of location and target, in most cases that requirement will apply to the total number of samples that are generated, whether they came from different parameter sets, stochastic replicates, or some combination of the two. But if we add parameter_set as a task id variable, with 100 possible values for the parameter set, we could end up collecting::
- up to 100*100 samples per location & target from a stochastic model with a deterministic core that takes initial parameter settings as an input.
- but only 100 samples per location & target from a deterministic model where the concept of a "stochastic replicate" is not relevant
More on a vague level:
- from the perspective of a public health end user of the predictions, in general a forecasting task will involve a specification of things like what location and target variable we want to predict, not what specific numeric values should be used as initial conditions
- in some sense, it seems like model outputs for different models for the same prediction task should be comparable. for example, if we are doing forecasting, we should be able to score forecasts for the same prediction task and compare those scores in a meaningful way to say what model was better. For that to work, the task id variable values have to "mean the same thing" for different models. But "parameter set 1" might mean very different things for two different models.

Maybe an exception to these points is if a hub assigns some meaning to each parameter set that is agreed upon by all participating models. This would turn a parameter set into something more like a scenario, which does feel like a task id.

nickreich commented 3 months ago

In this comment above, @elray1 said that the decision was made to

add samples_joint_across as a property of output_type_id. we will also add support for models to specify sample dependence structure in their model metadata YAML file. note that models should only be providing this metadata if the dependence structure is not specified by a hub.

However, I am concerned that this does not leave room to specify that a model submission has a different dependence structure than what is specified by the hub. My proposal (in discussion with @bsweger ) is that we have the hub specification in the tasks-config file as the "default" dependency specification and that a model could indicate that they had a different specification by using the model-metadata specification as proposed here.

One question is whether hub admins should have the ability to allow such an "override" of default specifications. Our feeling is that for now we should keep it simple and NOT include an additional hub-specific config field about this, but that hub admins could use criteria about how samples are generated as eligibility criteria for inclusion in an ensemble. E.g. "to be eligible for inclusion in an ensemble, a model's submission in a given round must at least be joint across horizons, and could optionally be joint across locations." or something like that.

elray1 commented 3 months ago

Seems reasonable in general, except that I think it would be quite reasonable for a hub to:

not want to allow submissions of samples that didn't at least capture some minimal level of dependence. so, they might allow a per-model override saying "i also capture spatial dependence" but not a per-model override saying "i don't capture temporal dependence". The reason being that they don't want downstream users of forecast data to mis-use the data and do invalid predictions of, e.g., cumulative incidence or peak timing.
not want to deal with having to track per-model dependence structures.

I could see myself being in camps 1 and 2 in future hub administrative efforts, so i would like to support people like future me

LucieContamin commented 3 months ago

To add some examples, currently on US SMH, we are implementing a new format to track dependence or what we call "pairing" or "grouping" information.

We have a wiki with the submission file format documentation, here

We store the expected minimal level of dependence expected by the hub in the tasks.json, for example:

"output_type":{
                    "sample":{
                        "output_type_id":{
                            "min_samples_per_task": 100,
                            "max_samples_per_task": 100,
                            "samples_joint_across": ["horizon", "age_group"]
                        },
                        "value":{
                            "type":"double",
                            "minimum":0
                        }
                    }
                },

During the submission, the validation is automatically checking the minimal level but does not register if a team add another level of dependence (for example if their submission is joint across horizon, age_group and location, the submission will be accepted, but if it's joint only across horizon it will not be accepted).
We are currently not asking the team to provide their "grouping" information in the metadata. It's relatively easy to extract from the submission file. However, I can see that some hubs might want to have that information provided by the team, if you have a lot of team participating and weekly submission.

nickreich commented 3 months ago

In response to @elray1 's comment:

I could see myself being in camps 1 and 2 in future hub administrative efforts, so i would like to support people like future me

I agree that a hub might want to do this, but wouldn't a simpler development approach (at least for now) be to have hubs do this kind of validation on their own. E.g., from @LucieContamin 's comment it sounds like SMH is adding some custom validations (which could potentially be folded in to future central hubverse efforts).

So the process/documentation for hubverse tooling would remain in the near-term that

hubs specify a default sample dependency structure using the hub-level config samples_joint_across
if allowed in model-metadata by hub admins, models can specify deviations from that dependency structure (unless a hub admin wants to be very restrictive about the kinds of models submitted, I suspect that most of the time this would be allowed, but a hub could be strict! and maybe I am wrong about what would be "typical".)
if deviations are allowed at all, hubs can choose which kind of deviations to allow, but for now hub admins are on their own to enforce (as with SMH).

elray1 commented 3 months ago

Probably worth a weigh-in from anna, but i think the validation step in item 3 on this list might be among the easier things to implement in this whole set up, and not particularly more challenging or involved than item 2. e.g., I could imagine addressing 2 and 3 simultaneously by introducing a pair of booleans along the lines of tasks[[round_id]][[task_group_index]].sample_dependence_subsets_allowed and tasks[[round_id]][[task_group_index]].sample_dependence_supersets_allowed, which if set to false trigger checks that the model's specified dependence structure in a given round is not a strict sub/superset of the hub's specified dependence structure respectively. if it's that easy, I don't really see a reason to hold off on support for this, particularly given that we have an existing hub where this is the desired behavior. Of course we can/should split this into a separate issue/to-do when the time comes and prioritize it as appropriate; it just doesn't seem particularly more cumbersome or challenging to address to me. If I'm wrong about this, happy to defer or delay.

nickreich commented 3 months ago

That is a fair point. I'm also concerned about conceptual complexity for new/existing/future users in addition to the complexity of implementation and/or maintenance and support of new metadata fields. But I agree that a compelling case to build it is if a current hub could use it and needs it now.

elray1 commented 3 months ago

concern about conceptual clarity is very legitimate.

my take, though, is that implementation, maintenance, and user onboarding will all be smoother if we decide what we want once and do it up front. so i would vote to either:

decide now that we're just not going to support this as part of standard hubverse tools
decide now that we're going to support it, implement it as part of an initial round of dev if possible, and think about what the right defaults are so that as few hub admins as possible have to think about this setting (i think we can choose defaults, right?)

eahowerton commented 3 months ago

Just adding another minor reflection from the SMH experience. Communicating and implementing the plan to collect dependence structure information was a real challenge. Defining and standardizing "dependence" across different modeling frameworks was not as straightforward as it originally seemed.

Overall, I'd say it has turned out to be a fruitful endeavor, and I can see clear reasons to support it more generally. And perhaps, the more we support/invest in this, the more standard it will become. But at the same time, I agree with @nickreich's concern about introducing additional conceptual complexity and tackling the associated communication challenges, especially when it's not clear how many hubs would use this functionality.

nickreich commented 3 months ago

That is a reasonable point, especially since the way SMH implemented it (if I understand correctly, based on the link Lucie provided above), the dependence ended up being specified in additional non-hubverse-compliant columns and in ways that talked about "grouping by" rather than "joint across" which are related but kind of complementary concepts...

elray1 commented 3 months ago

I'm thinking it might help also if we get specific about what we're trying to do here and consider creating more/separate fields specific to those purposes. It seems like we have thought of 2 kinds of validations we might like to do on model output submissions:

Is a model output submission consistent with what a hub wants to collect?
Is the indexing in a model output submission consistent with the dependence structure modeler claims they are capturing?

Additionally, Nick has brought up a third thing that could be nice -- we might like to:

allow a hub to specify a hub-default (or round-default or round/target-group-default) dependence structure that saves contributing teams who match that default from having to put anything in their model metadata file.

Idea 1

Thinking about 1 and 3, I wonder if we could capture the ideas more clearly with tasks.json fields like:

samples_joint_across_default
samples_joint_across_minimal

Note, I'm omitting a field like samples_joint_across_maximal because it seems like in general a hub would not want to reject samples that met their minimal dependence needs and also captured dependence across other things like locations or age groups.

For example, a hub that prefers modelers to submit trajectories, joint across target_date, but will take any samples they can get, might specify

samples_joint_across_default: ["target_date"]
samples_joint_across_minimal: [] (or null or whatever appropriate convention)

Then our validations process could involve checks like this:

Did the modeler put a specification of samples_joint_across in their model metadata file? If so, verify that what they put there contains everything in samples_joint_across_minimal.
Grab the samples_joint_across specification given by the model metadata file if provided there, or the hub/round/target default if nothing was provided by the modeler. Check that the sample indexing in the submission file is consistent with that.

Idea 2

Noting that the above are not actually the checks that were done by SMH. My understanding is that they:

did not deal with the validation in step 2 here, or generally tracking modeler-specific changes to dependence structure in a formal way in their model metadata files
did a version of the check in step 1 by inferring the dependence structure captured by the model from their sample indices.

If we just went that way, which is simpler, we would not bother with allowing models to put stuff about samples_joint_across in their model metadata files, and we would not need samples_joint_across_default. In the SMH example, all that would have been needed was a setting like samples_joint_across_minimal: ["horizon", "age_group"].

Then a hub like the one Nick mentioned that would really like samples to be joint across target_date but will take what they can get would just specify samples_joint_across_minimal: [] in their hub config files and would write some human-readable note to their contributing teams about how they really hope teams will submit trajectories.

What we would lose with this approach is the ability to check that an output submission with sample indexing implying dependence is captured across dates, locations, and age groups was set up correctly by the contributing team.

nickreich commented 3 months ago

After spending some time iterating with @bsweger on this for thinking about setting up a variant hub, and in discussion with @nikosbosse about related ideas in scoringutils, I wonder if maybe introducing different terminology/concepts might be useful. In scoring utils, they have the concept of a "forecast unit" which is equivalent to the hubverse idea of a unique combination of "task id variables". That is, the rows that share one unique set of task id variables define a single "forecast unit". And that forecast unit can have a single observed value for its target data.

I find the concept of "joint across" to be jargony, and actually a non-trivial statistical concept (i.e., well beyond intro stats). So I've been trying to think about ways that we could represent and document these ideas in ways that both speak to a statistically literate modeling audience and a data-literate but "stats-naive" hub admin/dev audience.

As a concrete example, here is a table showing 3 separate forecast units, where the task id variables are "origin_date", "horizon", and "location". There are 9 rows, as for each of the 3 forecast units we have three samples. (I've left the data values as just "-" for now.

origin_date	horizon	location	output_type	output_type_id	value
2024-03-15	-1	MA	sample	1	-
2024-03-15	0	MA	sample	2	-
2024-03-15	1	MA	sample	3	-
2024-03-15	-1	MA	sample	4	-
2024-03-15	0	MA	sample	5	-
2024-03-15	1	MA	sample	6	-
2024-03-15	-1	MA	sample	7	-
2024-03-15	0	MA	sample	8	-
2024-03-15	1	MA	sample	9	-

In any sample data, you could never have two rows of data with the same output_type_id that share the same set of values for all task-id variables. In the table above, each of the samples is independently drawn, meaning that the samples are "joint across" nothing, or [ ].

I think a fundamental idea at the core of this discussion about samples is that we are changing the forecast unit. That is to say, that now we are saying that a forecast unit could be comprised of multiple unique unique combinations of task-id variables. For example, in another version of the table above, we might say that for a given location and origin_date, one sample corresponds to a grouping of three unique task-ids that share the same location and origin-date but not the same horizon. So in this case, there are only three "forecast units" or independent samples, not nine as there were above.

origin_date	horizon	location	output_type	output_type_id	value
2024-03-15	-1	MA	sample	1	-
2024-03-15	0	MA	sample	1	-
2024-03-15	1	MA	sample	1	-
2024-03-15	-1	MA	sample	2	-
2024-03-15	0	MA	sample	2	-
2024-03-15	1	MA	sample	2	-
2024-03-15	-1	MA	sample	3	-
2024-03-15	0	MA	sample	3	-
2024-03-15	1	MA	sample	3	-

In this case, the samples are "joint across horizon". Note that this does not violate the above rule: no two rows with the same output_type_id share exactly the same values for task-id variables. Instead, the output_type_id values group the rows into sets of task-id variables that are "connected" by the model, or that the model is "joint across".

Here is another example with two locations where data are joint across horizon:

origin_date	horizon	location	output_type	output_type_id	value
2024-03-15	-1	MA	sample	1	-
2024-03-15	0	MA	sample	1	-
2024-03-15	1	MA	sample	1	-
2024-03-15	-1	MA	sample	2	-
2024-03-15	0	MA	sample	2	-
2024-03-15	1	MA	sample	2	-
2024-03-15	-1	TX	sample	3	-
2024-03-15	0	TX	sample	3	-
2024-03-15	1	TX	sample	3	-
2024-03-15	-1	TX	sample	4	-
2024-03-15	0	TX	sample	4	-
2024-03-15	1	TX	sample	4	-

And another where they are joint across horizon AND location. This means that when output_type_id==1, all of those six rows are taken from the same model realization, where three steps ahead in both MA and TX share some information within the model:

origin_date	horizon	location	output_type	output_type_id	value
2024-03-15	-1	MA	sample	1	-
2024-03-15	0	MA	sample	1	-
2024-03-15	1	MA	sample	1	-
2024-03-15	-1	MA	sample	2	-
2024-03-15	0	MA	sample	2	-
2024-03-15	1	MA	sample	2	-
2024-03-15	-1	TX	sample	1	-
2024-03-15	0	TX	sample	1	-
2024-03-15	1	TX	sample	1	-
2024-03-15	-1	TX	sample	2	-
2024-03-15	0	TX	sample	2	-
2024-03-15	1	TX	sample	2	-

Maybe another way to frame this in terms of validation or how to reverse engineer the above datasets is that when something is specified as "joint across variables x and y" that means that for all rows that share an output_type_id, those rows are expected to have multiple unique values of x and y (maybe all unique values of x and y present in any "sample" row of the submitted data?). I've actually been struggling to come up with a specific set of validations for these data.

Concrete suggestions

So to return to nomenclature, I wonder if sample_grouping might be more a more clear term than samples_joint_across, where the values of the accompanying array would stay the same? E.g.

sample_grouping (or sample_task_grouping?) could be a field in tasks.json > rounds > model_tasks > output_type > sample > output_type_id
min_.. and max_samples_per_task_group would replace the current min_.. and max_samples_per_task
I think introducing the concept of "task groups" or "task groupings" that refer to the unique combination of values of task-id variables within the sample_grouping columns will be useful.

So the tasks.json might look like this:

"output_type":{
    "sample":{
        "output_type_id":{
            "min_samples_per_task_group": 100,
            "max_samples_per_task_group": 100,
            "sample_grouping_minimal": ["horizon"],
            "sample_grouping_default": ["horizon", "location"]
        },
        "value":{
            "type":"double",
            "minimum":0
        }
    }
}

The above configuration would trigger the following checks on any submitted files that are assumed to have the default setting:

for rows that share a single output_type_id, the combination of values in the columns specified by sample_grouping must be unique (i.e. not duplicated) in those rows
the combinations of values in one task grouping (i.e. the rows that share a single "output_type_id" value) must be the same as in all other task groupings.
maybe others?

Hopefully this long treatise doesn't muddy the waters further.

nickreich commented 3 months ago

I'm also wondering if this concept of "task grouping" might be related to other concepts in hubverse, like for compositional data, where we might want a "variant" task-id column to refer to a unique grouping of rows whose values we expect to sum to 1.

bsweger commented 3 months ago

@nick Thanks for calling out the "data-literate/stats-naive" category of hubverse user here.

In that vein, a follow-up question. When looking at your original example table (the first table in this comment, with 9 unique output_type_ids), how would someone know that it represents 3 samples? It seems like the sample number is a key component of understanding how output_type_id uniqueness works, but it's implicit rather than explicit.

Is that a correct characterization, or am I missing something obvious?

elray1 commented 3 months ago

I agree with Nick's criticisms of the joint_across terminology.

When I started writing this comment I was on board with a "task grouping" terminology. But then I tried to use it to answer Becky's question just above, and realized that actually it impacts grouping (in order to split a df of model outputs into groups corresponding to a modeled observational unit) in exactly the opposite way that you might expect from that name -- it describes the variables that you would remove from a group_by operation in order to divide rows up into the observational units. In Nick's first example, with a separate output_type_id for each row, if you do df |> group_by(origin_date, horizon, location, output_type, output_type_id), each group in the result will contain one sample for one observational unit. In Nick's second example, "one sample corresponds to a grouping of three unique task-ids that share the same location and origin-date but not the same horizon." That means that if we want a separate group for each sample an observational unit, we should not group by the horizon:df |> group_by(origin_date, location, output_type, output_type_id). Note that this is actually exactly what Nick's text described, "a grouping of three unique task-ids that share the same location and origin-date", i.e., group by location and origin date. So "sample_grouping_minimal": ["horizon"] might be a confusing term to use in this setting, because "horizon" is "the thing we have to leave out in order for what's left over to describe an observational unit"...

Unfortunately, I can't think of a better name immediately, though I will keep thinking about it... I do think it's reasonable to think of those samples as "going in one group" together, this just interacts oddly with the grouping operation you'd want to do with a data frame of model outputs... Maybe the problem is that the word "group" is too flexible -- group what, for what purpose?

r.e. "I'm also wondering if this concept of "task grouping" might be related to other concepts in hubverse, like for compositional data, where we might want a "variant" task-id column to refer to a unique grouping of rows whose values we expect to sum to 1." -- this makes sense, but also seems potentially complex. e.g., what if I collect trajectories for variant proportions? Then I have task groupings by variant (sum-to-1) and also by variant+target date (trajectories tracking variants). Is it more helpful than not to keep track of these things? Maybe. Do we need to keep track of those things? If so, are there any other alternatives for how we might do so?

(Becky, I think a more complete/direct response to your question might be that in the first example, if you group by all of the task id variables origin_date, horizon and location, there are 3 rows in each group.)

bsweger commented 3 months ago

(Becky, I think a more complete/direct response to your question might be that in the first example, if you group by all of the task id variables origin_date, horizon and location, there are 3 rows in each group.)

How would we know the number of samples in the last example of that comment?

Is it still 3?

nickreich commented 3 months ago

Agree with @elray1 that the verb "group" is complicated here and may be too loaded to use.

Throwing some additional miscellaneous thoughts in here about appropriate terms:

in scoringutils, they use the idea of a "forecast unit" to represent a single set of unique "task ids"
@nikosbosse and I had talked about something like a "compound forecast unit" to represent the situation where multiple forecast units/task-ids are being collected into one new larger dependent "unit" or "set"/collection of task-ids.
so maybe a parallel in the hubverse could something like sample_taskid_set with min/max_samples_per_taskid_set, sample_taskid_set_minimal and sample_taskid_set_default as the configuration fields?

To @bsweger 's question:

How would we know the number of samples in the last example...? The number of samples could always be determined by the number of times that each unique combination of task-id variables appear in the submission. So for the last example it would actually just be 2 samples since each task-id-set (e.g., one set is origin_date == "2024-03-15" & horizon == -1 & location == "MA") appears exactly twice in the provided data.

LucieContamin commented 3 months ago

I am a bit unclear on the meaning behind: samples_joint_across_default and samples_joint_across_minimal. As I understand, default is the expected format but the hub will accept other levels, with at minimal samples_joint_across_minimal, is that correct? Then, in this case, why don't we just use one level of specification for the dependence?

For the terminologies, I agree too but I don't have any other ideas. I like the "set" idea.

elray1 commented 3 months ago

r.e. the meaning behind samples_joint_across_default, I think this field is mainly/only useful in a hub that:

allows/requires modelers to specify the dependence structure that their samples capture in their model metadata file (for the purpose of enabling validations to check that the sample indices in the submitted model output file correctly align with that intended dependence structure)
wants to save those modelers the trouble of doing so if they match the hub default

bsweger commented 3 months ago

Chiming in with a +1 for set. That word is well understood by data practitioners beyond the realm of statistics.

The reason I keep asking about the number of samples...wouldn't you need to know that in order to reason about the contents of output_type_id?

sample_num	origin_date	horizon	location	output_type	output_type_id	value
1	2024-03-15	-1	MA	sample	1	-
1	2024-03-15	0	MA	sample	1	-
1	2024-03-15	1	MA	sample	1	-
2	2024-03-15	-1	MA	sample	2	-
2	2024-03-15	0	MA	sample	2	-
2	2024-03-15	1	MA	sample	2	-
1	2024-03-15	-1	TX	sample	3	-
1	2024-03-15	0	TX	sample	3	-
1	2024-03-15	1	TX	sample	3	-
2	2024-03-15	-1	TX	sample	4	-
2	2024-03-15	0	TX	sample	4	-
2	2024-03-15	1	TX	sample	4	-

If you add the sample number to @nickreich's two locations where data are joint across horizon example, you can see the info through the lens of relational database theory.

the set of forecast units = {'origin_date', 'horizon', 'location'}
the compound forecast unit = {'origin_date', 'location'}

i.e., joint_across can be inferred as the set of forecast units - the compound forecast unit (horizon in this case)

I think we can then say that output_type_id has a functional dependency on sample_num + all columns in the compound forecast unit.

I assume that we're not asking hub participants to submit a sample_num as part of their model-outputs, but is it correct to assume that it's something we'd need to derive as part of the validation process? I can't reason about these examples without adding it.

LucieContamin commented 3 months ago

Thanks @elray1 for the additional information, sorry to insist but just to be sure I understand. The idea here is to use:

samples_joint_across_default/sample_grouping_default: "preferred" dependence structure but accepted others. If a team submits projections without "join across/grouping" information in their metadata file than this structure is expected and will be use for validation
samples_joint_across_minimal/sample_grouping_minimal: minimal dependence structure accepted. If a Hub accept any level of dependence, it will be set to null or [] (whatever we decide).

So, in the case where both (default and minimal) are set to the same value, then it will be expected that the submissions have at least the default "preferred" dependence structure. Is that correct?

In that case:

does that mean we allow teams to have different "level" of information in their metadata? and just a note, I think should we store that information per round too (in the metadata).
and does that mean that we are going to validate the team dependence structure against either the default or what the team provide and also that is contains the minimal dependence structure, not just that it contains the minimal expected dependence?
last question, I am not sure we need/want to include that but just in case: how do we represent/validation complex dependence structure, for example, we can imagine the following example, where the samples are grouped by horizon and "sub-group" of location (group 1: MA, TX; group 2: CA, FL):

origin_date	horizon	location	output_type	output_type_id	value
2024-03-15	-1	MA	sample	1	-
2024-03-15	0	MA	sample	1	-
2024-03-15	-1	MA	sample	2	-
2024-03-15	0	MA	sample	2	-
2024-03-15	-1	TX	sample	1	-
2024-03-15	0	TX	sample	1	-
2024-03-15	-1	TX	sample	2	-
2024-03-15	0	TX	sample	2	-
2024-03-15	-1	CA	sample	3	-
2024-03-15	0	CA	sample	3	-
2024-03-15	-1	CA	sample	4	-
2024-03-15	0	CA	sample	4	-
2024-03-15	-1	FL	sample	3	-
2024-03-15	0	FL	sample	3	-
2024-03-15	-1	FL	sample	4	-
2024-03-15	0	FL	sample	4	-

Sorry again if I am missing anything.

@bsweger , I don't think we want to add a column for the sample_num, I don't know if it's already integrated in the validation or not, but in the SMH validation, we test the number of samples by calculating the number of repetition of unique set of projection units/task id columns set. We force that all the unique "set" should have the same number of samples. For example, if a team provide 100 samples, each task id set is repeated 100 times. Does that answer your question?

elray1 commented 3 months ago

r.e. Lucie's questions, my answers would be:

So, in the case where both (default and minimal) are set to the same value, then it will be expected that the submissions have at least the default "preferred" dependence structure. Is that correct?

Yes

does that mean we allow teams to have different "level" of information in their metadata? and just a note, I think should we store that information per round too (in the metadata).

If I understand the question, yes - depending on whether they match the default or not, different teams might have different stuff that they need to put into their metadata files. This is a bit messy.

Agreed this information is per-round (and potentially even per-target-group). Is it reasonable to expect modelers to get this metadata into their model metadata files?? I'm not sure.

and does that mean that we are going to validate the team dependence structure against either the default or what the team provide and also that is contains the minimal dependence structure, not just that it contains the minimal expected dependence

Yes

last question, I am not sure we need/want to include that but just in case: how do we represent/validation complex dependence structure, for example, we can imagine the following example, where the samples are grouped by horizon and "sub-group" of location (group 1: MA, TX; group 2: CA, FL):

I see two options:

Allow this as long as the groups are finer than the thing that was listed . For example, the submission you gave might be "valid" if samples_joint_across (whatever we end up calling it) is "horizon", but not if it is ["horizon", "location"] because those samples are not joint across all locations. But probably a good idea to issue a warning or message in this kind of setting...
Throw an error if groups don't correspond to something that can be read from the metadata file.

I'm not sure which of those two options I prefer, they both seem wrong.

elray1 commented 3 months ago

I think another very reasonable direction to go could be more like "Idea 2" in my comment above, more like my understanding of what SMH is doing: forget about the "default" value for this, and forget about allowing modelers to submit per-round or per-target-group metadata about their dependence structure. Validate only that the dependence structure implied by their sample indices captures at least as much as the minimal requirements of the hub. This loses the ability to really carefully check that the modelers are doing indexing in a way that reflects what they want. But capturing the metadata required to enable that validation may be out of reach anyways, in terms of the complexity of what we'd be asking modelers to provide. And maybe instead of that formal validation we could just provide messages in the validation output describing the dependence structure we've identified based on the submitted sample indices.

LucieContamin commented 3 months ago

I like this option ("Idea 2"). I had the same idea when we implement it in SMH, to have the validation returns the dependency structure information. I did not had the time to implement it yet.

However as you say, the Idea 2 has the limit to not being able to really check the modelers indexing or to force a dependence structure on all the submissions.

So maybe a hybrid of 1 and 2 might help:

samples_joint_across/sample_grouping: dependence structure accepted
a tag (boolean?) to say if additional structure is accepted So for example a hub can specify the expected projection "grouped" by horizon and they do or do not accept additional "grouping"

I also like the idea of having the information of the dependence structure stored somewhere because it's not easy to find that information just by looking at the file, so I think it's a good idea to store the information in the metadata. Either at hub level if we want to only have the minimal dependence structure information or only one structure accepted, or at team level if we want more information. If the team provides it then we can use it for validation I guess.

It did not solve the complex dependence structure issue, I prefer the option 1, with a warning than 2 but maybe other hubs will prefer to throw an error.

nickreich commented 3 months ago

For anyone following this issue, I am working on writing up a complete proposal that tries to address the issues brought up here on this incredibly long thread...

nickreich commented 3 months ago

I'm going to close this, as superceded by #70 .

hubverse-org / schemas

consider formalizing dependence structure for sample outputs #48

1. naive approach, existing tasks.json ideas

2. alternative proposal

2a. using "required" and "optional" entries similar to existing specifications

2b. more directly listing a min and max number of samples

Challenges

Ideas for solutions

Theme 1: specifying dependence structures in config files

Theme 2: distinguishing between types of dependence captured by samples using different output types

Other notes

hub-level dependence specification

model-submission level dependence specification

Overall take-aways about validations and information storage

Misc other note: is parameter_set just a task id?

Idea 1

Idea 2

Concrete suggestions

2a. using `"required"` and `"optional"` entries similar to existing specifications

Misc other note: is `parameter_set` just a task id?