selection of round identifier

lmullany commented 2 months ago

Is the selection of the origin_epiweek, as the main round identifier, the right choice?

Currently, the origin_epiweek is the main "round" identifier, but there are a few other possible choices. (Almost) all submission files have two pieces of information embedded in the filename:

- EWXX, where XX is the epi week
- the submission date

Some caveats:

The submission date in the filename is sometimes missing, and in a small number of cases I inferred this date from the pattern of submissions by that model/team around this epiweek. For example, let's say we have submission files that look like this (dates made up):
- EW46_model_2018_11_16.csv
- EW47_model.csv
- EW48_model_2018_11_30.csv

Then I infer the submission date as 11/23 for EW47_model.csv. Note that whatever decision I make here currently has no bearing, because I'm not using the submission date to identify the round; I'm only using the prefix EWXX, which was not missing for any file.

The "submission date" embedded in the filename may or may not align with when the file was submitted to the repo. I'm not sure there is any way to consistently identify when each file was actually submitted.
Some files for separate "rounds" (i.e., different EWXX values) have the same "submission date" embedded in the filename. For example:
- EW02-KoT-adaptive-2020-2-5.csv
- EW03-KoT-adaptive-2020-2-5.csv
The "submission date" embedded in the filename does not necessarily fall on a consistent day of the week. While some weekdays are more highly represented than others, all are of course possible, and it's not clear to me whether we could somehow leverage these dates to reference, say, a particular Saturday prior to submission.
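
The inference described in the first caveat can be sketched roughly as follows. This is a minimal illustration, not the code actually used for the archive: the regular expression, the function names `parse_filename` and `infer_submission_dates`, and the rule "shift the nearest known date by 7 days per epiweek" are all assumptions. It also ignores the wrap from EW52 to EW01 within a season.

```python
import re
from datetime import date, timedelta

# Hypothetical filename pattern: "EW<week>", a model name, then an optional
# date, with "-" or "_" as separators (both appear in the examples above).
FILENAME_RE = re.compile(
    r"^EW(?P<week>\d{1,2})[-_](?P<model>.+?)"
    r"(?:[-_](?P<y>\d{4})[-_](?P<m>\d{1,2})[-_](?P<d>\d{1,2}))?\.csv$"
)


def parse_filename(filename):
    """Return (epiweek, model, submission_date or None) for one filename."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename}")
    submitted = None
    if match["y"] is not None:
        submitted = date(int(match["y"]), int(match["m"]), int(match["d"]))
    return int(match["week"]), match["model"], submitted


def infer_submission_dates(filenames):
    """Map each filename to its embedded date, filling gaps from neighbors."""
    parsed = {name: parse_filename(name) for name in filenames}
    known = [(week, sub) for week, _, sub in parsed.values() if sub is not None]
    result = {}
    for name, (week, _, submitted) in parsed.items():
        if submitted is None and known:
            # Assumed rule: take the closest epiweek with a known date and
            # shift it by one week per epiweek of difference.
            ref_week, ref_date = min(known, key=lambda k: abs(k[0] - week))
            submitted = ref_date + timedelta(weeks=week - ref_week)
        result[name] = submitted
    return result
```

With the made-up files above, `infer_submission_dates` assigns 2018-11-23 to EW47_model.csv, matching the 11/23 described in the caveat.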

lmullany commented 2 months ago

There is a substantial discussion here: https://github.com/orgs/Infectious-Disease-Modeling-Hubs/discussions/7

Based on that discussion, calls, etc., I am going to change the tasks.json to use origin_date as the round identifier. I will create a column origin_date (based on the current origin_epiweek, e.g. 2015-45); origin_epiweek will be retained as a column in the dataset. The files within the model-output/<team-model> subfolder will look like:

<yyyy-mm-dd>.parquet
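
A minimal sketch of how an origin_epiweek string such as 2015-45 could be mapped to a date, assuming the epiweeks are MMWR (CDC) weeks running Sunday to Saturday. Which day of the week origin_date should fall on is not stated above; the sketch assumes the Saturday that ends the epiweek, and the function names are hypothetical.

```python
from datetime import date, timedelta


def mmwr_week_start(year: int, week: int) -> date:
    """Sunday that starts the given MMWR epidemiological week."""
    jan4 = date(year, 1, 4)
    # MMWR week 1 is the Sunday-to-Saturday week that contains January 4
    # (equivalently, the first week with at least four days in the new year).
    week1_start = jan4 - timedelta(days=(jan4.weekday() + 1) % 7)
    return week1_start + timedelta(weeks=week - 1)


def origin_date_from_epiweek(origin_epiweek: str) -> date:
    """Convert a 'yyyy-ww' string to an origin_date."""
    year, week = (int(part) for part in origin_epiweek.split("-"))
    # Assumption: origin_date is the Saturday that ends the epiweek.
    return mmwr_week_start(year, week) + timedelta(days=6)
```

If origin_date should instead be the Sunday that starts the week, `mmwr_week_start` can be used directly.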

annakrystalli commented 2 months ago

I would suggest we try to conform to the standard hub convention of <round_id>-<model_id>.parquet as much as possible for filenames.

I know that in a standard hub model_id = team_abbr-model_abbr, which is not available for all files here. Some options:

- We could just use whatever is available as model_id. This wouldn't cause any issues for accessing data with hubData but could affect cloud data transforms.
- We could use @bsweger's suggestion of just duplicating information where we've got it.
- A last suggestion: in the deep dive into the model documentation to generate model metadata files, team_abbr and model_abbr are determined for all models, and that info is used to generate model_ids for each. This could easily be done at a later date.

lmullany commented 2 months ago

We can do that @annakrystalli, but we should update the documentation if that is what we want hubs to do. The documentation indicates that the file should be <round-id>.parquet

lmullany commented 2 months ago

For now, I'll write the code that pushes the hubverse-formatted data to the repo so that it conforms to <round_id>-<model_id>.parquet, using whatever we currently have in model_id. Later, when we create a lookup table that maps season and model_id to team_abbr and model_abbr, I'll convert to that.
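
As a rough illustration of the <round_id>-<model_id>.parquet convention with origin_date as the round identifier, the sketch below builds and splits such filenames. The helper names are hypothetical and are not part of any hubverse package.

```python
from datetime import date


def model_output_filename(round_id: date, model_id: str) -> str:
    """Build '<round_id>-<model_id>.parquet' from an origin_date and model_id."""
    return f"{round_id.isoformat()}-{model_id}.parquet"


def split_model_output_filename(filename: str) -> tuple[date, str]:
    """Recover (round_id, model_id) from a filename built as above."""
    stem = filename.removesuffix(".parquet")
    # round_id is a fixed-width yyyy-mm-dd prefix, so hyphens inside
    # model_id do not make the split ambiguous.
    return date.fromisoformat(stem[:10]), stem[11:]
```

Because the date prefix has a fixed width, a model_id that itself contains hyphens (for example KoT-adaptive) still round-trips cleanly.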

annakrystalli commented 2 months ago

> We can do that @annakrystalli, but we should update the documentation if that is what we want hubs to do. The documentation indicates that the file should be <round-id>.parquet

Good catch! That definitely needs to be updated!

lmullany commented 2 months ago

Closing via #22

hubverse-org / flusight_hub_archive

selection of round identifier #10