hubverse-org / flusight_hub_archive

Hubversion of FluSight 1 (2015-2019)
MIT License

selection of round identifier #10

Closed lmullany closed 2 months ago

lmullany commented 2 months ago

Is the selection of the origin_epiweek, as the main round identifier, the right choice?

Currently, the origin_epiweek is the main "round" identifier, but there are a few other possible choices. (Almost) all submission files have two pieces of information embedded in the filename: an epiweek prefix (EWXX) and a submission date.

Some caveats:

  1. The submission date in the filename is sometimes missing, and in a small number of cases I inferred this date from the pattern of submissions by that model/team around this epiweek. For example, let's say we have submission files that look like this (dates made up):
    • EW46_model_2018_11_16.csv
    • EW47_model.csv
    • EW48_model_2018_11_30.csv

Then I infer the submission date as 11/23 for EW47_model.csv. Note that whatever decision I make here currently has no bearing: I'm not using the submission date to identify the round, only the prefix EWXX, and that was not missing for any files.
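The inference described above can be sketched roughly as follows. This is only an illustration, not the code used in the repo: the helper names `parse` and `infer_missing_dates` are hypothetical, the filename pattern is assumed from the made-up examples, and the sketch ignores the EW52 to EW01 wrap-around at the turn of the year.

```python
import re
from datetime import date, timedelta

# Assumed pattern: EW<week>_<model>[_<yyyy>_<m>_<d>].csv (date part optional).
PATTERN = re.compile(r"^EW(\d+)_.+?(?:_(\d{4})_(\d{1,2})_(\d{1,2}))?\.csv$")


def parse(name):
    """Return (epiweek number, submission date or None) for one filename."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognised filename: {name}")
    week = int(m.group(1))
    if m.group(2) is None:
        return week, None
    return week, date(int(m.group(2)), int(m.group(3)), int(m.group(4)))


def infer_missing_dates(names):
    """Fill missing submission dates from the nearest dated neighbours."""
    parsed = {n: parse(n) for n in names}
    known = sorted((w, d) for w, d in parsed.values() if d is not None)
    result = {}
    for name, (week, day) in parsed.items():
        if day is None:
            before = [(w, d) for w, d in known if w < week]
            after = [(w, d) for w, d in known if w > week]
            if before and after:
                # Interpolate linearly between the two surrounding submissions.
                (w0, d0), (w1, d1) = before[-1], after[0]
                offset = round((d1 - d0).days * (week - w0) / (w1 - w0))
                day = d0 + timedelta(days=offset)
            elif before:
                # Only an earlier neighbour: assume one submission per week.
                day = before[-1][1] + timedelta(weeks=week - before[-1][0])
            elif after:
                day = after[0][1] - timedelta(weeks=after[0][0] - week)
        result[name] = day
    return result


files = ["EW46_model_2018_11_16.csv", "EW47_model.csv", "EW48_model_2018_11_30.csv"]
inferred = infer_missing_dates(files)
```

With the three made-up files above, `inferred["EW47_model.csv"]` comes out as 2018-11-23, matching the 11/23 in the example.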

  2. The "submission date" embedded in the filename may or may not align with when the file was submitted to the repo. I'm not sure there is any way to consistently identify when each file was actually submitted.

  3. Some files for separate "rounds", i.e. EWXX values, have the same "submission date" embedded in the filename. For example:
    • EW02-KoT-adaptive-2020-2-5.csv
    • EW03-KoT-adaptive-2020-2-5.csv

  4. The "submission date" embedded in the filename does not necessarily fall on a consistent day of the week. While some weekdays are more highly represented than others, all are of course possible, and it's not clear to me whether we could somehow leverage these dates to reference, say, a particular Saturday prior to submission.
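To check caveats like the last two across the archive, one could tabulate the embedded dates. The following is a rough sketch only: the function names are hypothetical, and the filename pattern is assumed from the two KoT examples above (real files may use other separators or layouts).

```python
import re
from collections import Counter, defaultdict
from datetime import date

# Assumed layout: EW<week><sep><model><sep><yyyy><sep><m><sep><d>.csv, sep is - or _
PATTERN = re.compile(
    r"^EW(\d+)[-_](.+?)[-_](\d{4})[-_](\d{1,2})[-_](\d{1,2})\.csv$"
)
# Hard-coded so the output does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def parse_submission(name):
    """Return (epiweek number, model name, embedded submission date)."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognised filename: {name}")
    week, model, y, mo, d = m.groups()
    return int(week), model, date(int(y), int(mo), int(d))


def shared_dates(names):
    """Map each submission date to its epiweeks, keeping dates used by 2+ weeks."""
    by_date = defaultdict(list)
    for name in names:
        week, _, day = parse_submission(name)
        by_date[day].append(week)
    return {day: sorted(weeks) for day, weeks in by_date.items() if len(weeks) > 1}


def weekday_counts(names):
    """Count how often each weekday appears among the embedded dates."""
    return Counter(WEEKDAYS[parse_submission(n)[2].weekday()] for n in names)


files = ["EW02-KoT-adaptive-2020-2-5.csv", "EW03-KoT-adaptive-2020-2-5.csv"]
duplicates = shared_dates(files)
weekdays = weekday_counts(files)
```

For the two example files, `duplicates` flags 2020-02-05 as shared by epiweeks 2 and 3, and both dates fall on a Wednesday.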

lmullany commented 2 months ago

There is a substantial discussion here: https://github.com/orgs/Infectious-Disease-Modeling-Hubs/discussions/7

Based on that discussion, calls, etc., I am going to change the tasks.json to use origin_date as the round identifier. I will create a column origin_date (based on the current origin_epiweek, e.g. 2015-45); origin_epiweek will be retained as a column in the dataset. The files within the model-output/<team-model> subfolder will look like:

<yyyy-mm-dd>.parquet
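
A rough sketch of how an origin_epiweek string such as 2015-45 could be turned into a calendar date. It assumes origin_epiweek follows the MMWR/CDC epiweek convention (weeks run Sunday through Saturday, and week 1 is the week containing January 4). The thread does not say which day of the week origin_date should be, so this hypothetical helper returns both the Sunday that starts the week and the Saturday that ends it.

```python
from datetime import date, timedelta


def epiweek_to_dates(origin_epiweek):
    """Convert 'YYYY-WW' to (Sunday start, Saturday end), assuming MMWR weeks."""
    year_str, week_str = origin_epiweek.split("-")
    year, week = int(year_str), int(week_str)
    # MMWR week 1 is the Sunday-to-Saturday week that contains January 4.
    jan4 = date(year, 1, 4)
    days_since_sunday = (jan4.weekday() + 1) % 7  # weekday(): Monday is 0
    week1_start = jan4 - timedelta(days=days_since_sunday)
    start = week1_start + timedelta(weeks=week - 1)
    return start, start + timedelta(days=6)


start, end = epiweek_to_dates("2015-45")
print(start.isoformat(), end.isoformat())
```

Under these assumptions, 2015-45 maps to the week of 2015-11-08 through 2015-11-14; whichever of those (or another offset) is chosen as origin_date would then form the <yyyy-mm-dd> part of the filename.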

annakrystalli commented 2 months ago

I would suggest we try to conform to the standard hub convention of <round_id>-<model_id>.parquet as much as possible for filenames.

I know that in a standard hub model_id = team_abbr-model_abbr, which is not available for all files here. Some options:

lmullany commented 2 months ago

We can do that @annakrystalli, but we should update the documentation if that is what we want hubs to do. The documentation indicates that the file should be <round-id>.parquet

lmullany commented 2 months ago

For now, I'll write the code that pushes the hubverse-formatted data to the repo so that it conforms to <round_id>-<model_id>.parquet, using whatever we currently have in model_id. Later, when we create a lookup table that maps season and model_id to team_abbr and model_abbr, I'll convert to that.
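
A minimal sketch of the naming described above, for illustration only. The function name is hypothetical, the round_id is assumed to be an ISO origin_date, the subfolder is assumed to be named after model_id, and the example date is made up.

```python
from datetime import date
from pathlib import PurePosixPath


def model_output_path(round_id, model_id, root="model-output"):
    """Build <root>/<model_id>/<round_id>-<model_id>.parquet."""
    # Raises ValueError if round_id is not a valid yyyy-mm-dd date.
    date.fromisoformat(round_id)
    return PurePosixPath(root) / model_id / f"{round_id}-{model_id}.parquet"


path = model_output_path("2015-11-14", "KoT-adaptive")
print(path)
```

Once a lookup table from season and model_id to team_abbr and model_abbr exists, the same function could be called with the combined team_abbr-model_abbr string in place of the current model_id.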

annakrystalli commented 2 months ago

> We can do that @annakrystalli, but we should update the documentation if that is what we want hubs to do. The documentation indicates that the file should be <round-id>.parquet

Good catch! That definitely needs to be updated!

lmullany commented 2 months ago

Closing via #22