Closed lmullany closed 2 months ago
There is a substantial discussion here: https://github.com/orgs/Infectious-Disease-Modeling-Hubs/discussions/7
Based on that discussion, calls, etc., I am going to change the tasks.json to include origin_date as the round identifier. I will create a column origin_date (based on the current origin_epiweek, e.g. 2015-45); origin_epiweek will be retained as a column in the dataset. The files within the model-output/<team-model> subfolder will look like:
<yyyy-mm-dd>.parquet
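As a rough sketch of the conversion described above, the following turns an origin_epiweek string such as 2015-45 into a date and a filename. It assumes MMWR-style epiweeks (Sunday to Saturday, week 1 containing January 4) and assumes origin_date is the Sunday that starts the week; the thread does not say which day is used, and the function names are hypothetical.

```python
from datetime import date, timedelta


def epiweek_to_origin_date(origin_epiweek: str) -> date:
    """Return the Sunday that starts the given epiweek ("YYYY-WW").

    Assumption: MMWR convention, where week 1 is the Sunday-to-Saturday
    week that contains January 4.
    """
    year_str, week_str = origin_epiweek.split("-")
    year, week = int(year_str), int(week_str)
    jan4 = date(year, 1, 4)
    # date.weekday() is Monday=0 ... Sunday=6, so shift to "days since Sunday".
    days_since_sunday = (jan4.weekday() + 1) % 7
    week1_start = jan4 - timedelta(days=days_since_sunday)
    return week1_start + timedelta(weeks=week - 1)


def output_filename(origin_epiweek: str) -> str:
    """Build the <yyyy-mm-dd>.parquet name for an epiweek."""
    return f"{epiweek_to_origin_date(origin_epiweek).isoformat()}.parquet"


print(output_filename("2015-45"))  # 2015-11-08.parquet
```

If the hub instead anchors origin_date on the Saturday that ends the week, adding six days to the result would give that date.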
I would suggest we try to conform to the standard hub convention of <round_id>-<model_id>.parquet as much as possible for filenames.
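A minimal sketch of building and parsing names in the <round_id>-<model_id>.parquet form, assuming round_id is a yyyy-mm-dd date so that it can be separated from a model_id that may itself contain hyphens. The helper names and the example model_id are hypothetical.

```python
import re

# round_id is assumed to be an ISO date; everything after the next hyphen
# up to the extension is treated as model_id.
FILENAME_RE = re.compile(
    r"^(?P<round_id>\d{4}-\d{2}-\d{2})-(?P<model_id>.+)\.parquet$"
)


def build_filename(round_id: str, model_id: str) -> str:
    """Join round_id and model_id into a hub-style filename."""
    return f"{round_id}-{model_id}.parquet"


def parse_filename(name: str) -> tuple[str, str]:
    """Split a hub-style filename back into (round_id, model_id)."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a <round_id>-<model_id>.parquet name: {name!r}")
    return match["round_id"], match["model_id"]


name = build_filename("2015-11-08", "teamA-modelB")  # hypothetical model_id
print(name)
print(parse_filename(name))
```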
I know that in a standard hub model_id = team_abbr-model_abbr, which is not available for all files here. Some options:
- Use the existing model_id as is. This wouldn't cause any issues for accessing data with hubData but could affect cloud data transforms.
- team_abbr and model_abbr are determined for all models, and that info is used to generate model_ids for each. This could easily be done at a later date.
We can do that @annakrystalli, but we should update the documentation if that is what we want hubs to do. The documentation indicates that the file should be <round-id>.parquet
For now, I'll write the code that pushes the hubverse-formatted data to the repo so that it conforms to <round_id>-<model_id>.parquet, using whatever we currently have in model_id. Later, when we create a lookup table that maps season and model_id to team_abbr and model_abbr, I'll convert to that.
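One way the later lookup step might work, sketched with a plain dictionary keyed on (season, model_id). The table contents, season labels, and function name are all hypothetical; the fallback simply keeps the current model_id when no mapping exists yet.

```python
# Hypothetical lookup table: (season, model_id) -> (team_abbr, model_abbr).
LOOKUP = {
    ("2015/2016", "model_a"): ("teamx", "modela"),
}


def resolve_model_id(season: str, model_id: str, lookup: dict) -> str:
    """Return team_abbr-model_abbr if mapped, else the current model_id."""
    pair = lookup.get((season, model_id))
    if pair is None:
        # No mapping yet: keep whatever is currently in model_id.
        return model_id
    team_abbr, model_abbr = pair
    return f"{team_abbr}-{model_abbr}"


print(resolve_model_id("2015/2016", "model_a", LOOKUP))  # teamx-modela
print(resolve_model_id("2015/2016", "model_b", LOOKUP))  # model_b
```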
We can do that @annakrystalli, but we should update the documentation if that is what we want hubs to do. The documentation indicates that the file should be <round-id>.parquet
Good catch! That definitely needs to be updated!
Closing via #22
Is the selection of the origin_epiweek, as the main round identifier, the right choice?
Currently, the origin_epiweek is the main "round" identifier, but there are a few other possible choices. (Almost) all submission files have two pieces of information embedded in the filename.
Some caveats:
- Then, I infer the submission date as 11/23 for EW47_model.csv. Note that whatever decision I make here currently has no bearing, because I'm not using the submission date to identify the round; I'm only using the prefix EWXX, and that was not missing for any files.
- The "submission date" embedded in the filename may or may not align with when the file was submitted to the repo. I'm not sure there is any way to consistently identify when each file was actually submitted.
- Some files for separate "rounds" (i.e. different EWXX values) have the same "submission date" embedded in the filename. For example
- The "submission date" embedded in the filename does not necessarily fall on a consistent day of the week. While some weekdays are more highly represented than others, all are of course possible, and it's not clear to me whether we could somehow leverage these to reference, say, a particular Saturday prior to submission.
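To make the caveats concrete, here is a small sketch of the two checks involved: reading the EWXX prefix that identifies the round, and tallying which weekdays a set of submission dates falls on. The exact filename layout for the embedded date is not shown in this thread, so the dates are passed in directly; the sample dates and helper names are hypothetical.

```python
import re
from collections import Counter
from datetime import date

EW_PREFIX_RE = re.compile(r"^EW(\d{1,2})")
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def epiweek_from_filename(name: str):
    """Return the week number from an EWXX prefix, or None if absent."""
    match = EW_PREFIX_RE.match(name)
    return int(match.group(1)) if match else None


def weekday_counts(dates) -> Counter:
    """Count how many submission dates fall on each day of the week."""
    return Counter(DAY_NAMES[d.weekday()] for d in dates)


print(epiweek_from_filename("EW47_model.csv"))

# Hypothetical submission dates, only to show the tally.
sample_dates = [date(2015, 11, 23), date(2015, 11, 24), date(2015, 11, 30)]
print(weekday_counts(sample_dates))
```

A tally like this would show how uneven the weekday distribution is before deciding whether the dates can be snapped to a reference day.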