Closed nickreich closed 6 months ago
Following up on this -- more generally, I think we should decide on how both the model name and the round id are represented (these are the two pieces of information that are encoded in file paths). Recall that in our current proposal at the link Nick provided just above, the model outputs are stored in files with the following organizational structure:
* model-output directory: e.g., forecast data or scenario projections produced by participating models/teams
  1. team1-modela
     * `<round-id1>.csv` (or parquet, etc.)
     * `<round-id2>.csv` (or parquet, etc.)
  2. team1-modelb
     * `<round-id1>.csv` (or parquet, etc.)
  3. team2-modela
     * `<round-id1>.csv` (or parquet, etc.)
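To make concrete which two pieces of information the layout above encodes in a file path, here is a minimal Python sketch. The function name `parse_model_output_path` is hypothetical and is not part of any hub tooling; it only assumes the `<model-dir>/<round-id>.<ext>` layout shown in the proposal.

```python
from pathlib import PurePosixPath


def parse_model_output_path(path):
    """Return (model_dir, round_id) for a path shaped like
    model-output/<team>-<model>/<round-id>.<ext> (layout assumed from the proposal)."""
    p = PurePosixPath(path)
    if len(p.parts) < 2:
        raise ValueError(f"expected <model-dir>/<round-id>.<ext>, got {path!r}")
    model_dir = p.parent.name  # directory name, e.g. "team1-modela"
    round_id = p.stem          # file name without its extension
    return model_dir, round_id


print(parse_model_output_path("model-output/team1-modela/round-id1.csv"))
```

Nothing inside the file is consulted here, which is exactly the gap the guiding principle below is meant to address.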
I propose the following as a guiding principle: model output submission files should contain enough data (within the file) to uniquely identify different forecasts.
I think that this suggests the following rules about representation of models and round ids:
1. In all hubs, submission files should contain columns that uniquely identify the team/model that generated the model outputs.
   a. Separate columns for the `team_abbr` and the `model_abbr`. I lean this way since it's easier to paste two columns together than to split them apart, but I could easily be convinced to go the other way too.
   b. A single `team_model_abbr` column where the value is of the form `<team_abbr>-<model_abbr>`, instead of separate `team_abbr` and `model_abbr` columns in the file.
2. It is not required for hubs to put the round id as a column in the submission file, but it is required that there are not multiple rounds that have the same combinations of values for all task id variables within one hub.
   a. Hubs with `"round_id_from_variable": true` will have the variable corresponding to the round id in submission files anyway.
   b. Hubs with `"round_id_from_variable": false` could always list the round id as a task id variable at their discretion (and this might be a good practice?).

@elray1, this makes sense.
I think it might be good to state some explicit design principles. In this case, you're sacrificing file size to gain clarity, which is a reasonable tradeoff. I can imagine an effort that might make a different choice: I worked on a project once where metadata was preserved by wrapping the entire data file in an XML block. This keeps the metadata close at hand, but makes life difficult for human access and editing. Talking about these decisions in terms of tradeoffs between parsimony/file size, data fidelity, and ease of use might be one way to frame these discussions.
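The guiding principle proposed earlier in the thread (a file should carry enough data to uniquely identify forecasts) can be illustrated with a small Python sketch. The helper `duplicated_keys` and the task id columns `origin_date` and `location` are made up for illustration; only `team_abbr` and `model_abbr` come from the discussion.

```python
from collections import Counter


def duplicated_keys(rows, key_columns):
    """Return the key tuples that occur more than once across rows."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Two hypothetical submissions that differ only in the model that produced them.
rows = [
    {"team_abbr": "team1", "model_abbr": "modela", "origin_date": "2022-10-01", "location": "US"},
    {"team_abbr": "team1", "model_abbr": "modelb", "origin_date": "2022-10-01", "location": "US"},
]

# With team/model columns in the key, the rows are distinguishable.
print(duplicated_keys(rows, ["team_abbr", "model_abbr", "origin_date", "location"]))
# Without them, the two rows collide on the task id variables alone.
print(duplicated_keys(rows, ["origin_date", "location"]))

# Pasting two columns into one combined identifier is a one-liner.
team_model_abbr = [f"{r['team_abbr']}-{r['model_abbr']}" for r in rows]
print(team_model_abbr)
```

Splitting a combined value back into two parts is only unambiguous if neither abbreviation contains a hyphen, which is one reading of the "easier to paste than to split" argument.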
Summing up a few misc. comments from in-person conversation:
Comments re: model/team name:
Comments re: round id:
Clarification. My suggestion was intended as including the team name in addition to the round name. Including both in both the file name and the file would sacrifice brevity while providing greater flexibility and clarity.
Here's another proposal for handling of model names, feeding off of comments in our meeting and a little more sidebar discussion later:
1. `hubUtils::load_predictions` will produce as output data frames that are assured to have these columns after reading predictions into an environment like an R or Python session.
2. The job of the `load_forecasts` function will be to standardize the representation of the data by collecting the model/team metadata and appending it as columns in the data if necessary. Possible representations include the following (and I would be OK with picking one, likely just 2.a., to start with):
   a. The original proposal, where the model/team name is encoded in file paths, but does not necessarily appear within the file. (Need to do some more reading/experimentation to be sure I understand, but note that this may be supported already in parquet.)
   b. The alternate proposal I made above, where data files actually have model/team abbreviations as columns.
   c. Possibly other schemes that could be introduced later. For example, Mike mentioned an idea about using a JSON format to store prediction data; such a JSON format might use another mechanism to encode the model/team name information, and a `load_predictions` function could unpack that information into a column in a data frame.

Still to discuss/decide:

* Should `load_predictions` use two columns (`team_abbr`/`team` and `model_abbr`/`model`) or one column (`team_model_abbr`, `team_model`, or maybe just `model`) to capture this information?

I agree too.

* (`team_abbr` and `model_abbr` sounds good)

After further discussion, we decided on a single column for team and model information -- in practice, we will want a combined model and team identifier for purposes of plotting, scoring, and ensemble weighting, which are our more common operations with model output data. Having a single column with this information will simplify those tasks.
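As a rough illustration of the loading step described above, the following Python sketch reads CSV files laid out as `<model-dir>/<round-id>.csv` and guarantees a single combined `model_id` column in the result. It is not the `hubUtils` implementation: the Python function `load_predictions` here is a hypothetical stand-in, and the column name `model_id` is taken from later comments in this thread.

```python
import csv
from pathlib import Path


def load_predictions(model_output_dir):
    """Read every <model-dir>/<round-id>.csv under model_output_dir and return
    a list of row dicts that all carry a combined `model_id` column."""
    rows = []
    for csv_path in sorted(Path(model_output_dir).glob("*/*.csv")):
        model_id = csv_path.parent.name  # taken from the directory name
        with csv_path.open(newline="") as f:
            for row in csv.DictReader(f):
                # Append the identifier only "if necessary": a value already
                # present in the file is left untouched.
                row.setdefault("model_id", model_id)
                rows.append(row)
    return rows
```

With one combined column, grouping for plotting, scoring, or ensemble weighting needs only a single key, which is the simplification the decision above is after.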
Unclear what is to be done. Should the following be added to documentation? Any specifics about where to add it?
@annakrystalli This is prompting a question for me. Should the recommendation be to never include a model-id column in the submitted file since it will always be implicit in the directory name? And if it were included then `collect()` might include it twice, right? Or is there an allowance for additional columns that are not specified in the schema and that could be removed prior to `collect()`ion?
r.e. question just above -- if we want to ensure that submitted model output files are in the right folder, we may need this as a column in the submitted file or a part of the file name?
If we have decided on model id as the standard here (which concatenates a team abbreviation and model abbreviation), maybe we need to update the Model metadata documentation to reflect this.
We recently discussed this question too in `hubValidations` here: https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/discussions/13
In `hubValidations` I am assuming that file names follow one of these patterns: `<round_id>-<team_abbr>-<model_abbr>` or `<round_id>-<model_id>`
Regarding including `model_id` within files, that's fine. When opening a connection to the hub we would just need to set the `partition` argument to `NULL`. If not, I think you might get a cryptic error because the connection wouldn't know how to handle the `model_id` column in the data in conjunction with the partitioning `model_id` column hubUtils attempts to create from the partition.
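The clash described above, between a `model_id` column inside the file and one derived from the directory, can be pictured with a plain Python sketch. This does not reproduce how hubUtils or its underlying connection behaves; `attach_partition_model_id` is a hypothetical helper showing one way a loader could reconcile the two sources and fail with a readable message.

```python
def attach_partition_model_id(rows, partition_model_id):
    """Add model_id from the directory name to every row, accepting an in-file
    model_id only when it agrees with the directory-derived value."""
    out = []
    for i, row in enumerate(rows):
        in_file = row.get("model_id")
        if in_file is not None and in_file != partition_model_id:
            raise ValueError(
                f"row {i}: in-file model_id {in_file!r} does not match "
                f"directory {partition_model_id!r}"
            )
        # Copy the row so the caller's data is not modified in place.
        out.append({**row, "model_id": partition_model_id})
    return out
```

Rows that already carry a matching `model_id` pass through with one column rather than two.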
ok, thanks. sounds like it's easier not to put this into the file name, at least for now.
@elray1, you mean: "...easier not to put this into the FILE." right? We still want this information in the file name so we can check a file is being submitted to the correct folder?
yes, sorry, and thanks :)
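The folder check mentioned above could look roughly like the following Python sketch, assuming the `<round_id>-<model_id>` file name pattern and a folder named after the model id. The function `round_id_from_file_name` is hypothetical and is not the `hubValidations` implementation.

```python
from pathlib import PurePosixPath


def round_id_from_file_name(path):
    """Check that a file named <round_id>-<model_id>.<ext> sits in the folder
    named <model_id>, and return the round id."""
    p = PurePosixPath(path)
    model_id = p.parent.name
    suffix = "-" + model_id
    # The stem must end with "-<model_id>" and leave a non-empty round id.
    if not p.stem.endswith(suffix) or len(p.stem) == len(suffix):
        raise ValueError(
            f"file name {p.name!r} does not match folder {model_id!r}"
        )
    return p.stem[: -len(suffix)]


print(round_id_from_file_name("model-output/team1-modela/2022-10-01-team1-modela.csv"))
```

Matching on the folder name as a suffix avoids having to split on hyphens, which would be ambiguous when the round id is itself a hyphenated date.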
This seems that it is solved (at least for now). Can I close this issue?
I suggest that we make one small update and then close this issue: on this page, when describing the template model metadata schema file, let's say that a hub's model metadata schema file should include either the single combined field `model_id`, or both of the fields `team_abbr` and `model_abbr`.
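The rule suggested above can be expressed as a small check. This Python sketch treats a model metadata record as a plain dict; `has_model_identifier` is a hypothetical helper, and it reads "either ... or" permissively, so a record containing all three fields also passes.

```python
def has_model_identifier(metadata):
    """True if metadata names the model via `model_id`, or via both
    `team_abbr` and `model_abbr`."""
    has_combined = "model_id" in metadata
    has_pair = "team_abbr" in metadata and "model_abbr" in metadata
    return has_combined or has_pair


print(has_model_identifier({"model_id": "team1-modela"}))
print(has_model_identifier({"team_abbr": "team1"}))
```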
Addressed in PR #101
From @elray1
There have also been ongoing discussions about whether or not the model name should be included explicitly in the file.