hubverse-org / hubDocs

https://hubverse.io

decide on where to include team/model name information. #2

Closed nickreich closed 6 months ago

nickreich commented 1 year ago

From @elray1

Note that the proposal [i.e. the current documentation] here does not include the team and model name in the file name. To discuss:

  • advantages:
    • eliminates redundant information
    • means we do not need to do any regular expression parsing of file names to extract information about team name and round id
  • disadvantages:
    • removes an opportunity to validate that submissions are in the correct folder (e.g., team didn't mix up forecasts from different models they are responsible for, team didn't drop their forecasts into someone else's folder)

There have also been ongoing discussions about whether or not the model name should be included explicitly in the file.

nickreich commented 1 year ago

relevant to content e.g. here

elray1 commented 1 year ago

Following up on this -- more generally, I think we should decide on how both the model name and the round id are represented (these are the two pieces of information that are encoded in file paths). Recall that in our current proposal at the link Nick provided just above, the model outputs are stored in files with the following organizational structure:

* model-output directory: e.g., Forecast data or scenario projections produced by participating models/teams
   1. team1-modela
      * <round-id1>.csv (or parquet, etc)
      * <round-id2>.csv (or parquet, etc)
   2. team1-modelb
      * <round-id1>.csv (or parquet, etc)
   3. team2-modela
      * <round-id1>.csv (or parquet, etc)

I propose the following as a guiding principle: model output submission files should contain enough data (within the file) to uniquely identify different forecasts.

I think that this suggests the following rules about representation of models and round ids:

  1. In all hubs, submission files should contain columns that uniquely identify the team/model that generated the model outputs.

    • This would mean that if we load forecasts/projections from two different models and "row bind" them, it will be guaranteed that there are no duplicate rows from different models. Additionally, note that we will often need the model in common workflows, e.g. plotting forecasts from different models or building trained ensembles. We will therefore typically want to have the model in data objects with model outputs, and having them stored as data in the submission files means that we don't need to insert special logic to handle this when loading model outputs.
    • I see two options for what this could look like. I propose that we pick one of them and require all hubs to use this standard. I prefer option a. since it's easier to paste two columns together than to split them apart, but I could easily be convinced to go the other way too.
      1. Separate columns for the team_abbr and the model_abbr.
      2. A single team_model_abbr column where the value is of the form <team_abbr>-<model_abbr>
    • In either case, we could validate that the file is in the right folder since the folder name should match the team_abbr and model_abbr in the file.
    • A limitation of this proposal is that it means submission files are larger than necessary, since they contain the same value duplicated many times. I have two responses to this criticism: (1) I think the cost in increased file size is worth the gain in clarity of data representation; and (2) a hub that is concerned about file sizes could use a file format like parquet that has compressed representation of duplicated values in columns.
  2. It is not required for hubs to put the round id as a column in the submission file, but it is required that there are not multiple rounds that have the same combinations of values for all task id variables within one hub.

    • This would mean that if we load forecasts/projections from two different rounds and "row bind" them, it will be guaranteed that there are no duplicate rows from different rounds.
    • This would also mean that if we ever needed to reconstruct the round id corresponding to a particular model output row, we could do so by looking at the task id variable/value combinations for different rounds and identifying which round had the combination that shows up in the model output row we're inspecting.
    • In practice, forecast hubs that use "round_id_from_variable": true will have the variable corresponding to the round id in submission files anyway.
    • Hubs that use "round_id_from_variable": false could always list the round id as a task id variable at their discretion (and this might be a good practice?).
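
A minimal sketch of the checks implied by rules 1 and 2 above, using plain Python lists of dicts as a stand-in for data frames. The task id columns (`origin_date`, `location`, `horizon`) and all helper names are hypothetical, chosen only for illustration; `team_abbr` and `model_abbr` follow option a.

```python
from collections import Counter

# Hypothetical task id columns for an example hub.
TASK_IDS = ["origin_date", "location", "horizon"]


def row_bind(*tables):
    """Concatenate lists of row dicts (a stand-in for a data-frame row bind)."""
    return [row for table in tables for row in table]


def duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once in `rows`."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def rows_in_wrong_folder(rows, folder_name):
    """Return rows whose team_abbr/model_abbr do not match the folder name."""
    return [
        row for row in rows
        if f"{row['team_abbr']}-{row['model_abbr']}" != folder_name
    ]


model_a = [{"team_abbr": "team1", "model_abbr": "modela",
            "origin_date": "2022-10-01", "location": "US", "horizon": 1,
            "value": 10.0}]
model_b = [{"team_abbr": "team1", "model_abbr": "modelb",
            "origin_date": "2022-10-01", "location": "US", "horizon": 1,
            "value": 12.0}]

combined = row_bind(model_a, model_b)

# Without the model columns, the two rows collide on the task id values.
dups_without_model = duplicate_keys(combined, TASK_IDS)
# With the model columns included in the key, every row is unique.
dups_with_model = duplicate_keys(combined, ["team_abbr", "model_abbr"] + TASK_IDS)
# Rows from modelb dropped into the modela folder are flagged.
misplaced = rows_in_wrong_folder(model_b, "team1-modela")
```

The same `duplicate_keys` helper could be applied across rounds to check rule 2: if two rounds never share a combination of task id values, binding their rows produces no duplicate keys.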
harryhoch commented 1 year ago

@elray1, this makes sense.

I think it might be good to state some explicit design principles. In this case, you're sacrificing file size to gain some unambiguous clarity. This is a reasonable tradeoff. I can imagine an effort that might make a different choice: I worked on a project once where metadata was preserved by wrapping the entire data file in an XML block. This keeps the metadata close at hand, but makes life difficult for human access and editing. Talking about these decisions in terms of tradeoffs between parsimony/file size, data fidelity, and ease of use might be one way to frame these discussions.

elray1 commented 1 year ago

Summing up a few misc. comments from in person conversation:

comments r.e. model/team name:

comments r.e. round id

harryhoch commented 1 year ago

Clarification: my suggestion was to include the team name in addition to the round name. Including both in both the file name and the file would sacrifice brevity while providing greater flexibility and clarity.

elray1 commented 1 year ago

Here's another proposal for handling of model names, feeding off of comments in our meeting and a little more sidebar discussion later:

  1. we define a standard prediction data object format that includes the information about the model/team as columns (TBD whether this is one or two columns). Functions like hubUtils::load_predictions will produce as output data frames that are assured to have these columns after reading predictions into an environment like an R or python session.
  2. we support multiple options for back-end storage of these data. A particular hub may pick one of these options, and a task for a load_forecasts function will be to standardize the representation of the data by collecting the model/team metadata and appending as columns in the data if necessary. Possible representations include the following (and I would be OK with picking one, likely just 2.a., to start with):
     a. The original proposal, where the model/team name is encoded in file paths, but they do not necessarily appear within the file. (Need to do some more reading/experimentation to be sure I understand, but note that this may be supported already in parquet)
     b. The alternate proposal I made above where data files actually have model/team abbreviations as columns
     c. Possibly other schemes that could be introduced later. For example, Mike mentioned an idea about using a json format to store prediction data; such a json format might use another mechanism to encode the model/team name information, and a load_predictions function could unpack that information into a column in a data frame.
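
To make the loading step in this proposal concrete, here is a rough sketch in Python using only the standard library. It is not the hubUtils implementation: the function name `load_predictions_sketch`, the output column names `model_id` and `round_id`, and the CSV-only layout are all assumptions for illustration. It reads files stored as in option 2.a and fills in the model and round from the path only when the file does not already carry them (which also covers option 2.b).

```python
import csv
import tempfile
from pathlib import Path


def load_predictions_sketch(model_output_dir):
    """Read <model_output_dir>/<team>-<model>/<round-id>.csv into row dicts.

    The folder name and file stem are appended as `model_id` and `round_id`
    columns unless the file already provides columns with those names.
    """
    rows = []
    for csv_path in sorted(Path(model_output_dir).glob("*/*.csv")):
        model_id = csv_path.parent.name  # e.g. "team1-modela"
        round_id = csv_path.stem         # e.g. "2022-10-01"
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                # setdefault keeps any value already stored in the file.
                row.setdefault("model_id", model_id)
                row.setdefault("round_id", round_id)
                rows.append(row)
    return rows


# Usage: build a tiny throwaway hub and load it.
with tempfile.TemporaryDirectory() as tmp:
    for folder in ("team1-modela", "team2-modela"):
        model_dir = Path(tmp) / folder
        model_dir.mkdir()
        (model_dir / "2022-10-01.csv").write_text("location,value\nUS,1.5\n")
    predictions = load_predictions_sketch(tmp)
```

Values come back as strings because `csv.DictReader` does no type conversion; a real loader would also apply the hub's schema.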

Still to discuss/decide:

  1. Does this proposal represent enough of a consensus that we can move ahead with it?
  2. Should the output of load_predictions use two columns (team_abbr/team and model_abbr/model) or one column (team_model_abbr, team_model, or maybe just model) to capture this information?
  3. What is our thinking r.e. Mike J's example of a WNV forecast hub with the same forecast targets across multiple rounds? My opinion is that this is an example of a place where it would be a good idea to force the hub to have something like an "origin date" or "due date" column in the forecast files, so that after we load the forecast files from multiple different submission rounds, the data about when the forecasts were made are included.

nickreich commented 1 year ago
  1. I vote yes, to move ahead.
  2. I would vote for two columns.
  3. I concur with Evan's suggestion about having an "origin_date" column.

LucieContamin commented 1 year ago

I agree too.

  1. Yes, I think this proposal answers all our questions and represents a good consensus
  2. I would also vote for 2 columns (team_abbr and model_abbr sounds good)
  3. I agree with your suggestion to add a column storing this information, but I think we need a precise description of all the possible columns for date information, to avoid confusion and to avoid having 2 hubs with the same column name but different meanings for the value (or at least not without good documentation)

elray1 commented 1 year ago

After further discussion, we decided on a single column for team and model information -- in practice, we will want a combined model and team identifier for purposes of plotting, scoring, and ensemble weighting, which are our more common operations with model output data. Having a single column with this information will simplify those tasks.

mzorn-58 commented 1 year ago

Unclear what is to be done. Should the following be added to documentation? Any specifics about where to add it?

nickreich commented 1 year ago

@annakrystalli This is prompting a question for me. Should the recommendation be to never include a model-id column in the submitted file since it will always be implicit in the directory name? And if it were included then collect() might include it twice, right? or is there an allowance for additional columns that are not specified in the schema and that could be removed prior to collect()ion?

elray1 commented 1 year ago

r.e. question just above -- if we want to ensure that submitted model output files are in the right folder, we may need this as a column in the submitted file or a part of the file name?

elray1 commented 1 year ago

if we have decided on model id as the standard here (which concatenates a team abbreviation and model abbreviation), maybe we need to update the Model metadata documentation to reflect this.

annakrystalli commented 1 year ago

We recently discussed this question in hubValidations too, here: https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/discussions/13

In hubValidations I am assuming that file names follow the following pattern: <round_id>-<team_abbr>-<model_abbr> or <round_id>-<model_id>
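
As an illustration only (this is not hubValidations code), a check based on the `<round_id>-<team_abbr>-<model_abbr>` pattern could look like the Python sketch below. It assumes that the round id is an ISO date (YYYY-MM-DD) and that neither abbreviation contains a hyphen; without assumptions like these the hyphen-separated name cannot be split unambiguously. The function names are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed shape: <YYYY-MM-DD>-<team_abbr>-<model_abbr>, no hyphens in the abbrs.
FILE_NAME_PATTERN = re.compile(
    r"^(?P<round_id>\d{4}-\d{2}-\d{2})-(?P<team_abbr>[^-]+)-(?P<model_abbr>[^-]+)$"
)


def parse_file_name(path):
    """Split a submission file name into its parts, or return None."""
    match = FILE_NAME_PATTERN.match(PurePosixPath(path).stem)
    return match.groupdict() if match else None


def is_in_correct_folder(path):
    """True if the parent folder equals <team_abbr>-<model_abbr> from the name."""
    parts = parse_file_name(path)
    if parts is None:
        return False
    folder = PurePosixPath(path).parent.name
    return folder == f"{parts['team_abbr']}-{parts['model_abbr']}"


good = "model-output/team1-modela/2022-10-01-team1-modela.csv"
misplaced = "model-output/team2-modela/2022-10-01-team1-modela.csv"
```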

Regarding including model_id within files, that's fine. When opening a connection to the hub we would just need to set the partition argument to NULL. If not, I think you might get a cryptic error because the connection wouldn't know how to handle the model_id column in the data in conjunction with the partitioning model_id column hubUtils attempts to create from the partition.

elray1 commented 1 year ago

ok, thanks. sounds like it's easier not to put this into the file name, at least for now.

annakrystalli commented 1 year ago

@elray1 , you mean: "...easier not to put this into the FILE." right? We still want this information in the file name so we can check a file is being submitted to the correct folder?

elray1 commented 1 year ago

yes, sorry, and thanks :)

micokoch commented 7 months ago

This seems to be solved (at least for now). Can I close this issue?

elray1 commented 7 months ago

I suggest that we make one small update and then close this issue: on this page, when describing the template model metadata schema file, let's say that a hub's model metadata schema file should include either the single combined field model_id, or both of the fields team_abbr and model_abbr.
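
A small Python sketch of what that rule amounts to, assuming the model metadata schema is a JSON Schema document whose fields are listed under a top-level `properties` key (an assumption here), and reading "either ... or" inclusively. The function name is hypothetical.

```python
def schema_identifies_model(schema):
    """True if the schema declares model_id, or both team_abbr and model_abbr."""
    properties = schema.get("properties", {})
    has_combined = "model_id" in properties
    has_split = {"team_abbr", "model_abbr"} <= set(properties)
    return has_combined or has_split


combined_schema = {"properties": {"model_id": {"type": "string"}}}
split_schema = {"properties": {"team_abbr": {"type": "string"},
                               "model_abbr": {"type": "string"}}}
incomplete_schema = {"properties": {"team_abbr": {"type": "string"}}}
```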

micokoch commented 6 months ago

Addressed in PR #101