consider updating documentation about model output folder to use `model_id1`, `model_id2` and double check file names

hubverse-org / hubDocs

https://hubverse.io


consider updating documentation about model output folder to use `model_id1`, `model_id2` and double check file names #114

Closed elray1 closed 1 week ago

elray1 commented 2 months ago

Looking at this page: https://hubverse.io/en/latest/user-guide/model-output.html

Currently the folder and file structure are listed as follows:

team1-modela
- <round-id1>.csv (or parquet, etc)
- <round-id2>.csv (or parquet, etc)
team1-modelb
- <round-id1>.csv (or parquet, etc)
team2-modela
- <round-id1>.csv (or parquet, etc)

Two comments about this:

1. This implies a specific structure for model ids as <team_abbr>-<model_abbr>, but it may be clearer to just indicate here that the folder names correspond to model_ids, and we can discuss conventions about composition of model_id elsewhere.
2. Do file names have the format <round_id>.csv, or <round_id>-<model_id>.csv? I think we've decided to include model_id as a check that submissions landed in the right folder, but I'm not sure.
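
To illustrate the check mentioned in the second point, here is a minimal sketch, assuming the <round_id>-<model_id>.<ext> file name format and a folder named after the model_id. The function name and the example paths are hypothetical and are not taken from any hubverse package.

```python
from pathlib import PurePosixPath


def file_matches_folder(path: str) -> bool:
    """Check that a model-output file sits in the folder named after its model_id.

    Assumes the file name is <round_id>-<model_id>.<ext> and that the parent
    folder name is the model_id (both are assumptions for this sketch).
    """
    p = PurePosixPath(path)
    model_id = p.parent.name
    stem = p.stem  # file name without its final extension
    # The stem must end with "-<model_id>" and have a non-empty round_id before it.
    return stem.endswith(f"-{model_id}") and len(stem) > len(model_id) + 1


# Hypothetical paths: the first file is in the right folder, the second is not.
print(file_matches_folder("model-output/team1-modela/2024-01-06-team1-modela.csv"))
print(file_matches_folder("model-output/team1-modelb/2024-01-06-team1-modela.csv"))
```

With <round_id>.csv alone, no such check is possible, because the file name carries no model information.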

nickreich commented 2 months ago

I think that we have elsewhere indicated that <model_id> == <team_abbr>-<model_abbr> and that teams can choose one representation to use, as indicated in their model metadata schema file.
I don't recall the specifics of that decision, but I support <round_id>-<model_id>.csv or .parquet as a file format.

bsweger commented 2 months ago

Thanks for raising this!

My .02 on the first question, mostly from the perspective of how we'll move hub data to the cloud and open it up to a non-hubverse audience.

> This implies a specific structure for model ids as <team_abbr>-<model_abbr>, but it may be clearer to just indicate here that the folder names correspond to model_ids, and we can discuss conventions about composition of model_id elsewhere.

Removing the separate model-abbr and team-abbr columns from the "cloud transformed" model-output files in favor of a single model_id column simplifies the data conversion process. It does put the onus of parsing out team/model on data consumers, but I think it makes sense to favor the simple approach and revisit if we get feedback.

bsweger commented 2 months ago

> I don't recall the specifics of that decision, but I support <round_id>-<model_id>.csv or .parquet as a file format.

Agree with @nickreich's comment re: item 2 (especially if we agree to make YYYY-MM-DD the required format for round_id, since that creates a definitive way to parse out round and model from a model-output filename).

Again, this is from the perspective of a cloud-enabled hub. While model_id could be obtained via "directory" structure or from a column in the actual file, I can see how it would be handy to have that information encoded in the filename, especially if people lose the directory structure context when downloading data.

bsweger commented 2 months ago

It's been a week since anyone has chimed in, so I'm going to assume that we'll proceed with @nickreich and @elray1's suggestions above:

- model-output filenames will be in <round_id>-<model_id> format
- instead of trying to parse out model name and team name separately, the function that transforms cloud-based model-output files will generate a single column called model_id that contains anything after round_id in the filename

#2 reflects hubverse-transform work to address the latter.
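
For illustration only, the filename split described above could look roughly like the sketch below. It assumes round_id is always YYYY-MM-DD (still an open "if" in this thread), and the function name and regular expression are hypothetical, not the actual hubverse-transform code.

```python
import re
from pathlib import PurePosixPath

# Assumption: round_id is a YYYY-MM-DD date, so everything after it is the model_id.
FILENAME_RE = re.compile(r"^(?P<round_id>\d{4}-\d{2}-\d{2})-(?P<model_id>.+)$")


def parse_model_output_filename(filename: str) -> tuple[str, str]:
    """Split '<round_id>-<model_id>.<ext>' into (round_id, model_id)."""
    stem = PurePosixPath(filename).stem  # drop the directory and the final extension
    match = FILENAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected model-output filename: {filename!r}")
    return match["round_id"], match["model_id"]


# Hypothetical file name; the model_id is kept whole rather than split into team and model.
round_id, model_id = parse_model_output_filename("2024-01-06-team1-modela.parquet")
print(round_id, model_id)
```

Because the date has a fixed shape, a model_id that itself contains hyphens is still recovered unambiguously.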

mzorn-58 commented 1 month ago

This page: now shows structure as

OK to close issue? @elray1 @nickreich

mmkerr commented 3 weeks ago

this is similar to an issue Anna raised in closed issue #116

nickreich commented 1 week ago

Agree that this can be closed.