Create `validate_model_metadata()` function

hubverse-org / hubValidations

Testing framework for hubverse hub validations

https://hubverse-org.github.io/hubValidations/

Create `validate_model_metadata()` function #7

Closed annakrystalli closed 1 year ago

annakrystalli commented 1 year ago

Overview

This function will check the correctness of model metadata files. The checklist was compiled from the checks spreadsheet covering existing hubs and may have been superseded by current hubverse practices (and therefore require some updating). I think the core of the required functionality is there, though.

Each of the following checks requires its own `check_meta_*()` function that returns the output of `capture_check_cnd()`:

[ ] There is only one metadata file for each team-model
[x] Metadata file is in the 'model-metadata' folder
[x] Metadata file is using the .yml or .yaml extension
[x] Metadata filename is the same as model_id
[x] Metadata schema file exists in hub
[x] Metadata file is consistent with schema specifications
[ ] There is only one "primary"-designated model for a given team
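As a rough illustration of the file-level items in the checklist above (folder, extension, filename, and one file per team-model), here is a minimal sketch. It is written in Python purely for illustration; hubValidations is an R package, and the function names `check_metadata_file()` and `find_duplicate_model_ids()` are hypothetical, not part of its API. It returns plain booleans rather than the output of `capture_check_cnd()`.

```python
from pathlib import PurePosixPath

# Extensions accepted for model metadata files, per the checklist.
VALID_EXTENSIONS = {".yml", ".yaml"}


def check_metadata_file(path, model_id):
    """Run the file-level checks on one metadata file path.

    Returns a dict mapping a (hypothetical) check name to True/False.
    """
    p = PurePosixPath(path)
    return {
        "in_model_metadata_dir": p.parent.name == "model-metadata",
        "valid_extension": p.suffix in VALID_EXTENSIONS,
        "name_matches_model_id": p.stem == model_id,
    }


def find_duplicate_model_ids(paths):
    """Return model IDs that have more than one metadata file.

    Two files such as a-b.yml and a-b.yaml count as duplicates.
    """
    seen = {}
    for path in paths:
        p = PurePosixPath(path)
        if p.suffix in VALID_EXTENSIONS:
            seen.setdefault(p.stem, []).append(path)
    return {mid: files for mid, files in seen.items() if len(files) > 1}
```

Each check is kept separate so that a failure can be reported individually, which mirrors the one-function-per-check structure proposed above.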

elray1 commented 1 year ago

Noting that I think we've said these files would have the more specific .yml or .yaml file extension

elray1 commented 1 year ago

The last item on our scoping list ("There is only one 'primary'-designated model for a given team") is outdated. It checks that there is only one primary model per team, but we've said we would do away with the generic primary/secondary designations in favor of more function-specific designations:

  include_viz:
    description: >
      Indicator for whether the model should be included in the
      Hub's visualization
    type: boolean
  include_ensemble:
    description: >
      Indicator for whether the model should be included in the
      Hub's ensemble
    type: boolean
  include_eval:
    description: >
      Indicator for whether the model should be scored for inclusion in the
      Hub's evaluations
    type: boolean

Q: do we want to have any default checks on the number of models per team that are included in the evaluation, viz, or ensemble? Or hub-specific config settings to specify this?

elray1 commented 1 year ago

Three questions about versioning metadata schemas:

1. Are we recommending or requiring that hubs version their model metadata schemas?
2. If so, should we enforce a naming convention like hub-config/model-metadata-schema-v0.0.1.json? Would we recommend that older versions be preserved in case of any updates?
3. Should we also check that the submitted model metadata file matches the latest available schema provided by the hub?
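If hubs did adopt a versioned naming convention like the one floated above, finding the latest schema could look roughly like the sketch below. This is a Python illustration only, not hubValidations code; the helper name `latest_schema()` is hypothetical, and it assumes the filename pattern `model-metadata-schema-vX.Y.Z.json` taken from the example in the question.

```python
import re

# Assumed naming convention, taken from the example filename in the question.
SCHEMA_PATTERN = re.compile(r"^model-metadata-schema-v(\d+)\.(\d+)\.(\d+)\.json$")


def latest_schema(filenames):
    """Return the schema filename with the highest version, or None."""
    versioned = []
    for name in filenames:
        match = SCHEMA_PATTERN.match(name)
        if match:
            # Compare versions numerically so that v0.0.10 sorts after v0.0.2.
            version = tuple(int(part) for part in match.groups())
            versioned.append((version, name))
    if not versioned:
        return None
    return max(versioned)[1]
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where "0.0.10" sorts before "0.0.2".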

nickreich commented 1 year ago

Unless we see a clear use case for it right now, I would suggest starting simple and not versioning the metadata schema.

Again, I'd suggest starting simple and not including default checks for the number of models per team, but we could always add them later.

sbfnk commented 1 year ago

On team designation, number of models, etc.: I wonder if this question should be taken out of the metadata altogether. I would imagine that in the future hubs might have different criteria for inclusion (based on past performance, etc.) and scoring, as well as specific onboarding processes. Ultimately I think any decision on whether a model is scored and/or included in the ensemble should be down to the hub maintainers rather than individual teams.

On versioning I agree that we should keep things simple, however I think it would be helpful to be able to keep track of any changes made to a model (where soliciting and providing a platform for code submissions rather than results submissions would be one potential way of ensuring this).

elray1 commented 1 year ago

Re: team designations. I think there was a need for this in the US COVID-19 Forecast Hub (we had a team submitting ~7 variations on the same model for a while, and we wanted them to pick one), but this also seems potentially specific to that situation, and other hubs might want to handle it differently. So it makes sense to me to keep the checks that are done by default fairly limited, and then allow hubs to add to that if they want. So the proposal is that our tools will not say what a hub has to collect in their model metadata files; we will just check that the metadata file exists and matches whatever is in the hub's model-metadata-schema.json.
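To make the proposal concrete, the sketch below checks parsed metadata against a hub-supplied schema. It is a Python illustration under stated assumptions, not the hubValidations implementation: `check_against_schema()` is a hypothetical name, and it handles only a small subset of JSON Schema (`required` and simple string-valued `type` keywords). A real implementation would more likely delegate to a full JSON Schema validator.

```python
# Mapping from JSON Schema type names to Python types (subset only).
JSON_TYPES = {
    "string": str,
    "boolean": bool,
    "number": (int, float),
    "integer": int,
    "array": list,
    "object": dict,
}


def check_against_schema(metadata, schema):
    """Return a list of error messages; an empty list means no problems found."""
    errors = []
    for field in schema.get("required", []):
        if field not in metadata:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        type_name = spec.get("type")
        # Skip absent fields and any type declarations this sketch does not handle.
        if field not in metadata or not isinstance(type_name, str):
            continue
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue
        value = metadata[field]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and type_name in ("number", "integer")
        if bool_as_number or not isinstance(value, expected):
            errors.append(f"{field}: expected {type_name}")
    return errors
```

For example, with a schema that requires `include_ensemble` and declares `include_viz` as boolean, the metadata `{"include_viz": "yes"}` produces two errors: one for the missing field and one for the wrong type. The tool itself stays agnostic about which fields a hub collects, since everything it checks comes from the hub's schema.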

And with that in place, we could allow hubs to do whatever they want in their metadata files to track changes to model methods (for example, just relying on GitHub file version history, or adding some metadata structure that allows for per-round model details).

elray1 commented 1 year ago

Related thread here thinking about validating minimal required fields in the model-metadata-schema.json file that a hub sets up.