hubverse-org / hubValidations

Testing framework for hubverse hub validations
https://hubverse-org.github.io/hubValidations/

Handle `round_id`s other than dates? #81

Open annakrystalli opened 2 months ago

annakrystalli commented 2 months ago

Background

Since the beginning of the project we have discussed, and in general planned for, supporting round_ids other than dates. So far this has not been necessary, and our validations have assumed that round_ids will be dates, largely because all known hubs do use dates as round IDs. Some discussion of this topic during development of the validation framework can be found here: https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations/discussions/13

However, the push to convert historical hubs to hubverse-style hubs has resurfaced the question of supporting non-date round IDs. We need to assess the implementation implications and weigh them against the benefits of supporting this feature.

Implications of using non-dates as round ids

Use of a non-date round id has some important implications, most importantly on how submission windows are configured:

Work Required

If we do choose to go ahead and support non-date round IDs, the main work would be in modifying hubValidations::parse_file_name() to recognise and match non-date round IDs.

If we decide we will not support non-date round IDs, we need to update hubDocs to reflect that.
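As a rough illustration of the parsing change described above, here is a minimal sketch in Python. It is not the R implementation of hubValidations::parse_file_name(). It assumes file names take the form `<round_id>-<team_abbr>-<model_abbr>.<ext>`, that team and model abbreviations contain only alphanumerics and `_`, and that a non-date round ID is restricted to the same characters. The `ParsedFileName` type and the regular expression are hypothetical.

```python
import re
from typing import NamedTuple


class ParsedFileName(NamedTuple):
    round_id: str
    team_abbr: str
    model_abbr: str
    ext: str


# Assumption: abbreviations and non-date round IDs use only alphanumerics and "_".
TOKEN = r"[A-Za-z0-9_]+"
FILE_NAME_RE = re.compile(
    rf"^(?P<round_id>\d{{4}}-\d{{2}}-\d{{2}}|{TOKEN})"  # ISO date or plain token
    rf"-(?P<team_abbr>{TOKEN})-(?P<model_abbr>{TOKEN})$"
)


def parse_file_name(file_path: str) -> ParsedFileName:
    """Split a model output file name into round ID, model ID parts and extension."""
    base = file_path.rsplit("/", 1)[-1]   # drop any directory components
    stem, _, ext = base.partition(".")    # keep compound extensions such as gz.parquet
    match = FILE_NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"Cannot parse round ID and model ID from {base!r}")
    return ParsedFileName(ext=ext, **match.groupdict())


parsed = parse_file_name("model-output/team2-modelb/2024-05-15-team2-modelb.gz.parquet")
print(parsed.round_id, parsed.team_abbr, parsed.model_abbr, parsed.ext)
```

Under these assumptions, a round ID that contains a hyphen but is not a date (for example `round-1`) cannot be separated unambiguously from the model ID, so the sketch rejects it rather than guessing.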

nickreich commented 2 months ago

I would be in favor of not adding support for non-date round-ids for now, and only supporting round-ids that are in the format of dates. Are there clear use cases where supporting non-date round-ids would be useful?

annakrystalli commented 2 months ago

@LucieContamin wrote in https://github.com/orgs/Infectious-Disease-Modeling-Hubs/discussions/7#discussioncomment-9236827

I am not sure I totally understand the issue here, sorry. But, for SMH, we mainly use origin_date as round_id. However, we have some rounds where the round_id is not the origin_date and is only used in the filename, to be able to tag which file corresponds to which round. In this case, the format of round_id does not matter a lot. We still use a YYYY-MM-DD format to follow the same "style" as the other rounds. Does that answer your question? Or help?

Could you share an example of what such round_ids look like? It does matter what they contain, because we need to be able to consistently parse round_id and model_id from filenames. How easy or hard that is depends on whether we follow certain conventions in how we specify round_ids (if they are not dates).

Additionally, in the rounds where round_ids are not the origin date, what value does origin date contain in the files?

Would be super curious to see an example of both the tasks.json and some files (including filenames) of what you describe!

LucieContamin commented 2 months ago

The round_id we are using is still in the ISO Date format: "YYYY-MM-DD", for example:

      "round_id": "2024-05-15",
      "round_id_from_variable": false,
      "model_tasks": [
        {
          "task_ids": {
            "origin_date": {
              "required": ["2020-11-15"],
              "optional": null
            }, ....

So, the filename follows the "usual" format, for example: model-output/team2-modelb/2024-05-15-team2-modelb.gz.parquet.

I am happy to provide more information and examples, if necessary. I can also give you the link to the repository containing these rounds: https://github.com/midas-network/covid19-smh-research

annakrystalli commented 2 months ago

Thank you @LucieContamin !

OK so it still is a date so still not an example of a non date round_id! 😜

Out of curiosity, what made you configure some rounds one way and some the other?

LucieContamin commented 2 months ago

Ah yes, still a date, but as I use it only for tracking files, it could have been anything I guess. It's not used for anything else.

We decided to configure it like this because we have two rounds with the same origin_date, so we needed to use something else for round_id.

annakrystalli commented 2 months ago

Very useful context, thanks. I guess if we were to support non-date round IDs, I believe our current systems would work so long as the round IDs contain only alphanumerics and _ (see deep dive here).
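To make that convention concrete, here is a small Python sketch of a round ID check. It is an illustration only, not part of hubValidations: the function name `classify_round_id` and the three labels are hypothetical, and the rule it encodes (an ISO date, or otherwise only alphanumerics and `_`) is just the convention suggested above.

```python
import re
from datetime import date

ISO_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
NON_DATE_RE = re.compile(r"[A-Za-z0-9_]+")  # assumed convention for non-date round IDs


def classify_round_id(round_id: str) -> str:
    """Return "date", "non-date" or "unsupported" for a candidate round ID."""
    if ISO_DATE_RE.fullmatch(round_id):
        try:
            date.fromisoformat(round_id)  # rejects impossible dates such as 2024-13-45
        except ValueError:
            return "unsupported"
        return "date"
    if NON_DATE_RE.fullmatch(round_id):
        return "non-date"
    return "unsupported"


for candidate in ["2024-05-15", "round_1", "round-1"]:
    print(candidate, classify_round_id(candidate))
```

A hyphenated non-date ID such as `round-1` is flagged as unsupported here because the hyphen is also the separator between round ID and model ID in file names.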

And you still have origin_date in your files so you have dates to match to target data and plot. It's when that date information is not included that issues can arise.