hubverse-org / hubData

Tools for accessing and working with hubverse Hub data
https://hubverse-org.github.io/hubData/

Schema per round? #16

Closed LucieContamin closed 3 months ago

LucieContamin commented 6 months ago

The hubData::create_hub_schema() function creates an arrow schema object containing the information from all the rounds in the hub, which is perfect when we want to connect to all the rounds in the hub.

However, it might be useful to have a round-specific schema, especially when the set of columns differs between rounds.

For example, if we have 2 rounds:

{
  "schema_version": "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/v2.0.0/tasks-schema.json",
  "rounds": [
    {
      "round_id": "2024-03-26",
      "round_id_from_variable": false,
      "model_tasks": [
        {
          "task_ids": {
            "origin_date": {
              "required": ["2020-11-15"],
              "optional": null
            },
            "scenario_id": {
              "required": ["A-2020-05-01"],
              "optional": null
            },
            "location": {
              "required": ["06"],
              "optional": null
            },
            "race_ethnicity": {
              "required": ["latino", "asian", "black", "white", "other"],
              "optional": null
            },
            "target": {
              "required": ["inc death"],
              "optional": null
            },
            "horizon": {
              "required": [1, 2, 3, 4],
              "optional": null
            }
          },
          "output_type": {
            "quantile": {
              "output_type_id": {
                "required": [0.25, 0.5, 0.75],
                "optional": null
              },
              "value": {
                "type": "double",
                "minimum": 0
              }
            }
          },
          "target_metadata":[
            {...}
          ]
        }
      ],
      "submissions_due": {
        "start": "2024-05-26",
        "end": "2024-06-26"
      }
    },
    {
      "round_id": "origin_date",
      "round_id_from_variable": true,
      "model_tasks": [
        {
          "task_ids": {
            "origin_date": {
              "required": ["2024-04-28"],
              "optional": null
            },
            "scenario_id": {
              "required": ["A-2024-03-01"],
              "optional": null
            },
            "location": {
              "required": null,
              "optional": ["US"]
            },
            "target": {
              "required": null,
              "optional": ["peak time hosp"]
            },
            "horizon": {
              "required": null,
              "optional": null
            },
            "age_group": {
              "required": ["0-130"],
              "optional": null
            }
          },
          "output_type": {
            "cdf": {
              "output_type_id": {
                "required": ["EW202418", "EW202419", "EW202420", "EW202421"],
                "optional": null
              },
              "value": {
                "type": "double",
                "minimum": 0,
                "maximum": 1
              }
            }
          },
          "target_metadata": [
            {...}
          ]
        }
      ],
      "submissions_due": {
        "start": "2023-04-17",
        "end": "2023-05-17"
      },
      "partition": ["origin_date", "target"]
    }
  ]
}

The schema will return all the possible columns. If we are, for example, interested only in round 2 and don't want an additional empty column, we need to know that race_ethnicity has to be removed. And if we just call a function to remove all the empty columns, we will also lose horizon, which might not be wanted.
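To make the difference concrete, here is a minimal Python sketch (not hubData code, and not R) that compares the task ID columns of the two rounds, using a trimmed copy of the config above. The helper name `round_task_ids` is hypothetical, and the standard output columns (output_type, output_type_id, value) are left out.

```python
# Trimmed copy of the two rounds above: only task ID names and values are kept.
config = {
    "rounds": [
        {
            "round_id": "2024-03-26",
            "model_tasks": [{"task_ids": {
                "origin_date": {"required": ["2020-11-15"], "optional": None},
                "scenario_id": {"required": ["A-2020-05-01"], "optional": None},
                "location": {"required": ["06"], "optional": None},
                "race_ethnicity": {
                    "required": ["latino", "asian", "black", "white", "other"],
                    "optional": None,
                },
                "target": {"required": ["inc death"], "optional": None},
                "horizon": {"required": [1, 2, 3, 4], "optional": None},
            }}],
        },
        {
            "round_id": "origin_date",
            "model_tasks": [{"task_ids": {
                "origin_date": {"required": ["2024-04-28"], "optional": None},
                "scenario_id": {"required": ["A-2024-03-01"], "optional": None},
                "location": {"required": None, "optional": ["US"]},
                "target": {"required": None, "optional": ["peak time hosp"]},
                "horizon": {"required": None, "optional": None},
                "age_group": {"required": ["0-130"], "optional": None},
            }}],
        },
    ]
}


def round_task_ids(round_config):
    """Hypothetical helper: task ID names declared in one round, in first-seen order."""
    names = []
    for task in round_config["model_tasks"]:
        for name in task["task_ids"]:
            if name not in names:
                names.append(name)
    return names


round_1 = round_task_ids(config["rounds"][0])
round_2 = round_task_ids(config["rounds"][1])

# Columns that round 1 contributes to a hub-wide schema but round 2 never declares.
extra_for_round_2 = [name for name in round_1 if name not in round_2]

# Columns round 2 declares but leaves without values (required and optional both null).
declared_but_empty = [
    name
    for name, values in config["rounds"][1]["model_tasks"][0]["task_ids"].items()
    if not values["required"] and not values["optional"]
]

print(extra_for_round_2)   # ['race_ethnicity']
print(declared_but_empty)  # ['horizon']
```

Dropping empty columns after the fact cannot tell these two cases apart, whereas the round's own config can.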

So I wonder if it makes sense to have either an additional function or a parameter to generate the schema for only the round of interest.
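One possible shape for the round lookup behind such a parameter, sketched in Python and independent of hubData: the hypothetical helpers `round_ids` and `find_round` pick out the round of interest before any schema is built. The sketch assumes that when round_id_from_variable is true, round_id names a task ID whose required and optional values are the actual round IDs.

```python
# Trimmed, illustrative config: one round with a fixed ID, one with IDs taken from a task ID.
config = {
    "rounds": [
        {
            "round_id": "2024-03-26",
            "round_id_from_variable": False,
            "model_tasks": [{"task_ids": {
                "origin_date": {"required": ["2020-11-15"], "optional": None},
                "race_ethnicity": {"required": ["latino", "asian"], "optional": None},
            }}],
        },
        {
            "round_id": "origin_date",
            "round_id_from_variable": True,
            "model_tasks": [{"task_ids": {
                "origin_date": {"required": ["2024-04-28"], "optional": None},
                "age_group": {"required": ["0-130"], "optional": None},
            }}],
        },
    ]
}


def round_ids(round_config):
    """Hypothetical helper: the set of round IDs that one round entry covers."""
    if not round_config.get("round_id_from_variable", False):
        return {round_config["round_id"]}
    # Assumption: round_id names the task ID that holds the actual round IDs.
    variable = round_config["round_id"]
    ids = set()
    for task in round_config["model_tasks"]:
        values = task["task_ids"][variable]
        ids.update(values.get("required") or [])
        ids.update(values.get("optional") or [])
    return ids


def find_round(config, round_id):
    """Hypothetical helper: return the round entry matching round_id, or raise."""
    for round_config in config["rounds"]:
        if round_id in round_ids(round_config):
            return round_config
    raise ValueError(f"round_id {round_id!r} not found in config")


selected = find_round(config, "2024-04-28")
columns = list(selected["model_tasks"][0]["task_ids"])
print(columns)  # ['origin_date', 'age_group']
```

A round-specific schema would then be built from `selected` alone instead of from every entry in "rounds".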

nickreich commented 6 months ago

I wonder if this is more detailed/complex functionality than we want to support right now. Supporting round-specific columns feels like it opens up a lot of complexity and may stray from the core functionality, for which we still need to build out basic support.

LucieContamin commented 6 months ago

Yes, I agree. I hesitated to open this issue, but I thought it might be good to have someone else's point of view on that question.