Unclear where horizon units are specified

annakrystalli commented 1 year ago

In the Docs, Horizon is defined as such:

Horizons define the difference between the target_date and the origin_date in time units specified by the hub (e.g., may be days, weeks, or months)

However, I'm unclear where the units for the horizon are being specified in the metadata.

nickreich commented 1 year ago

Horizon time units need to be specified at the level of a specific target, not a hub. E.g., for the US COVID-19 Forecast Hub we have day-ahead hospitalization forecasts and week-ahead forecasts for everything else. 🤯

Maybe this suggests that we need an additional set of target-specific info somewhere in the JSON? I can't quite wrap my head around where it should live in the JSON though.

elray1 commented 1 year ago

A question I have is whether there is any purpose for which we really need this recorded formally as metadata. The only use case I thought of immediately was calculating the target end date for some generic viz tool, but that does not feel high priority. Maybe there are others? Note that we have currently written that Hubs are responsible for providing a function that maps truth data into a data frame with one observed value per task id combination, to facilitate scoring. That is the other place where I could see this cropping up, so it's not clear to me that we need a generic answer with this information recorded in the metadata.
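To make the viz use case concrete, here is a minimal sketch of the calculation a generic tool would need. It assumes the units are available as one of the strings "day", "week", or "month" and that the target date is simply the origin date shifted by the horizon. The function name `target_end_date` and the month-end clamping rule are illustrative assumptions, not anything defined by the hub schema; a real hub may define week or month targets differently (e.g., the Saturday ending a week).

```python
import calendar
from datetime import date, timedelta


def target_end_date(origin_date: date, horizon: int, time_units: str) -> date:
    """Shift origin_date by `horizon` steps of `time_units` (hypothetical helper)."""
    if time_units == "day":
        return origin_date + timedelta(days=horizon)
    if time_units == "week":
        return origin_date + timedelta(weeks=horizon)
    if time_units == "month":
        # Count months since year 0, shift by the horizon, then split back
        # into (year, zero-based month). divmod floors, so negative horizons work.
        months = origin_date.year * 12 + (origin_date.month - 1) + horizon
        year, month0 = divmod(months, 12)
        # Assumption: clamp to the last day of a shorter month (Jan 31 + 1 month -> Feb 28).
        last_day = calendar.monthrange(year, month0 + 1)[1]
        return date(year, month0 + 1, min(origin_date.day, last_day))
    raise ValueError(f"unsupported time_units: {time_units!r}")


# Same horizon value, different units, different target dates.
print(target_end_date(date(2023, 1, 2), 2, "day"))
print(target_end_date(date(2023, 1, 2), 2, "week"))
```

Without the units, the horizon value 2 above is ambiguous, which is the only reason the tool would need them recorded somewhere.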

The thinking below assumes that we decide we do want to record this information in the JSON metadata files.

The first thing I thought of is something I think we should not do: I think we should not track the horizon units as a property of the horizon specification. e.g., in the complex scenario hub example here, I think we should not add a units field to obtain a structure like

            "horizon": {
              "required": {"ref": "#/$defs/task_ids/horizon_12"},
              "optional": {"ref": "#/$defs/task_ids/horizon_1326"},
              "units": "days"
            }

The reason this doesn't work is that a hub may decide to encode their dates using the pair of fields (origin_date, target_end_date), and then we'd lose the ability to track the time units involved.

So, what if we instead introduce one or two new properties alongside "task_ids" and "output_type", on a per-task-group basis? The proposed new entries are (to be refined and maybe renamed...):

          "is_single_time_unit_forecast": boolean indicating whether the forecast tasks defined in this group are for the value of a target variable at a single time point (e.g., day, week, or month)
          "time_units": string specifying time units

The purpose of is_single_time_unit_forecast is to determine whether the forecast tasks defined in this group correspond to what we have sometimes referred to as "short term" forecasts, e.g. of daily or weekly incidence, or rather some kind of summary across multiple time points, e.g., season peak incidence or timing.

In an example where a Hub was collecting forecasts of variables at different time scales, they would have to specify two different task id groups, e.g. one group for weekly cases and deaths with "time_units": "week" and a second group for daily hospitalizations with "time_units": "day".
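As a rough sketch of how a consumer might read that layout, the snippet below models two task groups as plain Python dicts and looks up the units for a given target. Everything here is hypothetical: the target names, the horizon values, the `time_units_for` helper, and the exact placement of the two proposed fields are assumptions for illustration, not the actual schema.

```python
# Two hypothetical task groups, each carrying the proposed per-group fields.
task_groups = [
    {
        "task_ids": {"target": ["inc case", "inc death"], "horizon": [1, 2, 3, 4]},
        "is_single_time_unit_forecast": True,
        "time_units": "week",
    },
    {
        "task_ids": {"target": ["inc hosp"], "horizon": list(range(1, 29))},
        "is_single_time_unit_forecast": True,
        "time_units": "day",
    },
]


def time_units_for(target: str, groups: list) -> str:
    """Return the time units of the single-time-unit group(s) containing `target`."""
    units = {
        g["time_units"]
        for g in groups
        if g.get("is_single_time_unit_forecast")
        and target in g["task_ids"]["target"]
    }
    if len(units) != 1:
        # Either the target is not listed, or groups disagree about its units.
        raise LookupError(f"cannot resolve time units for target {target!r}")
    return units.pop()


print(time_units_for("inc hosp", task_groups))
print(time_units_for("inc death", task_groups))
```

Because the units hang off the task group rather than the horizon field, the same lookup would still work for a hub that encodes dates as (origin_date, target_end_date) and has no horizon at all.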

elray1 commented 1 year ago

Cross-referencing this other issue where Nick proposes a broader set of things to keep track of.

hubverse-org / schemas

Unclear where horizon units are specified #8