hubverse-org / schemas

JSON schemas for modeling hubs
Creative Commons Zero v1.0 Universal

Clarify thinking about required and optional model tasks and output types #13

Closed: elray1 closed this issue 1 year ago

elray1 commented 1 year ago

Currently, the required and optional values of output type ids can in effect also specify whether the corresponding output types as a whole are required: namely, a particular output type is required if it has at least one required type_id, and is optional otherwise. This may be confusing. Is there another way? See also the related discussion under issue #9.

Current proposed system

To explain the situation, we consider a series of examples of hubs with varying modeling task specifications.

Example 1:

      "model_tasks": [
        {
          "task_ids": {
            "location": {
              "required": ["a", "b"],
              "optional": ["c", "d"]
            },
            "horizon": {
              "required": [1, 2],
              "optional": [3, 4]
            }
          },
          "output_types": {
            "median": {
              "type_id": {
                "required": ["NA"],
                "optional": null
              },
              "value" : {
                "type": "integer",
                "minimum": 0
              }
            },
            "quantile" : {
              "type_id": {
                "required": [0.25, 0.5, 0.75],
                "optional": [0.1, 0.9]
              },
              "value": {
                "type": "integer",
                "minimum": 0
              }
            }
          }
        }
      ]

For a hub with this specification, a valid submission must include at least the following rows. They are obtained by an expand_grid-style operation over the combinations of required values for the task id variables and the required type_ids within each output type. In this process, you can imagine first concatenating each output_type with the options for its type_id values, so that each pair is treated as a single "unit" when the expand_grid happens, and then splitting the pairs back into two columns. This is necessary to track the nesting of type_id values within the specific output types.

 location horizon output_type type_id value
        a       1      median      NA   ...
        b       1      median      NA   ...
        a       2      median      NA   ...
        b       2      median      NA   ...
        a       1    quantile    0.25   ...
        b       1    quantile    0.25   ...
        a       2    quantile    0.25   ...
        b       2    quantile    0.25   ...
        a       1    quantile     0.5   ...
        b       1    quantile     0.5   ...
        a       2    quantile     0.5   ...
        b       2    quantile     0.5   ...
        a       1    quantile    0.75   ...
        b       1    quantile    0.75   ...
        a       2    quantile    0.75   ...
        b       2    quantile    0.75   ...
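
The expand_grid procedure described above can be sketched in Python. This is only an illustration under stated assumptions, not the hub validation code: the function name `required_rows` is hypothetical, the model task is assumed to be already parsed into a dict (JSON null becomes None), and a null `required` field is treated as an empty list.

```python
from itertools import product


def required_rows(model_task):
    """Return the minimal required rows as (task id values..., output_type, type_id)."""
    # Required values for each task id variable; None (JSON null) is treated as empty.
    task_values = [
        spec["required"] or [] for spec in model_task["task_ids"].values()
    ]
    # Pair each output type with its required type_ids so the pair is treated
    # as a single unit and type_ids stay nested within their own output type.
    units = [
        (output_type, type_id)
        for output_type, spec in model_task["output_types"].items()
        for type_id in (spec["type_id"]["required"] or [])
    ]
    rows = []
    # Cross the task id values with the units, then split each unit
    # back into separate output_type and type_id columns.
    for *task_combo, (output_type, type_id) in product(*task_values, units):
        rows.append((*task_combo, output_type, type_id))
    return rows


# The model task from Example 1 (the "value" constraints are omitted for brevity).
example_1 = {
    "task_ids": {
        "location": {"required": ["a", "b"], "optional": ["c", "d"]},
        "horizon": {"required": [1, 2], "optional": [3, 4]},
    },
    "output_types": {
        "median": {"type_id": {"required": ["NA"], "optional": None}},
        "quantile": {
            "type_id": {"required": [0.25, 0.5, 0.75], "optional": [0.1, 0.9]}
        },
    },
}

rows = required_rows(example_1)
print(len(rows))  # 16
```

The row order differs from the table above, but the set of rows is the same: 2 locations x 2 horizons x (1 median unit + 3 quantile units).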

Example 2

Example 2 is the same as example 1, but it has only one required quantile level:

      "model_tasks": [
        {
          "task_ids": {
            "location": {
              "required": ["a", "b"],
              "optional": ["c", "d"]
            },
            "horizon": {
              "required": [1, 2],
              "optional": [3, 4]
            }
          },
          "output_types": {
            "median": {
              "type_id": {
                "required": ["NA"],
                "optional": null
              },
              "value" : {
                "type": "integer",
                "minimum": 0
              }
            },
            "quantile" : {
              "type_id": {
                "required": [0.5],
                "optional": [0.1, 0.25, 0.75, 0.9]
              },
              "value": {
                "type": "integer",
                "minimum": 0
              }
            }
          }
        }
      ]

For a hub with this specification, a valid submission must include at least the following rows:

 location horizon output_type type_id value
        a       1      median      NA   ...
        b       1      median      NA   ...
        a       2      median      NA   ...
        b       2      median      NA   ...
        a       1    quantile     0.5   ...
        b       1    quantile     0.5   ...
        a       2    quantile     0.5   ...
        b       2    quantile     0.5   ...

Example 3

Example 3 is similar to examples 1 and 2, but now all of the quantile levels are specified as optional.

      "model_tasks": [
        {
          "task_ids": {
            "location": {
              "required": ["a", "b"],
              "optional": ["c", "d"]
            },
            "horizon": {
              "required": [1, 2],
              "optional": [3, 4]
            }
          },
          "output_types": {
            "median": {
              "type_id": {
                "required": ["NA"],
                "optional": null
              },
              "value" : {
                "type": "integer",
                "minimum": 0
              }
            },
            "quantile" : {
              "type_id": {
                "required": null,
                "optional": [0.1, 0.25, 0.5, 0.75, 0.9]
              },
              "value": {
                "type": "integer",
                "minimum": 0
              }
            }
          }
        }
      ]

For a hub with this specification, a valid submission must include at least the following rows:

 location horizon output_type type_id value
        a       1      median      NA   ...
        b       1      median      NA   ...
        a       2      median      NA   ...
        b       2      median      NA   ...

Example 4

Our final example is similar to example 1, but swaps the ["NA"] and null values between the required and optional fields for the median output type:

      "model_tasks": [
        {
          "task_ids": {
            "location": {
              "required": ["a", "b"],
              "optional": ["c", "d"]
            },
            "horizon": {
              "required": [1, 2],
              "optional": [3, 4]
            }
          },
          "output_types": {
            "median": {
              "type_id": {
                "required": null,
                "optional": ["NA"]
              },
              "value" : {
                "type": "integer",
                "minimum": 0
              }
            },
            "quantile" : {
              "type_id": {
                "required": [0.25, 0.5, 0.75],
                "optional": [0.1, 0.9]
              },
              "value": {
                "type": "integer",
                "minimum": 0
              }
            }
          }
        }
      ]

For a hub with this specification, a valid submission must include at least the following rows:

 location horizon output_type type_id value
        a       1    quantile    0.25   ...
        b       1    quantile    0.25   ...
        a       2    quantile    0.25   ...
        b       2    quantile    0.25   ...
        a       1    quantile     0.5   ...
        b       1    quantile     0.5   ...
        a       2    quantile     0.5   ...
        b       2    quantile     0.5   ...
        a       1    quantile    0.75   ...
        b       1    quantile    0.75   ...
        a       2    quantile    0.75   ...
        b       2    quantile    0.75   ...

Summary and question for discussion

Summary: Under the current system, the rows that a submission must minimally contain are obtained by applying an expand_grid type of action to the task id variables and the combinations of output types and type ids. This means that if there are no required values under the type_ids for a particular output type, a minimal submission does not need to include any rows with that output type; effectively, that output type is optional. In other words: in this setup, a particular output type is required only if at least one value is specified as required in the type_ids under that output type. This is illustrated in examples 3 and 4 above.
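
The implicit rule in this summary can be written down as a short Python sketch. The function name `output_type_status` is hypothetical, and the "output_types" object is assumed to be already parsed into a dict, with JSON null as None.

```python
def output_type_status(output_types):
    """Classify each output type as implicitly required or optional."""
    return {
        # An output type is required only if at least one type_id is required.
        name: "required" if spec["type_id"]["required"] else "optional"
        for name, spec in output_types.items()
    }


# "output_types" from Example 3 (all quantile levels optional).
example_3 = {
    "median": {"type_id": {"required": ["NA"], "optional": None}},
    "quantile": {
        "type_id": {"required": None, "optional": [0.1, 0.25, 0.5, 0.75, 0.9]}
    },
}

# "output_types" from Example 4 (median "NA" moved to optional).
example_4 = {
    "median": {"type_id": {"required": None, "optional": ["NA"]}},
    "quantile": {
        "type_id": {"required": [0.25, 0.5, 0.75], "optional": [0.1, 0.9]}
    },
}

print(output_type_status(example_3))  # {'median': 'required', 'quantile': 'optional'}
print(output_type_status(example_4))  # {'median': 'optional', 'quantile': 'required'}
```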

Every time this has come up, this use of required/optional values of a type_id to implicitly set the status of an output type has been non-intuitive. How can we resolve this? Four ideas:

  1. Change the representation of the output column so that it has the required and optional properties similar to the other columns. We would then perhaps check that the names of any additional properties currently under "output_types" match the values that were specified as required or optional for the output column. We would need to think through and document how this interacts with the "implicit requirement" for output types that comes out of the current procedure as illustrated above.
  2. Some other higher level field indicating which output types are required and optional. We would need to think through and document how this interacts with the "implicit requirement" for output types that comes out of the current procedure as illustrated above.
  3. Somehow more directly specify the concatenated/nested values of columns output_type and type_id (and any restrictions on value) as being required or optional.
  4. Lots of documentation.

annakrystalli commented 1 year ago

DECISION: Option 4 🎉

annakrystalli commented 1 year ago

I've been working on implementing some of the decisions we made within the schema, in particular with respect to type_id for the mean and median output types. I've taken the opportunity to include more detail in the description, as required and optional are a slightly awkward concept in the context of the mean and median type_id. See what you think!

I've created two branches:

Let me know which implementation you prefer.

elray1 commented 1 year ago

I don't have very strong feelings about this -- but maybe a weak preference for the first option because it's more consistent with the other specifications, and doesn't use the oneOf construction, which I guess maybe fewer people would be familiar with? But I would be happy to accept your recommendation for what you prefer, if you like the other one.

nickreich commented 1 year ago

I have the same response as Evan.

annakrystalli commented 1 year ago

Thanks both! So the reason I prefer the oneOf specification is that it can check during schema validation that the combined values of required and optional are valid, so we don't end up in an ambiguous situation where, for example, both have been set to ['NA'] or both to null.

Having said that, if both are ['NA'], required could take precedence. It's just not very clean.

Now that I think of it though, is there a situation where in a given set of task_ids or rounds, one of the output types needs to be specified because it has been included in another round but should not be submitted, in which case null & null should be set for required and optional?
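
As a rough illustration of the check discussed in this comment, here is a plain Python sketch of what such a constraint would accept for the mean and median type_id. It is not the schema itself and does not use oneOf: the function name `check_point_type_id` and the `allow_both_null` flag are hypothetical, and whether the null & null case should be allowed is exactly the open question above.

```python
def check_point_type_id(type_id, allow_both_null=False):
    """Return True if required/optional form an unambiguous pair for mean or median."""
    pair = (type_id["required"], type_id["optional"])
    # Exactly one of the two fields holds ["NA"]; the other is null (None).
    valid_pairs = [(["NA"], None), (None, ["NA"])]
    if allow_both_null:
        # Hypothetical extra case: the output type is listed but must not be submitted.
        valid_pairs.append((None, None))
    return pair in valid_pairs


print(check_point_type_id({"required": ["NA"], "optional": None}))   # True
print(check_point_type_id({"required": ["NA"], "optional": ["NA"]}))  # False
print(check_point_type_id({"required": None, "optional": None}))      # False
```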

elray1 commented 1 year ago

Thanks -- that makes sense.

Is this situation of possibly-repeated values across the optional and required fields a more general thing that we need to address in either case? e.g., what if someone specifies "US" as both an optional location and a required location? We might want to either check for that as part of validating the tasks.json file, or document that in that case, the field will be required in practice.
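
A check for repeated values could look something like the following Python sketch. The helper name `overlapping_values` is hypothetical, and it assumes a single required/optional object already parsed from tasks.json, with JSON null as None.

```python
def overlapping_values(spec):
    """Return the values listed under both "required" and "optional"."""
    required = spec.get("required") or []
    optional = spec.get("optional") or []
    # Preserve the order in which the values appear under "required".
    return [value for value in required if value in optional]


# "US" is listed as both a required and an optional location.
print(overlapping_values({"required": ["US"], "optional": ["US", "CA"]}))  # ['US']
# The location spec from Example 1 has no repeated values.
print(overlapping_values({"required": ["a", "b"], "optional": ["c", "d"]}))  # []
```

A validator could either reject any non-empty result or treat those values as required, matching the two alternatives above.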

For your last point -- in that case, I think that each round (and each task group within that round) only needs to include the output types that are required for that round (or task group). The output type column will still be included, we're just specifying which values of output types are required or optional within that column.