hubverse-org / schemas

JSON schemas for modeling hubs
Creative Commons Zero v1.0 Universal

Handle mean and median type_id specifications more efficiently? #9

Closed annakrystalli closed 1 year ago

annakrystalli commented 1 year ago

At the moment, for the mean and median type_id, we ask for both a required and an optional value specification. For type_id, which will eventually be NA in R, we also allow either NA or NULL to be supplied.

Given that the type_id of mean and median must always be NA, and that required and optional have no meaning in the context of either type_id or value, should we simplify the schema structure to:

"mean": {
  "type": "object",
  "description": "the mean of the predictive distribution",
  "properties": {
    "type_id": {
      "description": "Not used for mean output type. Must be NA or null.",
      "type": "array",
      "items": {
        "enum": ["NA"]
      },
      "maxItems": 1
    },
    "value": {
      "type": "object",
      "properties": {
        "type": {
          "type": "string",
          "enum": ["numeric", "double", "integer"]
        },
        "minimum": {
          "type": "integer"
        },
        "maximum": {
          "type": "integer"
        }
      },
      "required": [
        "type"
      ]
    }
  },
  "required": [
    "type_id",
    "value"
  ]
}

Alternatively, we could make type_id optional altogether for mean and median and ignore it by default in R. That would also get around the awkwardness of having to specify "NA" within a one-element array so that R automatically converts it to NA when reading it in.

A final option is to require type_id for mean & median to always be null in tasks.json (again getting rid of the awkward array) and replace it with NA in R once it's read in.
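To make the last two options concrete, here is a minimal sketch of the normalization step on read. It is written in Python purely for illustration, with `None` standing in for R's NA; the function name `normalize_type_id` and the tasks.json fragment are hypothetical and not part of the schema or any hub tooling.

```python
import json


def normalize_type_id(raw):
    """Collapse the spellings discussed above into one missing marker.

    Accepts null (None after parsing), the string "NA", or the
    one-element array ["NA"]; anything else is rejected.
    """
    if raw is None or raw == "NA" or raw == ["NA"]:
        return None  # stand-in for NA in R
    raise ValueError(f"unexpected type_id for mean/median: {raw!r}")


# Hypothetical tasks.json fragment using the null spelling.
config = json.loads(
    '{"mean": {"type_id": null, "value": {"type": "double", "minimum": 0}}}'
)

# .get() also covers the option where type_id is left out altogether.
mean_type_id = normalize_type_id(config["mean"].get("type_id"))
```

With a step like this, the file the user writes can stay simple, and the awkward one-element array is only one of several accepted spellings rather than a requirement.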

nickreich commented 1 year ago

I don't feel like I know enough about JSON conventions to have strong feelings about this one way or another. I guess my strongest feeling about it all is that we should strive to make the tasks.json file that users have to write as simple as possible, even if it means writing some additional custom code to check/add NAs that aren't present in the actual file. But this is not a strongly held conviction.

elray1 commented 1 year ago

Under this proposal, is there a way for a hub to specify that a mean prediction is accepted, but not required? For example, with the current setup (which I agree is awkward):

I agree that this is both verbose and kind of confusing, so I'm potentially open to defining another way of setting this up -- but (a) I do think we need the ability to specify whether each output type is required or optional, and (b) although I'm not super happy with the current solution, it does at least have the merit of consistency with the other types...

annakrystalli commented 1 year ago

Personally, your suggestion, @elray1, feels a bit inconsistent, as we've been using required and optional to determine the valid values of properties rather than whether the property itself is required. Using "NA" and null to distinguish optional from required also feels a bit confusing.

Perhaps the JSON Schema approach of including a "required" property containing an array of the property names that are required might be cleaner?
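As a rough illustration of that idea, the sketch below keeps the output types under "properties" and names the mandatory ones in a sibling "required" array, which is how JSON Schema itself marks required properties. The layout, the example values, and the helper `missing_required` are assumptions made for this sketch, not the hub schema as it exists.

```python
# Hypothetical layout: every accepted output type is a property, and a
# sibling "required" array names the ones a submission must include.
output_type = {
    "properties": {
        "mean": {"value": {"type": "double"}},
        "median": {"value": {"type": "double"}},
        "quantile": {"value": {"type": "double"}},
    },
    "required": ["quantile"],
}


def missing_required(spec, submitted):
    """Return the required output types absent from a submission, sorted."""
    return sorted(set(spec.get("required", [])) - set(submitted))


# mean is accepted but not required; quantile must be present.
print(missing_required(output_type, ["mean"]))
print(missing_required(output_type, ["mean", "quantile"]))
```

Here a hub can accept a mean prediction without requiring it simply by leaving "mean" out of the "required" array, with no need for "NA" or null to carry that meaning.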

Feels like this might need a bit of discussion at our next meeting?

elray1 commented 1 year ago

Further discussion sounds good. I agree that currently, the primary/most obvious effect of required and optional is to specify valid values, and as an unclear byproduct, we also end up specifying whether the property as a whole is required (if it has at least one required value) or optional (if it does not have at least one required value). Separating those things would make it much clearer!
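The byproduct described above can be written down as a small rule. This is a sketch only: it assumes each property carries "required" and "optional" entries that hold either a list of values or null, and the function name `is_required` is hypothetical.

```python
def is_required(spec):
    """A property counts as required if it lists at least one required value."""
    # None (JSON null) and an empty list both mean "no required values".
    return bool(spec.get("required"))


# Assumed shapes for a mean type_id under the current convention.
mean_required = {"required": ["NA"], "optional": None}
mean_optional = {"required": None, "optional": ["NA"]}

print(is_required(mean_required))
print(is_required(mean_optional))
```

Spelling the rule out shows why it is easy to miss: whether the property as a whole is required is never stated directly and has to be inferred from which list its values sit in.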