hubverse-org / schemas

JSON schemas for modeling hubs
Creative Commons Zero v1.0 Universal

Do `required` and `optional` apply to sample output types? #12

Closed annakrystalli closed 1 year ago

annakrystalli commented 1 year ago

Similar to #9 and #10: does it make sense to have `required` and `optional` for sample output types? If so, it would be great to have an example of what a reasonable

    "example": [{
        "required": [],
        "optional": []
    }]

example would be?

elray1 commented 1 year ago

I think it does -- for example, suppose a hub wants to require that submissions include at least 100 samples and up to 1000 samples. Then they might have a specification like this:

          "output_types": {
            "sample": {
              "type_id": {
                "required": [1, 2, 3, ..., 100],
                "optional": [101, 102, 103, ..., 1000]
              },
              "value": {
                "type": "integer",
                "minimum": 0
              } 
            }
          }
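As a rough illustration of the specification above, the two `type_id` ranges can be generated rather than typed out. This is only a sketch: `build_sample_spec` and `submission_ids_ok` are hypothetical helper names, not part of the hubverse schemas or tooling, and the rule that a submission must contain every required id and nothing outside required plus optional is an assumption about how such a config would be interpreted.

```python
def build_sample_spec(n_required=100, n_max=1000):
    """Build the sample output-type block (hypothetical helper, for illustration)."""
    return {
        "output_types": {
            "sample": {
                "type_id": {
                    # Sample indices every submission must include: 1..n_required
                    "required": list(range(1, n_required + 1)),
                    # Additional indices a submission may include: n_required+1..n_max
                    "optional": list(range(n_required + 1, n_max + 1)),
                },
                "value": {"type": "integer", "minimum": 0},
            }
        }
    }


def submission_ids_ok(submitted_ids, type_id):
    """Assumed rule: all required ids present, no ids outside required + optional."""
    submitted = set(submitted_ids)
    required = set(type_id["required"])
    allowed = required | set(type_id["optional"])
    return required <= submitted and submitted <= allowed


spec = build_sample_spec()
type_id = spec["output_types"]["sample"]["type_id"]
print(len(type_id["required"]), len(type_id["optional"]))
print(submission_ids_ok(range(1, 501), type_id))  # a submission with 500 samples
```

Under that assumed rule, a submission with fewer than 100 samples or more than 1000 would be rejected, and anything in between would pass.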

annakrystalli commented 1 year ago

Copying useful context here:

Following up here on one piece from the discussion on the PR, because I thought it was maybe easy to lose track of it there. This is about minimum and maximum values for the sample output type.

  • The idea here is that samples are samples from a distribution. So for example, suppose someone's predictive distribution is Normal(5, 1). Then the samples would be draws from that distribution, e.g., they would have a mean of about 5.
  • I think hubs will want the ability to specify minimum and maximum values for samples (e.g., at least 0)
  • But in some applications, the minimum and maximum values for samples may be negative or greater than 1 since the samples themselves are not probabilities; they are on the scale of the variable that's being modeled
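To make the points above concrete, here is a small sketch that draws samples from Normal(5, 1) and checks them against hub-specified bounds. The function name `within_bounds` and its `minimum`/`maximum` parameters are assumptions made for illustration; they mirror the `minimum` key in the config but are not an existing hubverse API.

```python
import random
import statistics


def within_bounds(samples, minimum=None, maximum=None):
    """Check every sample against optional lower and upper bounds (hypothetical helper)."""
    for x in samples:
        if minimum is not None and x < minimum:
            return False
        if maximum is not None and x > maximum:
            return False
    return True


# Draws from a Normal(5, 1) predictive distribution; seeded for reproducibility.
rng = random.Random(0)
samples = [rng.gauss(5, 1) for _ in range(1000)]

# The sample mean should be close to 5, the mean of the distribution.
print(round(statistics.mean(samples), 2))

# Samples live on the scale of the modelled variable, so values above 1 are
# expected here; a hub would set bounds appropriate to its target.
print(within_bounds(samples, minimum=0))
print(within_bounds([-0.2, 1.0], minimum=0))
```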

Given that `value` will be a draw from the predictive distribution, which, as you state, could be any number depending on the scale of the modelled variable, should the type not be `numeric` or `double` instead of `integer`?

elray1 commented 1 year ago

I think we should allow the hub to specify this, with the possible specifications being numeric, double, or integer. For example, if the target variable is an integer count of hospitalizations, the hub may require that samples be integers (e.g., obtained either directly as draws from an integer-valued distribution like the negative binomial or by discretizing an underlying continuous distribution). But there may of course be other examples where the variable is inherently continuous, such as rates of disease incidence per 100,000 population -- so in different settings, either an integer or double data type may be appropriate.
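A minimal sketch of what a hub-configurable type check might look like, assuming the three type names from this comment. `matches_type` and `discretize` are hypothetical helpers, and treating `numeric` and `double` identically is a simplification made here, not something the schema specifies.

```python
import math


def matches_type(value, value_type):
    """Check one sample value against the hub's configured type (hypothetical helper)."""
    if isinstance(value, bool):  # bool is a subclass of int in Python; reject it
        return False
    if value_type == "integer":
        # Accept ints, and floats that hold a whole number (e.g. 3.0)
        return isinstance(value, int) or (
            isinstance(value, float) and value.is_integer()
        )
    if value_type in ("double", "numeric"):
        return isinstance(value, (int, float)) and math.isfinite(value)
    raise ValueError(f"unknown value type: {value_type}")


def discretize(draws):
    """Turn continuous draws into integers by rounding to the nearest whole number."""
    return [round(x) for x in draws]


continuous_draws = [4.2, 0.7, 5.9]  # e.g. draws from a continuous distribution
print(all(matches_type(v, "integer") for v in continuous_draws))
print(all(matches_type(v, "double") for v in continuous_draws))
print(all(matches_type(v, "integer") for v in discretize(continuous_draws)))
```

Rounding is only one way to discretize; a hub with count targets might equally ask modellers to sample directly from an integer-valued distribution such as the negative binomial.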