Define data types for sample `value` minimum and maximum properties

annakrystalli commented 1 year ago

Working and v1.0.0 I noticed that the schema for sample value minimum and maximum properties only has a description with no type or other specification.

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/c73b02de6dc44af4251c90a239e47505e7a40f91/v1.0.0/tasks-schema.json#L946-L966

Could I get some clarity on what the value column for sample output types will hold? Is it target dependant?

nickreich commented 1 year ago

I'm going to answer your question with ones of my own.

Doesn't the enum array specify that the types will be either double or integer?

I guess in practice, I could envision for a categorical target there being "samples" that would be the character strings of one of the category labels. But it seems like here the vision has been that samples would only be numeric and the minimum and maximum would specify the range of allowable values.

More broadly, I wonder if there shouldn't be some fields in the target_metadata that specifies something like the allowable range of the variable, which might depend on the target type? E.g. it seems that this would be the place to specify a numeric range for a discrete or continuous target, or allowable categories for a categorical target?

nickreich commented 1 year ago

Leaving related documentation about target types and output types from Zoltar documentation: https://docs.zoltardata.com/targets/#valid-prediction-types-by-target-type

nickreich commented 1 year ago

A few more thoughts on this.

We currently have three main blocks in the tasks.json file.
- task_ids
- output_type
- target_metadata
- There is a kind of implicit link between the contents of output_type and target_metadata that I think we've talked about before, maybe here. The idea is that certain output_types are valid for some types of targets but invalid for others.
- Let's say you have a setting where you have two targets, for the model_tasks > target_metadata > target_type field one target is "nominal" and one is "continuous". Then, if you want them defined in the same task_id block they have to have the same output_types which might be constraining. E.g. you could maybe have a "mode" output type for both of them and maybe a "sample" (although I note that sample does not currently accept "character" data types), but other output_types, e.g. "mean" or "cdf" would be meaningless for the "nominal" target. So in a setting where you wanted or needed different output types for two different targets you'd need to make separate task_ids for them. I don't think this is necessarily a bad thing, but just wanted to say it out loud.
- But I guess another potential awkwardness of this is how you need to define parameters of output types in ways that apply across all targets in the given task_id block. E.g. let's say I have two continuous targets but they have different natural ranges, then those targets would have to be in separate task_id blocks, since their range is defined in the output_type block which doesn't have a way of differentiating between which target the range is applied to. I think this is fine, but it does mean that there is potentially quite a lot of duplicated data in the task_id chunk in this setting, as opposed to if those output type parameters were defined for each target within, say, target_metadata.
- Bottom line I think is that the current system works but it would maybe be more natural to have output_type be a property of a target (similar to how it is defined in Zoltar). This might mean repeating the same output_type info multiple times if all the targets in a task_id block have the same output_types. But it removes also means that different target_types could be contained within one task_id block.

annakrystalli commented 1 year ago

I think this is worth a chat at the next meeting as suggested on slack. What I'd particularly like to better understand (and document in hubDocs) is when one should consider introducing a new modeling task item.

On another note, the zoltar documentation has a lot of great content on targets that could be a great source for some more detailed documentation on targets in hubDocs.

annakrystalli commented 1 year ago

Regarding the original question, if we need flexibility for this particular argument, perhaps we could just add the additionalProperties: true keyword in the schema and provide instructions in the docs for how to encode checks for samples of different output_types in tasks.json files.

hubverse-org / schemas

Define data types for sample `value` minimum and maximum properties #37