Closed annakrystalli closed 2 days ago
I'm totally onboard with this! 💯
The lack of cross-language compatibility of `NA` is something that's been brought up in the past, especially by @LucieContamin. It was discussed but somewhat shelved until the bridge needed crossing. I agree this seems like a very good opportunity to cross that bridge.
I think we could easily go one step cleaner and, rather than a `[null]` array, just go for a `null` property altogether (i.e. `required: null`). In terms of implementation, `expand_model_out_grid()` already has functionality to convert task ID `optional` and `required` properties that are both `null` (used for task IDs which are relevant to one modeling task but might not be to another in the same round) to `NA`s, so I think it would be straightforward to apply that to `output_type_id`s too.
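As a rough illustration of the expansion logic described above, here is a minimal Python sketch. It is not the `expand_model_out_grid()` implementation (which lives in R); the helper name `expand_ids` and the use of `None` as a stand-in for R's `NA` are assumptions made for the example.

```python
from typing import Optional


def expand_ids(required: Optional[list], optional: Optional[list]) -> list:
    """Hypothetical helper: list the values a column may take in the expanded grid.

    `required` and `optional` mirror the two config properties. When both are
    null (None here), the column collapses to a single missing value, which is
    the role NA plays on the R side.
    """
    if required is None and optional is None:
        return [None]
    # Otherwise, concatenate whichever of the two lists is present.
    return list(required or []) + list(optional or [])


# A point estimate output type such as "mean", configured with `required: null`
# (and, for this sketch, no optional values either).
mean_ids = expand_ids(required=None, optional=None)

# A quantile-style output type with explicit ids, for contrast.
quantile_ids = expand_ids(required=[0.25, 0.5, 0.75], optional=None)

print(mean_ids)
print(quantile_ids)
```

The point of the sketch is only that a bare `null` carries enough information to produce the single missing `output_type_id` value, without needing a one-element array in the config.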
Other areas that would require work would be:
Overall, I think it's much cleaner, clearer and would make it much easier to communicate and explain the expectations for point estimate output type ids so worth the effort!
I basically agree conceptually with the arguments. Could someone explain what a change like this would mean for what a submission file would look like? I.e. what would be included (if anything) in the output_type_id column when output_type="mean"?
What would the implications be for existing hubs that might try to migrate to this new schema format? E.g. are we at an early enough stage with the variant hub that we could just make this change and edit a few files and it would be fine? Does FluSight solicit mean forecasts and would they be stuck with schema 3.x?
Nothing should really change for submissions, either current or future. We are still using R to validate, so any missing values used should translate to `NA`s when read into R. This is the case for, e.g., values generated in Python with `pd.NA`.
So from my perspective, if R is interpreting the values as `NA`, there should be no problems with validation or with opening datasets as a whole. I think this will be even more stable with parquet files, as missing values are encoded in the parquet format itself and are therefore fully interoperable regardless of the framework that generated them. Having said that, it may well be wise to experiment with CSV a bit to ensure that mixing and matching such values from different frameworks doesn't cause problems, especially downstream when trying to open a dataset.
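One way to start the CSV experiment suggested above is to check that files written with different missing-value conventions read back identically. The sketch below uses only the Python standard library. The two inline files and the `MISSING_TOKENS` set are assumptions for illustration (R's `write.csv()` writes `NA` by default, while pandas' `to_csv()` writes an empty field by default), and it does not replace testing with the actual validation tooling.

```python
import csv
import io

# Tokens treated as missing in this sketch (an assumption, not a hubverse rule).
MISSING_TOKENS = {"", "NA"}

# Hypothetical submission rows for a point estimate, as two tools might write them.
csv_from_r = "output_type,output_type_id,value\nmean,NA,12.5\n"
csv_from_pandas = "output_type,output_type_id,value\nmean,,12.5\n"


def read_rows(text: str) -> list:
    """Parse CSV text and map every missing-value token to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if cell in MISSING_TOKENS else cell) for key, cell in row.items()}
        for row in reader
    ]


rows_r = read_rows(csv_from_r)
rows_pandas = read_rows(csv_from_pandas)

# Both files should describe the same record once missing values are normalised.
print(rows_r == rows_pandas)
```

A reader that recognises only one of the two tokens would see these files as different, which is the kind of mismatch worth testing for before datasets are combined downstream.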
I've added some suggestions for testing in https://github.com/hubverse-org/hubDocs/issues/198
@zkamvar posted the following note on #103
Something that was brought up in response to https://github.com/reichlab/variant-nowcast-hub/pull/117#issuecomment-2423170370 is that the `"NA"` is a bit confusing because it sure looks like a character, but when we expand the grid the `output_type_id` columns become `NA` (which is an intentional move by Ooms described in section 2.1.1 of the JSONlite package paper).
Now that we are using `is_required` for point estimate types, we might be able to take this opportunity to set the `required` property to a single element `null` array. This will have exactly the same result as the `"NA"` array, but with the following advantages:
`null` is a concept that even JSON can understand
This is what I think it would look like in the schema:
Demo
Here's a demo that shows that `["NA"]` and `[null]` are equivalent by modifying a tasks.json file and reading them in with jsonlite
Created on 2024-10-18 with reprex v2.1.1
Originally posted by @zkamvar in https://github.com/hubverse-org/schemas/issues/103#issuecomment-2423356765
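For readers outside R, a small Python sketch of why `[null]` travels better than `["NA"]`: a generic JSON parser keeps `"NA"` as an ordinary string and only maps `null` to its native missing value. The two config fragments below are simplified assumptions for illustration, not excerpts from a real tasks.json file, and this is not a reproduction of the jsonlite demo referenced in the quoted note.

```python
import json

# Simplified, hypothetical fragments of an output type configuration.
with_na_string = json.loads('{"output_type_id": {"required": ["NA"]}}')
with_null = json.loads('{"output_type_id": {"required": [null]}}')

na_value = with_na_string["output_type_id"]["required"][0]
null_value = with_null["output_type_id"]["required"][0]

# The string "NA" has no special meaning to a generic JSON parser...
print(repr(na_value))
# ...whereas JSON null becomes Python's None.
print(repr(null_value))
```

The two forms only coincide in a parser that gives the string `"NA"` special treatment, which is the behaviour the quoted note attributes to jsonlite.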