hubverse-org / schemas

JSON schemas for modeling hubs
Creative Commons Zero v1.0 Universal

Null point estimate output type IDs in configs instead of `NA`s #109

Closed: annakrystalli closed this 2 days ago

annakrystalli commented 1 month ago

@zkamvar posted the following note on #103

Something that was brought up in response to https://github.com/reichlab/variant-nowcast-hub/pull/117#issuecomment-2423170370 is that the `"NA"` is a bit confusing: it sure looks like a character string, but when we expand the grid, the `output_type_id` columns become `NA` (an intentional choice by Ooms, described in section 2.1.1 of the jsonlite package paper).

Now that we are using `is_required` for point estimate types, we might be able to take this opportunity to set the `required` property to a single-element array containing `null`. This would have exactly the same result as the `["NA"]` array, but with the following advantages:

  1. inter-language compatibility: `null` is a concept that JSON itself understands
  2. it clearly communicates that this should be a missing value in the data, as opposed to the character string `"NA"`.
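
The second advantage can be seen directly from any JSON parser: `null` deserializes to each language's native missing value, while `"NA"` is just a two-character string. A quick illustrative sketch in Python (not part of the hub tooling):

```python
import json

# null maps to Python's None (and to NA in R via jsonlite);
# "NA" is just a two-character string in every language.
print(json.loads("[null]"))   # [None]
print(json.loads('["NA"]'))   # ['NA']
```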

This is what I think it would look like in the schema:

"required": {
    "description": "Not relevant for point estimate output types. Must be a single array of null"
    "type": "array"
    "items": {
        "const": null,
        "maxItems": 1
    }
}
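
To make the constraint concrete, here's a small stdlib sketch (the helper name is hypothetical; in practice a JSON Schema validator such as jsonschema or jsonvalidate would do this) mirroring what the proposed subschema accepts:

```python
import json

def is_valid_point_estimate_required(value):
    # Hand-rolled mirror of the proposed subschema:
    # "type": "array", "items": {"const": null}, "maxItems": 1
    return (
        isinstance(value, list)
        and len(value) <= 1
        and all(v is None for v in value)
    )

print(is_valid_point_estimate_required(json.loads("[null]")))   # True
print(is_valid_point_estimate_required(json.loads('["NA"]')))   # False
```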

Demo

Here's a demo showing that `["NA"]` and `[null]` are equivalent once read into R: modify a tasks.json file and read both versions in with jsonlite.

hub_con <- hubData::connect_hub(
  system.file("testhubs/simple", package = "hubUtils")
)
hp <- attributes(hub_con)$hub_path
# Get the tasks file which has a `mean` output type
tasks <- fs::path(hp, "hub-config", "tasks.json")

# copy the file to a new temp file
tmp <- withr::local_tempfile()
fs::file_copy(tasks, tmp, overwrite = TRUE)
# The two files are identical
unname(tools::md5sum(tmp) == tools::md5sum(tasks))
#> [1] TRUE

cfg <- readLines(tmp)
writeLines(cfg[231:240]) # optional is NA
#>                 "output_type": {
#>                     "mean": {
#>                         "output_type_id": {
#>                             "required": null,
#>                             "optional": ["NA"]
#>                         },
#>                         "value": {
#>                             "type": "integer",
#>                             "minimum": 0
#>                         }

# Change '["NA"]' to '[null]'
nullcfg <- sub('["NA"]', '[null]', cfg, fixed = TRUE)
writeLines(nullcfg[231:240]) # optional is now null
#>                 "output_type": {
#>                     "mean": {
#>                         "output_type_id": {
#>                             "required": null,
#>                             "optional": [null]
#>                         },
#>                         "value": {
#>                             "type": "integer",
#>                             "minimum": 0
#>                         }
writeLines(nullcfg, con = tmp) # changing to null

# The two files are no longer identical
unname(tools::md5sum(tmp) == tools::md5sum(tasks))
#> [1] FALSE
waldo::compare(cfg, nullcfg)
#> old[81:87] vs new[81:87]
#>   "                    \"mean\": {"
#>   "                        \"output_type_id\": {"
#>   "                            \"required\": null,"
#> - "                            \"optional\": [\"NA\"]"
#> + "                            \"optional\": [null]"
#>   "                        },"
#>   "                        \"value\": {"
#>   "                            \"type\": \"integer\","
#> 
#> old[232:238] vs new[232:238]
#>   "                    \"mean\": {"
#>   "                        \"output_type_id\": {"
#>   "                            \"required\": null,"
#> - "                            \"optional\": [\"NA\"]"
#> + "                            \"optional\": [null]"
#>   "                        },"
#>   "                        \"value\": {"
#>   "                            \"type\": \"integer\","

# The resulting R objects after reading in the JSON are identical
identical(
  jsonlite::fromJSON(tasks, simplifyDataFrame = FALSE), 
  jsonlite::fromJSON(tmp, simplifyDataFrame = FALSE)
)
#> [1] TRUE

Created on 2024-10-18 with reprex v2.1.1

Originally posted by @zkamvar in https://github.com/hubverse-org/schemas/issues/103#issuecomment-2423356765

annakrystalli commented 1 month ago

I'm totally onboard with this! 💯

The lack of inter-language compatibility of `"NA"` is something that's been brought up in the past (especially by @LucieContamin) and discussed, but somewhat shelved until the bridge needed crossing. I agree this seems like a very good opportunity to cross that bridge.

I think we could even go one step cleaner and, rather than a `[null]` array, just use a `null` property altogether (i.e. `"required": null`). In terms of implementation, `expand_model_out_grid()` already has functionality to convert task ID `optional` and `required` properties that are both `null` (used for task IDs which are relevant to one modeling task but might not be to another in the same round) to `NA`s, so I think it would be straightforward to apply that to output type IDs too.
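
For illustration, the `mean` block from the demo's tasks.json would then read something like this (a sketch of the proposed shape, not a finalized schema change):

```json
"mean": {
    "output_type_id": {
        "required": null
    },
    "value": {
        "type": "integer",
        "minimum": 0
    }
}
```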

Other areas that would require work would be:

Overall, I think it's much cleaner and clearer, and it would make it much easier to communicate and explain the expectations for point estimate output type IDs, so it's worth the effort!

nickreich commented 1 month ago

I basically agree conceptually with the arguments. Could someone explain what a change like this would mean for what a submission file would look like? I.e. what would be included (if anything) in the `output_type_id` column when `output_type == "mean"`?

What would the implications be for existing hubs that might try to migrate to this new schema format? E.g. are we at an early enough stage with the variant hub that we could just make this change and edit a few files and it would be fine? Does FluSight solicit mean forecasts and would they be stuck with schema 3.x?

annakrystalli commented 1 month ago

Nothing should really change for submissions, either current or future. We are still using R to validate, so any missing values used should translate to `NA`s when read into R. This is the case, e.g., for values generated in Python with `pd.NA`.

So from my perspective, if R is interpreting the values as `NA`, there should be no problems with validation or with opening datasets as a whole. I think this will be even more stable with parquet files, as missing values are encoded in the parquet format and are therefore fully interoperable regardless of the framework that generated them. Having said that, it may well be wise to experiment with CSV a bit to ensure that mixing and matching such values from different frameworks doesn't cause problems, especially downstream when trying to open a dataset.
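
To illustrate the CSV caveat (a stdlib Python sketch, not hub tooling): parquet stores missingness explicitly, but in CSV a missing value is just an empty field, so its interpretation depends entirely on the reader:

```python
import csv
import io

# CSV has no native encoding for missing values: a missing
# output_type_id is written as an empty field, and each reader
# decides how to interpret it (e.g. R's readr -> NA, pandas -> NaN).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["output_type", "output_type_id", "value"])
writer.writerow(["mean", None, 3])  # None becomes an empty field

buf.seek(0)
rows = list(csv.reader(buf))
print(rows[1])  # ['mean', '', '3'] -- the missing value is just ""
```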

I've added some suggestions for testing in https://github.com/hubverse-org/hubDocs/issues/198