hubverse-org / schemas

JSON schemas for modeling hubs
Creative Commons Zero v1.0 Universal

Null point estimate output type IDs in configs instead of `NA`s #109

Open annakrystalli opened 3 hours ago

annakrystalli commented 3 hours ago

@zkamvar posted the following note on #103

Something that was brought up in response to https://github.com/reichlab/variant-nowcast-hub/pull/117#issuecomment-2423170370 is that the "NA" is a bit confusing: it sure looks like a character string, but when we expand the grid the output_type_id columns become NA (an intentional move by Ooms, described in section 2.1.1 of the jsonlite package paper).
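To make the behaviour described above concrete, here is a minimal sketch, assuming only that jsonlite is installed and that its default vector simplification is used. It parses the two array spellings directly from strings rather than from a hub config file.

```r
library(jsonlite)

# Parse a one-element array holding the string "NA" and one holding JSON null.
from_na_string <- fromJSON('["NA"]')
from_json_null <- fromJSON('[null]')

# Both come back as a missing value, not as the character string "NA".
is.na(from_na_string)
is.na(from_json_null)

# The two parsed objects are indistinguishable on the R side.
identical(from_na_string, from_json_null)
```

This only shows what R sees after parsing; other languages reading the same JSON would see a plain string for "NA" and a native null for null.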

Now that we are using is_required for point estimate types, we might be able to take this opportunity to set the required property to a single element null array. This will have exactly the same result as the "NA" array, but with the following advantages:

  1. inter-language compatibility can be achieved since null is a concept that even JSON can understand
  2. we can clearly communicate that this should be a missing value in the data as opposed to a character string.

This is what I think it would look like in the schema:

"required": {
    "description": "Not relevant for point estimate output types. Must be an array containing a single null.",
    "type": "array",
    "items": {
        "const": null
    },
    "maxItems": 1
}
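A quick way to sanity-check a schema fragment like this is to run candidate values through a validator. The following is a rough sketch, assuming the jsonvalidate package is installed and using its ajv engine, since `const` postdates JSON Schema draft-04. The standalone schema string and the `minItems` keyword are my assumptions for illustration (to force exactly one element); they are not part of the proposal.

```r
library(jsonvalidate)

# Standalone version of the array constraint (illustrative, not the hub schema).
# "minItems" is an added assumption so that an empty array is rejected too.
schema <- '{
  "type": "array",
  "items": { "const": null },
  "minItems": 1,
  "maxItems": 1
}'

# A single-element null array should satisfy the constraint.
null_ok <- json_validate("[null]", schema, engine = "ajv")

# The character string "NA" is not null, so it should be rejected.
na_string_ok <- json_validate('["NA"]', schema, engine = "ajv")

# More than one element should be rejected by maxItems.
two_nulls_ok <- json_validate("[null, null]", schema, engine = "ajv")
```

Under these assumptions, only the first check should pass, which is the distinction between a missing value and a character string that the proposal is after.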

Demo

Here's a demo showing that ["NA"] and [null] are equivalent: it modifies a tasks.json file and reads both versions in with jsonlite.

hub_con <- hubData::connect_hub(
  system.file("testhubs/simple", package = "hubUtils")
)
hp <- attributes(hub_con)$hub_path
# Get the tasks file which has a `mean` output type
tasks <- fs::path(hp, "hub-config", "tasks.json")

# copy the file to a new temp file
tmp <- withr::local_tempfile()
fs::file_copy(tasks, tmp, overwrite = TRUE)
# The two files are identical
unname(tools::md5sum(tmp) == tools::md5sum(tasks))
#> [1] TRUE

cfg <- readLines(tmp)
writeLines(cfg[231:240]) # optional is NA
#>                 "output_type": {
#>                     "mean": {
#>                         "output_type_id": {
#>                             "required": null,
#>                             "optional": ["NA"]
#>                         },
#>                         "value": {
#>                             "type": "integer",
#>                             "minimum": 0
#>                         }

# Change '["NA"]' to '[null]'
nullcfg <- sub('["NA"]', '[null]', cfg, fixed = TRUE)
writeLines(nullcfg[231:240]) # optional is now null
#>                 "output_type": {
#>                     "mean": {
#>                         "output_type_id": {
#>                             "required": null,
#>                             "optional": [null]
#>                         },
#>                         "value": {
#>                             "type": "integer",
#>                             "minimum": 0
#>                         }
writeLines(nullcfg, con = tmp) # changing to null

# The two files are no longer identical
unname(tools::md5sum(tmp) == tools::md5sum(tasks))
#> [1] FALSE
waldo::compare(cfg, nullcfg)
#> old[81:87] vs new[81:87]
#>   "                    \"mean\": {"
#>   "                        \"output_type_id\": {"
#>   "                            \"required\": null,"
#> - "                            \"optional\": [\"NA\"]"
#> + "                            \"optional\": [null]"
#>   "                        },"
#>   "                        \"value\": {"
#>   "                            \"type\": \"integer\","
#> 
#> old[232:238] vs new[232:238]
#>   "                    \"mean\": {"
#>   "                        \"output_type_id\": {"
#>   "                            \"required\": null,"
#> - "                            \"optional\": [\"NA\"]"
#> + "                            \"optional\": [null]"
#>   "                        },"
#>   "                        \"value\": {"
#>   "                            \"type\": \"integer\","

# The resulting R objects after reading in the JSON are identical
identical(
  jsonlite::fromJSON(tasks, simplifyDataFrame = FALSE),
  jsonlite::fromJSON(tmp, simplifyDataFrame = FALSE)
)
#> [1] TRUE

Created on 2024-10-18 with reprex v2.1.1

Originally posted by @zkamvar in https://github.com/hubverse-org/schemas/issues/103#issuecomment-2423356765

annakrystalli commented 3 hours ago

I'm totally onboard with this! 💯

The lack of inter-language compatibility of NA has been brought up in the past, especially by @LucieContamin. It was discussed but somewhat shelved until the bridge needed crossing. I agree this seems like a very good opportunity to cross that bridge.

I think we could easily go one step cleaner: rather than a [null] array, just go for a null property altogether (i.e. required: null). In terms of implementation, expand_model_out_grid() already has functionality to convert task ID optional and required properties that are both null (used for task IDs which are relevant to one modeling task but might not be to another in the same round) to NAs, so I think it would be straightforward to apply that to output_type_ids too.
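As a rough illustration of the null-to-NA idea, here is a minimal base R sketch. The helper `get_output_type_ids()` and the two example configs are hypothetical names made up for this example; this is not how expand_model_out_grid() is actually implemented.

```r
# Hypothetical helper for illustration only; not the hubverse implementation.
# Takes the parsed "output_type_id" entry of an output type (a list with
# `required` and `optional` elements) and returns the IDs to expand over.
get_output_type_ids <- function(output_type_id) {
  ids <- c(output_type_id$required, output_type_id$optional)
  # Both properties null: the output type has no IDs, so use a missing value.
  if (is.null(ids)) NA else ids
}

# A point estimate config where both properties are null (assumed layout).
mean_cfg <- list(required = NULL, optional = NULL)
# A quantile-style config with explicit required IDs (assumed layout).
quantile_cfg <- list(required = c(0.25, 0.5, 0.75), optional = NULL)

mean_ids <- get_output_type_ids(mean_cfg)
quantile_ids <- get_output_type_ids(quantile_cfg)

# The missing value then flows into an expanded grid as an NA column entry.
grid <- expand.grid(output_type = "mean", output_type_id = mean_ids)
```

The point of the sketch is that a fully null property can be treated as "no ID" at a single place in the code, so config authors never need to write NA in any form.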

Other areas that would require work would be:

Overall, I think it's much cleaner, clearer and would make it much easier to communicate and explain the expectations for point estimate output type ids so worth the effort!