hubverse-org / hubAdmin

Utilities for administering hubverse Infectious Disease Modeling Hubs
https://hubverse-org.github.io/hubAdmin/

Add `write_config()` function (#3) #40

Closed annakrystalli closed 2 months ago

annakrystalli commented 2 months ago

This PR adds a `write_config()` function for writing `config` class objects of programmatically created task configurations out to JSON files.

It's been put off for some time because it's almost impossible to write a valid config file due to inconsistencies between R and JSON data types, in particular the fact that R has no concept of a scalar. As such, some properties in the output file will likely not conform to schema expectations: they might be an `array` where a `scalar` is required, or vice versa.
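To make the array/scalar mismatch concrete, here's a minimal jsonlite illustration (not hubAdmin internals, just the underlying serialisation behaviour):

``` r
library(jsonlite)

x <- list(schema_version = "v3.0.0")

# R has no scalar type, so a length-one vector serialises to a JSON array by
# default, even where the schema expects a bare string.
toJSON(x)
#> {"schema_version":["v3.0.0"]}

# auto_unbox = TRUE gives a scalar here, but is no silver bullet: a property
# that should be a one-element array would then be unboxed to a scalar too.
toJSON(x, auto_unbox = TRUE)
#> {"schema_version":"v3.0.0"}
```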

I tried to make use of the `jsonvalidate::json_serialise()` functionality, which can help serialise R objects in a schema-aware way, but I gave up because:

  1. When running it with our schema and config I get the following error, which I've not been able to debug, even with the help of ChatGPT on the jsonvalidate JavaScript source code:
    Error: TypeError: Cannot convert undefined or null to object
  2. `jsonvalidate::json_serialise()` has for some time now (since last year) only been available in the development version of jsonvalidate, with no indication of when it might be pushed to CRAN.
  3. Ultimately, it would not be 100% guaranteed correct because `json_serialise()` cannot make use of the `oneOf` statements we use in our schema.
See code

``` r
library(hubAdmin)

rounds <- create_rounds(
  create_round(
    round_id_from_variable = TRUE,
    round_id = "origin_date",
    model_tasks = create_model_tasks(
      create_model_task(
        task_ids = create_task_ids(
          create_task_id(
            "origin_date",
            required = NULL,
            optional = c("2023-01-02", "2023-01-09", "2023-01-16")
          ),
          create_task_id(
            "location",
            required = "US",
            optional = c("01", "02", "04", "05", "06")
          ),
          create_task_id(
            "horizon",
            required = 1L,
            optional = 2:4
          )
        ),
        output_type = create_output_type(
          create_output_type_mean(
            is_required = TRUE,
            value_type = "double",
            value_minimum = 0L
          )
        ),
        target_metadata = create_target_metadata(
          create_target_metadata_item(
            target_id = "inc hosp",
            target_name = "Weekly incident influenza hospitalizations",
            target_units = "rate per 100,000 population",
            target_keys = NULL,
            target_type = "discrete",
            is_step_ahead = TRUE,
            time_unit = "week"
          )
        )
      )
    ),
    submissions_due = list(relative_to = "origin_date", start = -4L, end = 2L)
  )
)

schema <- hubAdmin:::download_tasks_schema(format = "json")
config <- create_config(rounds)

jsonvalidate::json_serialise(config, schema)
#> Error in eval(expr, envir, enclos): TypeError: Cannot convert undefined or null to object
```

As such, I've added a warning about this in the docs, and `write_config()` also issues a message on write, advising users of the pitfalls and directing them to `validate_config()` to validate the config files written out and identify any deviations.
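For context, the intended workflow looks roughly like this (argument names are indicative only; see the function docs):

``` r
library(hubAdmin)

# Write the programmatically built config (the `config` object from the
# example above) to <hub_path>/hub-config/tasks.json. A message about the
# array/scalar pitfalls is issued on write.
write_config(config, hub_path = "path/to/hub")

# Validate the written file against the schema and fix by hand any
# array-vs-scalar deviations reported in the validation errors table.
validate_config(hub_path = "path/to/hub", config = "tasks")
```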

The PR therefore resolves #3 and completes #7

github-actions[bot] commented 2 months ago

🚀 Deployed on https://66d81c12064dad65fbd733eb--hubadmin-pr-previews.netlify.app

annakrystalli commented 2 months ago

Thanks @zkamvar !

> Question: Are round trips valid? That is, if you read in a valid config file, modify it in some way (e.g. updating cdf values), and write it out again, would it still be valid?

Unfortunately not. These array/scalar errors will be re-introduced every time the file is re-written.
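To illustrate why, here's a plain-jsonlite sketch of the round trip (not the actual hubAdmin read/write code):

``` r
library(jsonlite)

# A fragment of a valid config: "required" is a one-element array.
json_in <- '{"schema_version":"v3.0.0","horizon":{"required":[1]}}'

x <- fromJSON(json_in)  # [1] is simplified to a length-one vector
toJSON(x, auto_unbox = TRUE)
#> {"schema_version":"v3.0.0","horizon":{"required":1}}

# "required" comes back out as a scalar, so the schema's array requirement is
# no longer met; without auto_unbox, "schema_version" would instead come back
# as an array where a scalar is required.
```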

> Comment: (Please let me know if this has been discussed and ruled out before; I wasn't able to find anything when searching GitHub.) This is likely best addressed in a separate issue in the schemas repository, but I'm noticing that one of the problems is that we cannot perform schema validation until the config is fully formed. I'm wondering if it would be prudent to create sub-schemas for the output types, since these seem to cause the most headaches (e.g. your comment about `json_serialise()` not being able to use `oneOf` clauses).

There is validation going on in the background, but admittedly it in places duplicates in R what jsonvalidate validation could perform. That was partly because the output I wanted from the `create_*()` family of functions was different from the output required for the validation errors table.

> If I understand them correctly, if we use sub-schemas (as demonstrated in https://github.com/ropensci/jsonvalidate/issues/61 using the Direct Reference keyword), we could apply validation during creation of the individual pieces that we know we can validate. This way, it might be easier for us to apply workarounds, since the number of variables that can cause an invalid file is smaller.

I'd certainly be up for streamlining such code, and sub-schemas might be a good option. Here are some concerns I have:

It's probably worth an issue for further discussion somewhere else.
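If we did go down that route, the validation of individual pieces could look something like the rough sketch below (the reference pointer and the example output type JSON are hypothetical, assuming the schema exposed a definition for a single output type; jsonvalidate's `reference` argument requires the ajv engine):

``` r
library(jsonvalidate)

schema <- hubAdmin:::download_tasks_schema(format = "json")

# Build a validator that targets just one definition within the full schema,
# so each piece can be checked as soon as it is created.
validate_output_type <- json_validator(
  schema,
  engine = "ajv",
  reference = "#/definitions/output_type" # hypothetical pointer
)

# Hypothetical JSON for a single mean output type (structure illustrative only).
output_type_json <- '{
  "mean": {
    "output_type_id": {"required": null, "optional": null},
    "value": {"type": "double", "minimum": 0}
  }
}'

validate_output_type(output_type_json, verbose = TRUE)
```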