hubverse-org / schemas

JSON schemas for modeling hubs

Create JSON schema for hub tasks.json #1

Closed annakrystalli closed 1 year ago

annakrystalli commented 1 year ago

Re-opening this issue here as it seems the best place to track conversations. Originally opened in https://github.com/reichlab/hub-infrastructure-experiments/issues/3. The original pull request tracking development of the hubmeta schema (which is actually the tasks-schema.json) will shortly be submitted here: https://github.com/reichlab/hub-infrastructure-experiments/pull/4

The purpose of the schema is two-fold:

  1. Act as documentation of the expectations of a valid hubmeta.json.
  2. Serve as the schema that hubmeta JSON files are validated against.

PROs

CONs (within the hubUtils context)

Initial resources

annakrystalli commented 1 year ago

Results of experimentation with validating $ref pointers in our JSON documents.

While jsonvalidate::json_validate() offers a succinct way of accomplishing basic hubmeta.json validation, it is hard to validate $ref pointers in the JSON document being validated using functionality available through R.
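For concreteness, the succinct baseline looks something like this (file paths here are hypothetical):

# One-liner validation of a config file against a schema
jsonvalidate::json_validate("hubmeta.json", "hubmeta-schema.json",
                            engine = "ajv", verbose = TRUE)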

From researching how to encode in the schema the possibility of a $ref object in lieu of any other property type, it seems the standard way of handling this situation is to first resolve the pointers in the document being validated and then perform the validation using the schema. This actually makes sense: in the schema we want to encode the type and other criteria a property should match, which can only be validated once any pointers are resolved.

The problem is that we have not found functionality in R to resolve pointers (e.g. functionality equivalent to the Python solution @evan proposed here: https://github.com/reichlab/hub-infrastructure-experiments/blob/9d4889de34e5bc1e7df1b32a577caa6657c3384d/metadata-format-examples/complex-example/process-metadata-proposal.py#L1-L7).

The solution we currently have works on the R list after the file has been read into R, using the custom hubUtils function substitute_refs: https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/blob/ab81ae6b8afac11a52950f4272830ccd2e84a5e3/R/hubmeta-read.R#L80-L107

However, the jsonvalidate package functions work on JSON, so ideally we would want $ref pointer resolution to be carried out on the JSON-formatted data prior to validation and reading into R.

Current options identified

Hacky workaround to allow for $ref options in schema.json docs

One option would be to hard-code within the schema that any property could match either the defined expectation OR be a $ref object.

I managed to encode this successfully for the 3 $refs to location values in the modified complex example in the json-schema-refs branch of my fork. You can see the diffs between my original proposal and the workaround to handle refs here: https://github.com/annakrystalli/hub-infrastructure-experiments/pull/1/files
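As a minimal sketch of the idea (the property name here is hypothetical, not the actual hubmeta schema), each property is declared as a oneOf between its expected type and an object containing only a $ref pointer:

library(jsonvalidate)

# Hypothetical schema fragment: "location" may be either a string or an
# unresolved {"$ref": ...} object.
schema <- '{
  "type": "object",
  "properties": {
    "location": {
      "oneOf": [
        {"type": "string"},
        {
          "type": "object",
          "properties": {"$ref": {"type": "string"}},
          "required": ["$ref"],
          "additionalProperties": false
        }
      ]
    }
  }
}'

# Both a literal value and an unresolved $ref object now pass validation,
# which is exactly the weakness: the referenced content is never checked.
jsonvalidate::json_validate('{"location": "US"}', schema, engine = "ajv")
jsonvalidate::json_validate('{"location": {"$ref": "#/defs/locations"}}',
                            schema, engine = "ajv")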

Cons (many!)
Pros

Probably the single pro is that it will all be handled in the schema.json file, but there is not much else good about this approach.

Resolving refs in R and re-converting to JSON.

The suggested workflow is to read the JSON config file into a list in R, resolve any refs with hubUtils:::substitute_refs(), convert back to JSON, and then perform validation using jsonvalidate.
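In outline, the round trip would look something like the below (file names are hypothetical, and the exact signature of the internal substitute_refs() helper is assumed from the link above):

# JSON -> R list
config <- jsonlite::read_json("hubmeta.json")
# Resolve $ref pointers with the internal hubUtils helper
resolved <- hubUtils:::substitute_refs(config)
# R list -> JSON (subject to the serialisation caveats discussed below)
json <- jsonlite::toJSON(resolved)
# Validate the re-serialised JSON against the schema
jsonvalidate::json_validate(json, "hubmeta-schema.json", engine = "ajv")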

This also feels really hacky and wasteful up front. The only saving grace is that such a workflow (i.e. reading into R and converting to JSON) may be required anyway to validate yml config files, as I've not found (so far) any yml equivalent of jsonvalidate.

However, experimentation with this is also throwing up issues to do with how R serialises back to JSON.

In particular, the issue is with serialisation of vectors of length 1. The standard behaviour of toJSON is to serialise all length-one vectors to arrays with one element. As a result, fields originally encoded as simple "key": "value" pairs in JSON are re-serialised as "key": ["value"] arrays and fail validation, as the schema is expecting not an array but a single value of a specific type.

A way to switch off this behaviour is to use auto_unbox = TRUE in toJSON, which means all vectors of length 1 are converted to "key": "value" pairs. This creates the opposite problem: properties defined as single-element arrays are now encoded as "key": "value" pairs, and these fail validation instead.
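Both failure modes in a nutshell (field names are hypothetical):

library(jsonlite)

x <- list(team_name = "hubverse", locations = "US")

# Default: every length-1 vector is boxed, so fields the schema expects
# as scalars come back as one-element arrays
toJSON(x)
#> {"team_name":["hubverse"],"locations":["US"]}

# auto_unbox = TRUE: every length-1 vector is unboxed, so fields the
# schema expects as arrays come back as scalars instead
toJSON(x, auto_unbox = TRUE)
#> {"team_name":"hubverse","locations":"US"}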

One way to get around this is offered by jsonvalidate, which allows serialisation to be informed by our schema using the following code:


# Read JSON into an R list
complex_mod_path <- here::here("json-schema", 
                               "modified-hubmeta-examples", 
                               "complex-hubmeta-mod.json")
json_list <- jsonlite::read_json(complex_mod_path,
                                 simplifyVector = TRUE,
                                 simplifyDataFrame = FALSE)

# Create new schema instance
schema <- jsonvalidate::json_schema$new(
    schema = here::here("json-schema", "hubmeta-schema.json"),
    engine = "ajv")

# Use Schema to serialise list to JSON
json <- schema$serialise(json_list)

# Use Schema to validate JSON
schema$validate(json)

Nice! BUT!...

When trying to run this, it stumbles on the fact that schema$serialise(json_list) cannot handle null elements and returns:

Error in context_eval(join(src), private$context, serialize, await) : 
  TypeError: Cannot convert undefined or null to object

😩

Wrap a Python or other external method for resolving JSON references

Instant massive additional dependency overhead. I'm not a big fan of this approach unless it can be super lean, but happy to hear others' thoughts about it!

Conclusion

Sorry for the huge length of this investigation report, but I wanted to capture everything I tried to inform next steps.

At this stage, I'm actually leaning most towards opening an issue or two in jsonvalidate and seeing how amenable the authors would be to:

Beyond that, happy to hear other folks' thoughts!

Obviously, the fallback is to write code to do all the validation against the schema within hubUtils ourselves. This would remove the jsonvalidate (and hence V8) dependency, but it also feels like a big task given that a one-liner using functions in jsonvalidate would (almost!) do the job!

annakrystalli commented 1 year ago

Have updated my notebook with some code and output from my experimentations: https://annakrystalli.me/hub-infrastructure-experiments/json-schema/jsonvalidate.html

Note: all of this was run in this repo and branch: https://github.com/annakrystalli/hub-infrastructure-experiments/tree/json-schema-refs

elray1 commented 1 year ago

It seems like the second approach, resolving refs in R and converting back to JSON, is probably the way to go. The drawbacks you noted for the other options (not actually validating the contents of referenced objects, introducing dependencies on other languages like Python, and having the whole problem over again if we do want to support yaml) are pretty severe.

Your suggestion about adding a feature to jsonvalidate to handle conversion of null objects makes sense. I wonder if another option might be to simply remove NULL fields from the R list before serialising, since omitting an optional field should validate where a null value may not.

The other request for jsonvalidate, about resolving references before validating, also makes sense. It seems like it would be a good feature for that package to have, but it feels like it might be a larger request (?) and also might not help if we want to validate yaml (?).

annakrystalli commented 1 year ago

Nice idea about just removing NULL fields! That should work nicely.
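For reference, a minimal sketch of that idea, recursing through the list and dropping NULL entries before handing it to schema$serialise() (the helper name is made up):

# Recursively drop NULL entries from a nested list
drop_nulls <- function(x) {
  if (!is.list(x)) return(x)
  x <- lapply(x, drop_nulls)
  x[!vapply(x, is.null, logical(1))]
}

# Serialise and validate as before, minus the NULL fields
json <- schema$serialise(drop_nulls(json_list))
schema$validate(json)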

Re the validation of yaml: to use the jsonvalidate functions we would have to convert yaml files to JSON anyway, so that shouldn't add any extra steps beyond what we'd have to perform regardless.
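Something along these lines (assuming the yaml package; file names are hypothetical):

# YAML -> R list -> JSON, then validate as before
# (subject to the same unboxing caveats discussed above)
config <- yaml::read_yaml("hubmeta.yml")
json <- jsonlite::toJSON(config, auto_unbox = TRUE)
jsonvalidate::json_validate(json, "hubmeta-schema.json", engine = "ajv")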

annakrystalli commented 1 year ago

FYI, I've opened a couple of issues in jsonvalidate regarding some of the problems we've been having:

annakrystalli commented 1 year ago

Been looking at workarounds for outstanding issues with jsonvalidate and have encountered an additional problem. Even when using the schema to re-serialise, if a property's type has multiple options that include "array" (of specific interest to us is where the type can be null or array, specified as "type": ["null", "array"]), the "array" option seems to suppress unboxing of a NULL value, which ends up being converted to [null] instead of just null. 😭
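A minimal sketch of the behaviour being described (the schema and field name are made up, and the expected/observed lines reflect the report above rather than verified output):

# Schema where a property may be null or an array
schema <- jsonvalidate::json_schema$new(
  '{"type": "object",
    "properties": {"max_change": {"type": ["null", "array"]}}}',
  engine = "ajv")

schema$serialise(list(max_change = NULL))
#> expected: {"max_change": null}
#> observed: {"max_change": [null]}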