hubverse-org / schemas

JSON schemas for modeling hubs

Create JSON schema for hub tasks.json #1

Closed annakrystalli closed 1 year ago

annakrystalli commented 1 year ago

Re-opening this issue here as it seems the best place to track conversations. Originally opened in https://github.com/reichlab/hub-infrastructure-experiments/issues/3. The original pull request tracking development of the hubmeta schema (which is actually the tasks-schema.json) will shortly be submitted here: https://github.com/reichlab/hub-infrastructure-experiments/pull/4

The purpose of the schema is two-fold:

  1. Act as documentation of the expectations of a valid hubmeta.json.
  2. Serve as the schema that hubmeta JSON files are validated against.

PROs

CONs (within the hubUtils context)

Initial resources

annakrystalli commented 1 year ago

Results of experimentation with validating $ref pointers in our JSON documents.

While jsonvalidate::json_validate() offers a succinct way of accomplishing basic hubmeta.json validation, it is hard to validate $ref pointers in the JSON document being validated using functionality available through R.
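For concreteness, the succinct baseline looks something like this (file paths here are hypothetical):

# One-liner validation of a config file against a schema
jsonvalidate::json_validate("hubmeta.json", "hubmeta-schema.json",
                            engine = "ajv", verbose = TRUE)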

From researching how to encode in the schema the possibility of a $ref object in lieu of any other property type, it seems the standard way of handling this situation is to first resolve the pointers in the document being validated and then perform the validation using the schema. This actually makes sense: in the schema we want to encode the type and other criteria a property should match, which can only be validated once any pointers are resolved.

The problem is that we have not found functionality in R to resolve pointers (e.g. functionality equivalent to the Python solution @evan proposed here: https://github.com/reichlab/hub-infrastructure-experiments/blob/9d4889de34e5bc1e7df1b32a577caa6657c3384d/metadata-format-examples/complex-example/process-metadata-proposal.py#L1-L7).

The solution we currently have works on the R list after the file has been read into R, using the custom hubUtils function substitute_refs: https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/blob/ab81ae6b8afac11a52950f4272830ccd2e84a5e3/R/hubmeta-read.R#L80-L107

However, the jsonvalidate package functions work on JSON, so ideally we would want $ref pointer resolution to be carried out on the JSON-formatted data prior to validation and reading into R.

Current options identified

Hacky workaround to allow for $ref options in schema.json docs

One option would be to hard-code within the schema that any property could match either the defined expectation OR be a $ref object.

I managed to encode this successfully for the 3 $refs to location values in the modified complex example in the json-schema-refs branch of my fork. You can see the diffs between my original proposal and the workaround to handle refs here: https://github.com/annakrystalli/hub-infrastructure-experiments/pull/1/files
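As a minimal sketch of the idea (the property name here is hypothetical, not the actual hubmeta schema), each property is declared as a oneOf between its expected type and an object containing only a $ref pointer:

library(jsonvalidate)

# Hypothetical schema fragment: "location" may be either a string or an
# unresolved {"$ref": ...} object.
schema <- '{
  "type": "object",
  "properties": {
    "location": {
      "oneOf": [
        {"type": "string"},
        {
          "type": "object",
          "properties": {"$ref": {"type": "string"}},
          "required": ["$ref"],
          "additionalProperties": false
        }
      ]
    }
  }
}'

# Both a literal value and an unresolved $ref object now pass validation,
# which is exactly the weakness: the referenced content is never checked.
jsonvalidate::json_validate('{"location": "US"}', schema, engine = "ajv")
jsonvalidate::json_validate('{"location": {"$ref": "#/defs/locations"}}',
                            schema, engine = "ajv")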

Cons (many!)
Pros

Probably the single pro is that it will all be handled in the schema.json file, but there is not much else good about this approach.

Resolving refs in R and re-converting to JSON.

The suggested workflow is to read the JSON config file into a list in R, resolve any refs with hubUtils:::substitute_refs(), convert back to JSON, and then perform validation using jsonvalidate.
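In outline, the round trip would look something like the below (file names are hypothetical, and the exact signature of the internal substitute_refs() helper is assumed from the link above):

# JSON -> R list
config <- jsonlite::read_json("hubmeta.json")
# Resolve $ref pointers with the internal hubUtils helper
resolved <- hubUtils:::substitute_refs(config)
# R list -> JSON (subject to the serialisation caveats discussed below)
json <- jsonlite::toJSON(resolved)
# Validate the re-serialised JSON against the schema
jsonvalidate::json_validate(json, "hubmeta-schema.json", engine = "ajv")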

This also feels really hacky and wasteful up front. The only saving grace is that such a workflow (i.e. reading into R and converting to JSON) may be required anyway to validate yml config files, as I've not found (so far) any yml equivalent of jsonvalidate.

However, experimentation with this is also throwing up issues to do with how R serialises back to JSON.

In particular, the issue is with serialisation of vectors of length 1. The standard behaviour of toJSON is to serialise all length-one vectors to arrays with one element. As a result, fields originally encoded as simple "key": "value" pairs in JSON are re-serialised as "key": ["value"] arrays and fail validation, as the schema is expecting not an array but a single value of a specific type.

A way to switch off this behaviour is to use auto_unbox = TRUE in toJSON, which means all vectors of length 1 are converted to "key": "value" pairs. This creates the opposite problem: properties defined as single-element arrays are now encoded as "key": "value" pairs, and these fail validation instead.
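Both failure modes in a nutshell (field names are hypothetical):

library(jsonlite)

x <- list(team_name = "hubverse", locations = "US")

# Default: every length-1 vector is boxed, so fields the schema expects
# as scalars come back as one-element arrays
toJSON(x)
#> {"team_name":["hubverse"],"locations":["US"]}

# auto_unbox = TRUE: every length-1 vector is unboxed, so fields the
# schema expects as arrays come back as scalars instead
toJSON(x, auto_unbox = TRUE)
#> {"team_name":"hubverse","locations":"US"}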

One way to get around this is offered by jsonvalidate, which allows serialisation to be informed by our schema using the following code:


# Read JSON into an R list
complex_mod_path <- here::here("json-schema", 
                               "modified-hubmeta-examples", 
                               "complex-hubmeta-mod.json")
json_list <- jsonlite::read_json(complex_mod_path,
                                 simplifyVector = TRUE,
                                 simplifyDataFrame = FALSE)

# Create new schema instance
schema <- jsonvalidate::json_schema$new(
    schema = here::here("json-schema", "hubmeta-schema.json"),
    engine = "ajv")

# Use Schema to serialise list to JSON
json <- schema$serialise(json_list)

# Use Schema to validate JSON
schema$validate(json)

Nice! BUT!...

When trying to run this, it stumbles on the fact that schema$serialise(json_list) cannot handle null elements and returns:

Error in context_eval(join(src), private$context, serialize, await) : 
  TypeError: Cannot convert undefined or null to object

😩

Wrap a Python or other external method for resolving JSON references

Instant massive additional dependency overhead. I'm not a big fan of this approach unless it can be super lean, but happy to hear others' thoughts about it!

Conclusion

Sorry for the huge length of this investigation report, but I wanted to capture everything I tried to inform next steps.

At this stage, I'm actually leaning most towards opening an issue or two in jsonvalidate and seeing how amenable the authors would be to:

Beyond that, happy to hear other folks' thoughts!

Obviously, the fallback is to write code to do all the validation against the schema within hubUtils ourselves. This would remove the jsonvalidate (and hence V8) dependency, but it also feels like a big task given that a one-liner using functions in jsonvalidate would (almost!) do the job!

annakrystalli commented 1 year ago

Have updated my notebook with some code and output from my experimentations: https://annakrystalli.me/hub-infrastructure-experiments/json-schema/jsonvalidate.html

Note: all of this was run in this repo and branch: https://github.com/annakrystalli/hub-infrastructure-experiments/tree/json-schema-refs

elray1 commented 1 year ago

It seems like the second approach, resolving refs in R and converting back to JSON, is probably the way to go. The drawbacks you noted for the other options (not actually validating the contents of referenced objects, introducing dependencies on other languages like Python, and having the whole problem over again if we do want to support yaml) are pretty severe.

Your suggestion about adding a feature to jsonvalidate to handle conversion of null objects makes sense. I wonder if another option might be to simply remove NULL fields from the R list before serialising, since omitting an optional field should validate where a null value may not.

The other request for jsonvalidate, about resolving references before validating, also makes sense. It seems like it would be a good feature for that package to have, but it feels like it might be a larger request (?) and also might not help if we want to validate yaml (?).

annakrystalli commented 1 year ago

Nice idea about just removing NULL fields! That should work nicely.
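For reference, a minimal sketch of that idea, recursing through the list and dropping NULL entries before handing it to schema$serialise() (the helper name is made up):

# Recursively drop NULL entries from a nested list
drop_nulls <- function(x) {
  if (!is.list(x)) return(x)
  x <- lapply(x, drop_nulls)
  x[!vapply(x, is.null, logical(1))]
}

# Serialise and validate as before, minus the NULL fields
json <- schema$serialise(drop_nulls(json_list))
schema$validate(json)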

Re the validation of yaml: to use the jsonvalidate functions we would have to convert yaml files to JSON anyway, so that shouldn't add any extra steps beyond what we'd have to perform regardless.
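Something along these lines (assuming the yaml package; file names are hypothetical):

# YAML -> R list -> JSON, then validate as before
# (subject to the same unboxing caveats discussed above)
config <- yaml::read_yaml("hubmeta.yml")
json <- jsonlite::toJSON(config, auto_unbox = TRUE)
jsonvalidate::json_validate(json, "hubmeta-schema.json", engine = "ajv")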

annakrystalli commented 1 year ago

FYI, I've opened a couple of issues in jsonvalidate regarding some of the problems we've been having:

annakrystalli commented 1 year ago

Been looking at workarounds for outstanding issues with jsonvalidate and have encountered an additional problem. Even when using the schema to re-serialise, if a property's type has multiple options that include "array" (of specific interest to us is where the type can be null or array, specified as "type": ["null", "array"]), the "array" option seems to suppress unboxing of a NULL value, which ends up being converted to [null] instead of just null. 😭
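A minimal sketch of the behaviour being described (the schema and field name are made up, and the expected/observed lines reflect the report above rather than verified output):

# Schema where a property may be null or an array
schema <- jsonvalidate::json_schema$new(
  '{"type": "object",
    "properties": {"max_change": {"type": ["null", "array"]}}}',
  engine = "ajv")

schema$serialise(list(max_change = NULL))
#> expected: {"max_change": null}
#> observed: {"max_change": [null]}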