sodascience / metasyn

Transparent and privacy-friendly synthetic data generation
https://metasyn.readthedocs.io
MIT License

Version Generative Metadata Format based on metasyn python package #151

Open vankesteren opened 1 year ago

vankesteren commented 1 year ago

Introduction

Since #103 we have not had static JSON schema files available; instead, we validate the JSON files internally against dynamically generated schema files (see here). We did this because manually writing these JSON schemas, and updating them every time we change this Python package, takes a lot of effort.

However, I think it is still good to have a versioned JSON schema available somewhere, for FAIRness, mainly for interoperability with other software such as existing JSON validators and this thing.

Proposal

Since we already have a lot of the necessary infrastructure in place, my proposal is as follows:

This aligns well with our recent changes: the addition of a CLI (#142) and the inclusion of a Docker container (#150).

Changes

The changes are mainly in the validation script (using __version__ for the schema base). We could include schema generation in the CLI, which would make the GitHub action simpler and allow us to (re)generate previous schema versions using our versioned Docker containers. Most of the work will be in creating the GitHub action that pushes to a different repo (which we already do for our website, so that's definitely possible).
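To make the "__version__ for the schema base" idea concrete, here is a minimal sketch of stamping a generated schema with a versioned `$id` and writing it to a versioned file. Everything named here is an assumption for illustration: the base URL, the file naming, and the helpers `schema_id` and `write_versioned_schema` are hypothetical and not part of metasyn, and the tiny `example_schema` stands in for the real dynamically generated schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical base URL for the separate schema repository (an assumption).
SCHEMA_BASE_URL = "https://example.org/generative-metadata-format"


def schema_id(version: str) -> str:
    """Build a versioned `$id` from a package version string such as "1.0.2"."""
    return f"{SCHEMA_BASE_URL}/{version}/gmf.schema.json"


def write_versioned_schema(schema: dict, version: str, out_dir: Path) -> Path:
    """Stamp the schema with a versioned `$id` and write it to a versioned file."""
    stamped = {
        **schema,
        # Declares the JSON Schema dialect so external validators know how to read it.
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "$id": schema_id(version),
    }
    out_path = out_dir / f"gmf_schema_v{version}.json"
    out_path.write_text(json.dumps(stamped, indent=2))
    return out_path


# Stand-in for the dynamically generated schema; the real one is far larger.
example_schema = {"type": "object", "title": "Generative Metadata Format"}

# Stand-in for the package's `__version__`.
example_version = "1.0.2"

with tempfile.TemporaryDirectory() as tmp:
    path = write_versioned_schema(example_schema, example_version, Path(tmp))
    written = json.loads(path.read_text())
```

A GitHub action (or a CLI subcommand, if one is added) could call something like `write_versioned_schema` once per release and push the resulting file to the schema repository; running the same step inside an older versioned container would regenerate the schema for that release.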

The GMF repository will be kind of "broken", as the versions will change. I think this is okay.

Out of scope

We should think about how to include / deal with plugins at a later date.

vankesteren commented 1 month ago

A relevant package we might want to look at in relation to this is pydantic, which does a lot of nice data validation and automatically builds JSON schemas.