Closed by DSuveges 5 months ago
Priority datasets to focus on: EPMC submission, ChEMBL datasets (primarily molecule), interactions, post-pipeline evidence.
Pending dataset: probes. EPMC and interactions would have a big impact, but right now validation on these datasets would take too long.
@ireneisdoomed is this still in the queue?
The data team has been rigorous in adding new schemas to validate the new data types we have incorporated, and having schemas has proved useful for letting our providers validate their data. Moving forward, however, I wouldn't extend their use to internal datasets in PIS or the ETL. JSON schemas are complex to maintain; a validation strategy based on Spark schemas, as we do for the Genetics project, is a better option.
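For context, a minimal sketch of that Spark-schema strategy, assuming a hypothetical evidence dataset (the field names and path are illustrative, not the real schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema for an evidence dataset; real field names will differ.
evidence_schema = StructType([
    StructField("targetId", StringType(), nullable=False),
    StructField("diseaseId", StringType(), nullable=False),
    StructField("score", DoubleType(), nullable=True),
])

# FAILFAST raises on records that do not match the schema
# instead of silently turning them into nulls.
evidence = (
    spark.read
    .schema(evidence_schema)
    .option("mode", "FAILFAST")
    .json("path/to/evidence.json")
)

# Spark is lazy: an action (e.g. evidence.count()) triggers the validation.
```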
You still need to maintain the Spark schemas as well, but more importantly I see value in the semantic checks that JSON Schema allows, e.g. accepting only strings that match a certain pattern in a given column. Ideally, I would love to see a repository of Pydantic field definitions, e.g. for target; then every data type that has a "target" field would use the same object for its definition. A sketch of what that could look like is below.
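A hypothetical sketch of such a shared-definition repository, assuming Pydantic v2 (the class names and the Ensembl gene ID pattern are illustrative):

```python
from pydantic import BaseModel, Field

# Shared, reusable target definition with a semantic check:
# the id must match an Ensembl gene ID pattern (illustrative).
class Target(BaseModel):
    id: str = Field(..., pattern=r"^ENSG\d{11}$")

# Any data type with a "target" field reuses the same object.
class DiseaseTargetEvidence(BaseModel):
    target: Target
    diseaseId: str

class TargetSafety(BaseModel):
    target: Target
    event: str
```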
Pydantic is a good alternative, I agree. In any case, the point is to move away from JSON schemas.
Not planned.
Currently we can validate interactions sourced from IntAct, and all the disease-target evidence, based on the JSON schemas stored in the https://github.com/opentargets/json_schema repo. Having a means to assess the validity of our data is advantageous, so it makes sense to create schema definitions for other data types, e.g. target safety, drugs, etc.
This effort could involve exploring further options for JSON schema definitions beyond the tools our validator currently uses. The biggest issue is that the current tools don't report why a particular document violates the schema.
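As one possible direction, a sketch using the Python `jsonschema` library (the schema and document here are placeholders): its `iter_errors` surfaces where and why each document fails, which is exactly the information the current tools don't give us.

```python
from jsonschema import Draft7Validator

# Placeholder schema and document, for illustration only.
schema = {
    "type": "object",
    "properties": {"targetId": {"type": "string", "pattern": "^ENSG"}},
    "required": ["targetId"],
}
document = {"targetId": 12345}

validator = Draft7Validator(schema)
for error in validator.iter_errors(document):
    # absolute_path locates the failing field; message explains the violation.
    path = ".".join(str(p) for p in error.absolute_path) or "<root>"
    print(f"{path}: {error.message}")
```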