Closed by DSuveges 5 months ago
Priority datasets to focus on: EPMC submission, ChEMBL datasets (primarily molecule), interactions, post-pipeline evidence.
Pending dataset: probes. EPMC and interactions would have a big impact, but right now validation on these datasets would take too long.
@ireneisdoomed is this still in the queue?
The data team has been rigorous in adding new schemas to validate the new data types we have incorporated, and having schemas has proved useful for letting our providers validate their data. Moving forward, however, I wouldn't extend their use to internal datasets in PIS or the ETL. JSON schemas are complex to maintain; a validation strategy based on Spark schemas, as we do for the Genetics project, is a better option.
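For context, a minimal sketch of that Spark-schema strategy, assuming a hypothetical evidence dataset (the field names and path are illustrative, not the real schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema for an evidence dataset; real field names will differ.
evidence_schema = StructType([
    StructField("targetId", StringType(), nullable=False),
    StructField("diseaseId", StringType(), nullable=False),
    StructField("score", DoubleType(), nullable=True),
])

# FAILFAST raises on records that do not match the schema
# instead of silently turning them into nulls.
evidence = (
    spark.read
    .schema(evidence_schema)
    .option("mode", "FAILFAST")
    .json("path/to/evidence.json")
)

# Spark is lazy: an action (e.g. evidence.count()) triggers the validation.
```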
You still need to maintain the Spark schemas as well, but more importantly I see value in the semantic checks that JSON Schema allows, e.g. accepting only strings that match a certain pattern in a given column. Ideally, I would love to see a repository of Pydantic field definitions, e.g. for target; then every data type that has a "target" field would use the same object for its definition. A sketch of what that could look like is below.
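A hypothetical sketch of such a shared-definition repository, assuming Pydantic v2 (the class names and the Ensembl gene ID pattern are illustrative):

```python
from pydantic import BaseModel, Field

# Shared, reusable target definition with a semantic check:
# the id must match an Ensembl gene ID pattern (illustrative).
class Target(BaseModel):
    id: str = Field(..., pattern=r"^ENSG\d{11}$")

# Any data type with a "target" field reuses the same object.
class DiseaseTargetEvidence(BaseModel):
    target: Target
    diseaseId: str

class TargetSafety(BaseModel):
    target: Target
    event: str
```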
Pydantic is a good alternative, I agree. In any case, the point is to move away from JSON schemas.
Not planned.
Currently we can validate interactions sourced from IntAct, and all the disease-target evidence, based on the JSON schemas stored in the https://github.com/opentargets/json_schema repo. Having a means to assess the validity of our data is advantageous, so it makes sense to create schema definitions for other data types, e.g. target safety, drugs, etc.
This effort could involve exploring further options for JSON schema definitions beyond the tools our validator currently uses. The biggest issue is that the current tools don't report why a particular document violates the schema.
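As one possible direction, a sketch using the Python `jsonschema` library (the schema and document here are placeholders): its `iter_errors` surfaces where and why each document fails, which is exactly the information the current tools don't give us.

```python
from jsonschema import Draft7Validator

# Placeholder schema and document, for illustration only.
schema = {
    "type": "object",
    "properties": {"targetId": {"type": "string", "pattern": "^ENSG"}},
    "required": ["targetId"],
}
document = {"targetId": 12345}

validator = Draft7Validator(schema)
for error in validator.iter_errors(document):
    # absolute_path locates the failing field; message explains the violation.
    path = ".".join(str(p) for p in error.absolute_path) or "<root>"
    print(f"{path}: {error.message}")
```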