d0choa opened this issue 3 days ago
I think something like this should happen in the ingestion part:

```python
validated_studies = (
    StudyIndex.from_parquet(session, study_paths)
    .validate_target(gene_index)
    .validate_disease(disease_index)
)
```
If any validation step fails, a flag indicating which test failed will be added to the study's `qualityControls` column. Downstream, the studies will be split into those that passed and those that failed (i.e. those carrying any quality-control flag).
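A minimal sketch of what one of these validators and the downstream split could look like, assuming the study index is backed by a Spark DataFrame with an initialised (possibly empty) `array<string>` `qualityControls` column. The flag text, the gene-index column name, and the variable names in the split are all hypothetical:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as f

INVALID_TARGET_FLAG = "Invalid target"  # hypothetical flag text


def validate_target(studies: DataFrame, genes: DataFrame) -> DataFrame:
    """Flag studies whose geneId does not resolve against the gene index."""
    known = genes.select("geneId").distinct().withColumn("_known", f.lit(True))
    return (
        studies.join(known, on="geneId", how="left")
        .withColumn(
            "qualityControls",
            f.when(
                # Only flag non-null geneIds that are absent from the gene index.
                f.col("geneId").isNotNull() & f.col("_known").isNull(),
                f.array_union("qualityControls", f.array(f.lit(INVALID_TARGET_FLAG))),
            ).otherwise(f.col("qualityControls")),
        )
        .drop("_known")
    )


# Downstream, the split is a filter on the flag column:
validated = validate_target(studies_df, gene_index_df)
passed = validated.filter(f.size("qualityControls") == 0)
failed = validated.filter(f.size("qualityControls") > 0)
```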
A related question to answer: should `geneId` be renamed to `targetId` to remain consistent with the platform? Or should the study index carry `targetFromSourceId`, like evidence does, so that validation populates the `targetId` column?

What would happen inside the methods is more questionable, given the nested nature of the disease objects.
In your snippet, you might want to create an invalid dataset as well.
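For instance, a hypothetical invalid fixture (all IDs below are made up) that should end up flagged:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A study whose geneId and trait mappings resolve against nothing.
invalid_study = spark.createDataFrame(
    [
        (
            "STUDY_BAD_1",       # studyId
            "ENSG00000000000",   # geneId absent from the gene index
            ["EFO_9999999"],     # trait mapping absent from the disease index
            [],                  # qualityControls starts empty
        )
    ],
    "studyId STRING, geneId STRING, "
    "traitFromSourceMappedIds ARRAY<STRING>, qualityControls ARRAY<STRING>",
)
```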
Code reference to how the platform-etl does it, in case it's useful for inspiration.
Regarding `geneId` and `targetId`: in an ideal world, I agree we should probably use `targetFromSourceId` and `targetFromSourceMappedId` for consistency. The API will turn `targetFromSourceMappedId` into a fully fledged `target` object.
The nesting of disease is really annoying.
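One way to cope with the array-typed trait mappings would be to explode them, anti-join against the disease index, and flag any study left with an unresolved mapping. A sketch under the assumption that the disease index exposes a `diseaseId` column; the flag text is again hypothetical:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as f

INVALID_DISEASE_FLAG = "Invalid disease"  # hypothetical flag text


def validate_disease(studies: DataFrame, diseases: DataFrame) -> DataFrame:
    """Flag studies with any trait mapping missing from the disease index."""
    unresolved = (
        studies
        .select("studyId", f.explode("traitFromSourceMappedIds").alias("diseaseId"))
        .join(diseases.select("diseaseId"), on="diseaseId", how="left_anti")
        .select("studyId")
        .distinct()
        .withColumn("_unresolved", f.lit(True))
    )
    return (
        studies.join(unresolved, on="studyId", how="left")
        .withColumn(
            "qualityControls",
            f.when(
                f.col("_unresolved"),
                f.array_union("qualityControls", f.array(f.lit(INVALID_DISEASE_FLAG))),
            ).otherwise(f.col("qualityControls")),
        )
        .drop("_unresolved")
    )
```

The same pattern would presumably apply to `backgroundTraitFromSourceMappedIds`.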
Currently, study indexes are created from different sources (e.g., GWAS Catalog, eQTL Catalogue, FinnGen) and written to a GCP bucket location. The study indexes are self-contained within the data sources, independent of other datasets, and contain all the necessary study metadata.
When a release is due, the study indexes are moved to a central location using Airflow operators and treated as the final study index.
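For illustration only, a rough sketch of such a move with the Google provider's `GCSToGCSOperator` (assuming Airflow 2.4+ with `apache-airflow-providers-google` installed); the DAG id, bucket names, and paths are placeholders rather than the real pipeline configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

with DAG(
    dag_id="move_study_indexes",  # placeholder
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    move_finngen = GCSToGCSOperator(
        task_id="move_finngen_study_index",
        source_bucket="source-bucket",              # placeholder
        source_object="finngen/study_index/*",      # placeholder
        destination_bucket="release-bucket",        # placeholder
        destination_object="study_index/finngen/",  # placeholder
    )
```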
We want to expand this functionality to evaluate the study indexes and validate that their records reference valid entities.
We would want two main functionalities, similar to platform-etl: flagging records that fail validation in the `qualityControls` column, and splitting the passed and failed studies downstream.
In the context of the study index, the fields requiring validation are:

- `geneId`
- `backgroundTraitFromSourceMappedIds`
- `traitFromSourceMappedIds`
Additionally, `biosampleFromSourceId` would also require validation, but because it currently contains both UBERON and CL codes, it might need additional consideration first.

This might be the first of a series of actions to ensure every dataset is compatible in the context of the Platform-gentropy integration.