Study index gentropy ETL step

d0choa commented 3 days ago

Currently, study indexes are created from different sources (e.g., GWAS Catalog, eQTL Catalogue, FinnGen) and written to a GCP bucket. Each study index is self-contained within its data source, independent of other datasets, and contains all the necessary study metadata.

When a release is due, the study indexes are moved to a central location using Airflow operators and are treated as the final study index.

❯ gsutil ls gs://genetics_etl_python_playground/releases/24.06/study_index/
gs://genetics_etl_python_playground/releases/24.06/study_index/eqtl_catalogue/
gs://genetics_etl_python_playground/releases/24.06/study_index/finngen/
gs://genetics_etl_python_playground/releases/24.06/study_index/gwas_catalog/

We want to expand this functionality to evaluate the study indexes and validate whether the records in them contain valid entities.

We would want two main functionalities, similar to platform-etl:

entity validation, logging each invalid study and the reason it failed
recovery of obsolete IDs using the respective indexes, with a history log
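
To make the second point concrete, here is a minimal sketch of ID recovery with a history log. It is plain Python rather than the PySpark that gentropy would use, and every name in it (`resolve_id`, `ResolutionLog`, the `obsolete_to_current` mapping, the placeholder IDs) is hypothetical, not taken from platform-etl or gentropy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResolutionLog:
    """One history-log entry describing what happened to a source ID."""

    source_id: str
    resolved_id: Optional[str]
    reason: str


def resolve_id(source_id, current_ids, obsolete_to_current):
    """Return the current ID for source_id, plus a log entry.

    current_ids: set of IDs present in the index (hypothetical input).
    obsolete_to_current: dict of obsolete ID -> replacement ID (hypothetical).
    """
    # Case 1: the ID is already valid, nothing to recover.
    if source_id in current_ids:
        return source_id, ResolutionLog(source_id, source_id, "valid")

    # Case 2: the ID is obsolete but has a replacement that is itself valid.
    replacement = obsolete_to_current.get(source_id)
    if replacement in current_ids:
        return replacement, ResolutionLog(
            source_id, replacement, "obsolete ID replaced"
        )

    # Case 3: nothing could be recovered; the reason goes to the log.
    return None, ResolutionLog(source_id, None, "not found in index")


# Placeholder IDs, for illustration only.
current = {"ID_A", "ID_B"}
obsolete = {"ID_OLD": "ID_B"}
resolved, log_entry = resolve_id("ID_OLD", current, obsolete)
```

Collecting the `ResolutionLog` entries for a whole study index would give both the invalid-study report and the history of replaced IDs in one pass.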

In the context of the study index, the fields that need validation are:

geneId
backgroundTraitFromSourceMappedIds
traitFromSourceMappedIds

biosampleFromSourceId also requires validation, but because it currently contains both UBERON and CL codes, it might need additional consideration before it can be validated.
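
One possible way to handle the mixed codes is to group the IDs by ontology prefix first, so each group can be checked against its own index. The sketch below is plain Python and assumes the IDs look like a prefix followed by `_` or `:` and digits; that shape, the function names and the placeholder IDs are assumptions, not something confirmed by the current data.

```python
import re

# Assumed ID shape: ontology prefix, then "_" or ":", then digits.
_ONTOLOGY_ID = re.compile(r"^(UBERON|CL)[_:](\d+)$")


def biosample_ontology(biosample_id):
    """Return "UBERON" or "CL" for a well-formed ID, else None."""
    match = _ONTOLOGY_ID.match(biosample_id.strip())
    return match.group(1) if match else None


def group_by_ontology(biosample_ids):
    """Bucket IDs by ontology; anything unparseable goes to "unrecognised"."""
    groups = {"UBERON": [], "CL": [], "unrecognised": []}
    for biosample_id in biosample_ids:
        key = biosample_ontology(biosample_id) or "unrecognised"
        groups[key].append(biosample_id)
    return groups


# Placeholder IDs, for illustration only.
groups = group_by_ontology(["UBERON_0000001", "CL:0000002", "liver"])
```

The "unrecognised" bucket would be a natural source for the log of invalid studies.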

This might be the first of a series of actions to ensure every dataset is compatible in the context of the Platform-gentropy integration.

DSuveges commented 2 days ago

I think something like this should happen in the ingestion part:

validated_studies = (
    StudyIndex.from_parquet(session, study_paths)
    .validate_target(gene_index)
    .validate_disease(disease_index)
)

If any validation step fails, a flag indicating which test failed will be added to the study's qualityControls column. Downstream, the studies will be split into passed and failed (those carrying any quality control flag).
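
The flag-and-split behaviour can be sketched in plain Python as follows. This is an illustration only: the `Study` dataclass, the flag text and `split_by_quality` are hypothetical stand-ins, and it assumes a study without a geneId (e.g. a GWAS study) should not be flagged by the target check.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical flag text; the real wording would be decided in gentropy.
UNRESOLVED_TARGET = "Target/gene identifier not found in the gene index"


@dataclass
class Study:
    studyId: str
    geneId: Optional[str] = None
    qualityControls: List[str] = field(default_factory=list)


def validate_target(studies, valid_gene_ids):
    """Flag studies whose geneId is set but absent from the gene index."""
    for study in studies:
        # Assumption: a missing geneId is not an error in itself.
        if study.geneId is not None and study.geneId not in valid_gene_ids:
            study.qualityControls.append(UNRESOLVED_TARGET)
    return studies


def split_by_quality(studies):
    """Return (passed, failed); failed means any quality control flag is set."""
    passed = [s for s in studies if not s.qualityControls]
    failed = [s for s in studies if s.qualityControls]
    return passed, failed


# Placeholder studies and gene IDs, for illustration only.
studies = [Study("s1", "GENE_1"), Study("s2", "GENE_X"), Study("s3")]
passed, failed = split_by_quality(validate_target(studies, {"GENE_1"}))
```

Because flags accumulate in a list, several validation steps can be chained before the split, and the failed set keeps the reason for each failure.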

Related questions to answer:

Should we change geneId to targetId to remain consistent with the platform?
Should we have a targetFromSourceId, as evidence does, so that validation populates the targetId column?
The same applies to disease.

What would happen inside the methods is an open question, given the nested nature of the disease objects.
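
As one possible answer for the nested case, the sketch below validates a list-valued disease field element by element. It is plain Python, and the rule it applies (flag a study only when IDs were supplied and none of them resolved) is an assumption made for illustration, as are the function names, the flag text and the placeholder IDs.

```python
# Hypothetical flag text.
UNRESOLVED_DISEASE = "No valid disease identifier found"


def validate_disease_ids(mapped_ids, valid_disease_ids):
    """Split a list-valued disease field into (valid, invalid) IDs."""
    mapped_ids = mapped_ids or []  # treat a null array as empty
    valid = [d for d in mapped_ids if d in valid_disease_ids]
    invalid = [d for d in mapped_ids if d not in valid_disease_ids]
    return valid, invalid


def disease_flag(mapped_ids, valid_disease_ids):
    """Return a flag if IDs were supplied and none resolved, else None."""
    valid, invalid = validate_disease_ids(mapped_ids, valid_disease_ids)
    if invalid and not valid:
        return UNRESOLVED_DISEASE
    return None


# Placeholder IDs, for illustration only.
index = {"DISEASE_1", "DISEASE_2"}
valid, invalid = validate_disease_ids(["DISEASE_1", "DISEASE_9"], index)
```

A stricter alternative would flag a study as soon as any element is invalid; which rule is right depends on how partially mapped traits should be treated downstream.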

d0choa commented 1 day ago

On your snippet, you might want to create an invalid dataset as well.

Code reference to how the platform-etl does it, in case it's useful for inspiration

Regarding geneId and targetId: in an ideal world, I agree we should probably use targetFromSourceId and targetFromSourceMappedId for consistency. The API will turn targetFromSourceMappedId into a fully fledged target object.

The nesting of disease is really annoying

Study index gentropy ETL step #3359