opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Study index gentropy ETL step #3359

Open d0choa opened 3 days ago

d0choa commented 3 days ago

Currently, study indexes are created from different sources (e.g., GWAS Catalog, eQTL Catalogue, FinnGen) and written to a GCP bucket. Each study index is self-contained within its data source, independent of other datasets, and contains all the necessary study metadata.

When a release is due, the study indexes are moved to a central location using Airflow operators and treated as the final study index.

❯ gsutil ls gs://genetics_etl_python_playground/releases/24.06/study_index/
gs://genetics_etl_python_playground/releases/24.06/study_index/eqtl_catalogue/
gs://genetics_etl_python_playground/releases/24.06/study_index/finngen/
gs://genetics_etl_python_playground/releases/24.06/study_index/gwas_catalog/

We want to expand this functionality to evaluate the study indexes and validate whether the records in them contain valid entities.

We want two main functionalities, similar to platform-etl:

In the context of the study index, the fields to validate are:

Additionally, biosampleFromSourceId would also require validation, but because it currently contains both UBERON and CL codes, it might require additional considerations before validation.
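
One possible first step for the mixed biosample column is to group identifiers by ontology before validating each group against its own reference. The sketch below is plain Python rather than gentropy code. It assumes underscore-separated identifiers such as `UBERON_...` and `CL_...`; the function name `split_by_ontology` and the example IDs are hypothetical.

```python
from collections import defaultdict

# Assumption: the ontology prefix is the part of the ID before the first underscore.
KNOWN_PREFIXES = ("UBERON", "CL")


def split_by_ontology(biosample_ids):
    """Group biosample IDs by ontology prefix; anything else goes under 'other'."""
    groups = defaultdict(list)
    for biosample_id in biosample_ids:
        prefix = biosample_id.split("_", 1)[0]
        key = prefix if prefix in KNOWN_PREFIXES else "other"
        groups[key].append(biosample_id)
    return dict(groups)


# Illustrative identifiers only.
ids = ["UBERON_0002107", "CL_0000235", "EFO_0000001"]
grouped = split_by_ontology(ids)
```

Each group could then be checked against the matching ontology, with the `other` group flagged for review.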

This might be the first of a series of actions to ensure every dataset is compatible in the context of the Platform-gentropy integration.

DSuveges commented 2 days ago

I think something like this should happen in the ingestion part:

validated_studies = (
    StudyIndex.from_parquet(session, study_paths)
    .validate_target(gene_index)
    .validate_disease(disease_index)
)

If any validation step fails, a flag indicating which test failed is added to the study's qualityControls column. Downstream, the studies that passed and those that failed (i.e., carry any quality control flag) are split.
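
To make the flagging behaviour concrete, here is a minimal sketch in plain Python, with dictionaries standing in for study index rows. It is not the gentropy implementation: the standalone `validate_target` function, the flag text, and the example identifiers are all hypothetical, and it assumes a study without a `geneId` has nothing to validate.

```python
def validate_target(studies, gene_ids):
    """Return copies of the studies, flagging those whose geneId is not in gene_ids."""
    validated = []
    for study in studies:
        # Copy existing flags so the input records are never mutated.
        flags = list(study.get("qualityControls", []))
        gene_id = study.get("geneId")
        # Assumption: a missing geneId means there is no target to validate.
        if gene_id is not None and gene_id not in gene_ids:
            flags.append("Target not found in the gene index")  # hypothetical flag text
        validated.append({**study, "qualityControls": flags})
    return validated


# Illustrative records only.
studies = [
    {"studyId": "S1", "geneId": "ENSG00000000001"},
    {"studyId": "S2", "geneId": "ENSG_UNKNOWN"},
    {"studyId": "S3"},
]
flagged = validate_target(studies, gene_ids={"ENSG00000000001"})
```

Because each step only appends to `qualityControls`, several validators can be chained and a study keeps the flags from every test it failed.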

A related question to answer:

What happens inside the methods is an open question, given the nested nature of the disease objects.
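
One reading of the nesting problem is that each study carries a list of disease identifiers rather than a single value, so validation has to work per element. The plain-Python sketch below assumes that shape. The field name `diseaseIds`, the flag text, and the choice to drop unresolved identifiers instead of rejecting the whole study are all assumptions for illustration.

```python
def validate_disease(study, disease_ids):
    """Keep only disease IDs present in disease_ids; flag the study if any were dropped."""
    mapped = study.get("diseaseIds", [])  # hypothetical nested field
    valid = [d for d in mapped if d in disease_ids]
    flags = list(study.get("qualityControls", []))
    if len(valid) < len(mapped):
        flags.append("Unresolved disease identifier")  # hypothetical flag text
    return {**study, "diseaseIds": valid, "qualityControls": flags}


# Illustrative identifiers only.
disease_index = {"EFO_0000001", "EFO_0000002"}
study = {"studyId": "S1", "diseaseIds": ["EFO_0000001", "EFO_9999999"]}
checked = validate_disease(study, disease_index)
```

In Spark, the same idea would likely need the array to be exploded, joined against the disease index, and re-aggregated per study.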

d0choa commented 1 day ago

On your snippet, you might want to create an invalid dataset as well.
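
A minimal sketch of that split, again in plain Python with dictionaries standing in for rows: any study with a non-empty `qualityControls` list goes to the invalid set. The function name `split_by_quality` and the example records are hypothetical.

```python
def split_by_quality(studies):
    """Partition studies into (valid, invalid) based on whether any QC flag is present."""
    valid = [s for s in studies if not s.get("qualityControls")]
    invalid = [s for s in studies if s.get("qualityControls")]
    return valid, invalid


# Illustrative records only.
studies = [
    {"studyId": "S1", "qualityControls": []},
    {"studyId": "S2", "qualityControls": ["Target not found in the gene index"]},
    {"studyId": "S3"},
]
valid_studies, invalid_studies = split_by_quality(studies)
```

Writing the invalid set out next to the valid one keeps the rejected records and their flags available for inspection.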

Code reference for how platform-etl does it, in case it's useful for inspiration.

Regarding geneId and targetId: in an ideal world, I agree we should probably use targetFromSourceId and targetFromSourceMappedId for consistency. The API will turn targetFromSourceMappedId into a fully fledged target object.

The nesting of disease is really annoying