opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Ingest disease/target evidence build from L2G prediction #3600

Closed DSuveges closed 3 weeks ago

DSuveges commented 3 weeks ago

An L2G prediction establishes a connection between a credible set and a target, with a corresponding L2G score. Because the credible set is interpreted in the context of the studied disease, we can convert this information into disease/target evidence by enriching the L2G prediction with the disease from the study index. This logic has been added to the genetics ETL, and the evidence dataset is written as a partitioned JSON dataset.
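
The enrichment described above can be sketched in plain Python. This is only an illustration and not the genetics ETL code: the input structures, the helper name `build_evidence`, the identifiers, and the input field names `geneId`, `score` and `studyId` are all hypothetical. Dropping predictions whose study has no mapped disease is also an assumption. Only the output field names and the `datatypeId`/`datasourceId` values are taken from this issue.

```python
# Hypothetical inputs: all identifiers below are made up for illustration.
l2g_predictions = [
    {"studyLocusId": "locus_1", "geneId": "GENE_A", "score": 0.8},
    {"studyLocusId": "locus_2", "geneId": "GENE_B", "score": 0.1},
]
# Assumed lookup: credible set (studyLocusId) -> study identifier.
locus_to_study = {"locus_1": "study_a", "locus_2": "study_b"}
# Assumed lookup: study identifier -> diseases from the study index.
study_to_diseases = {"study_a": ["DISEASE_1", "DISEASE_2"], "study_b": []}


def build_evidence(predictions, locus_to_study, study_to_diseases):
    """Join L2G predictions with study diseases into evidence records."""
    evidence = []
    for prediction in predictions:
        study_id = locus_to_study.get(prediction["studyLocusId"])
        # One evidence record per (prediction, disease) pair; predictions
        # without a mapped disease yield nothing (an assumption).
        for disease_id in study_to_diseases.get(study_id, []):
            evidence.append(
                {
                    "datatypeId": "genetic_association",
                    "datasourceId": "gwas_credible_sets",
                    "targetFromSourceId": prediction["geneId"],
                    "diseaseFromSourceMappedId": disease_id,
                    "resourceScore": prediction["score"],
                    "studyLocusId": prediction["studyLocusId"],
                }
            )
    return evidence


evidence = build_evidence(l2g_predictions, locus_to_study, study_to_diseases)
```

In this sketch a study annotated with several diseases produces one evidence record per disease, which is one plausible reading of "enriching the L2G prediction with the disease from the study index".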

The schema of the dataset looks like this:

root
 |-- datatypeId: string (nullable = true)
 |-- datasourceId: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- resourceScore: double (nullable = true)
 |-- studyLocusId: string (nullable = true)

None of these values is ever expected to be null, contrary to the nullability flag. The uncompressed dataset is ~443 MB in size and contains ~2M evidence strings. This number is not expected to grow significantly in the future.

One example piece of evidence:

{
  "datatypeId": "genetic_association",
  "datasourceId": "gwas_credible_sets",
  "targetFromSourceId": "ENSG00000231721",
  "diseaseFromSourceMappedId": "EFO_0004612",
  "resourceScore": 0.05567099619014385,
  "studyLocusId": "-1051305575299142413"
}
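
Since the schema marks every field as nullable although no nulls are expected, a consumer may want to check records explicitly. Below is a minimal sketch of such a check, applied to the example above. The helper name `validate_evidence` and the Python-side type mapping are assumptions, not part of any Open Targets tooling.

```python
import json

# Field names come from the schema in this issue; mapping Spark's
# string/double to Python types is an assumption of this sketch.
EXPECTED_TYPES = {
    "datatypeId": (str,),
    "datasourceId": (str,),
    "targetFromSourceId": (str,),
    "diseaseFromSourceMappedId": (str,),
    "resourceScore": (float, int),
    "studyLocusId": (str,),
}


def validate_evidence(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field, allowed in EXPECTED_TYPES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field} is missing or null")
        elif not isinstance(value, allowed):
            problems.append(f"{field} has unexpected type {type(value).__name__}")
    return problems


example = json.loads("""
{
  "datatypeId": "genetic_association",
  "datasourceId": "gwas_credible_sets",
  "targetFromSourceId": "ENSG00000231721",
  "diseaseFromSourceMappedId": "EFO_0004612",
  "resourceScore": 0.05567099619014385,
  "studyLocusId": "-1051305575299142413"
}
""")

problems = validate_evidence(example)

# A copy with a null studyLocusId should be flagged.
broken = dict(example, studyLocusId=None)
broken_problems = validate_evidence(broken)
```

Note that `studyLocusId` is a string even though its value looks numeric, so the check treats an integer in that field as a type problem.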

Important: the field studyLocusId has not been part of the evidence schema so far!

Important: the configuration of the new datasource needs to be included in the platform ETL config (see PR).

DSuveges commented 3 weeks ago

Regarding the fields: