Ingest disease/target evidence built from L2G prediction

An L2G prediction links a credible set to a target and assigns the pair an L2G score. Because a credible set is interpreted in the context of the studied disease, the prediction can be converted into disease/target evidence by enriching it with the disease from the study index. This logic has been added to the genetics ETL, and the evidence dataset is written as a partitioned JSON dataset.
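To make the enrichment step concrete, here is a minimal sketch in plain Python. It is not the genetics ETL code. The input field names `geneId` and `score`, the two lookup tables, and the identifier `STUDY_1` are all assumptions made for illustration; only the output field names and the two constant values come from the schema and example in this issue.

```python
from typing import Dict, Iterable, Iterator, List


def l2g_to_evidence(
    predictions: Iterable[dict],
    study_by_locus: Dict[str, str],
    diseases_by_study: Dict[str, List[str]],
) -> Iterator[dict]:
    """Turn L2G predictions into disease/target evidence records.

    predictions: dicts with hypothetical keys studyLocusId, geneId, score.
    study_by_locus: hypothetical lookup from studyLocusId to a study id.
    diseases_by_study: hypothetical lookup from study id to disease ids,
        standing in for the study index.
    """
    for pred in predictions:
        study_id = study_by_locus.get(pred["studyLocusId"])
        # A prediction without a known study or disease yields no evidence,
        # so no output field is ever left null.
        for disease_id in diseases_by_study.get(study_id, []):
            yield {
                "datatypeId": "genetic_association",
                "datasourceId": "gwas_credible_sets",
                "targetFromSourceId": pred["geneId"],
                "diseaseFromSourceMappedId": disease_id,
                "resourceScore": pred["score"],
                "studyLocusId": pred["studyLocusId"],
            }


if __name__ == "__main__":
    example_predictions = [
        {
            "studyLocusId": "-1051305575299142413",
            "geneId": "ENSG00000231721",
            "score": 0.05567099619014385,
        }
    ]
    for evidence in l2g_to_evidence(
        example_predictions,
        {"-1051305575299142413": "STUDY_1"},  # STUDY_1 is a made-up study id
        {"STUDY_1": ["EFO_0004612"]},
    ):
        print(evidence)
```

In this sketch a study mapped to several diseases produces one evidence record per disease. Whether the ETL does the same is not stated here.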

The schema of the dataset looks like this:

root
 |-- datatypeId: string (nullable = true)
 |-- datasourceId: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- resourceScore: double (nullable = true)
 |-- studyLocusId: string (nullable = true)

None of these values is ever expected to be null, despite the nullable = true flags. The uncompressed dataset is ~443MB and contains ~2M evidence strings. This number is not expected to grow significantly in the future.
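Since the schema flags cannot enforce the non-null expectation, a consumer may want to check it explicitly. The sketch below assumes each partition file holds one JSON object per line; the function names are hypothetical and not part of any existing pipeline.

```python
import json
from typing import Iterable, Iterator, List, Tuple

# Field names taken from the schema above.
REQUIRED_FIELDS = (
    "datatypeId",
    "datasourceId",
    "targetFromSourceId",
    "diseaseFromSourceMappedId",
    "resourceScore",
    "studyLocusId",
)


def missing_fields(record: dict) -> List[str]:
    """Return the schema fields that are absent or null in one record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) is None]


def find_invalid_lines(lines: Iterable[str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (line_number, missing_fields) for every record with a gap.

    Line numbers start at 1; blank lines are counted but not parsed.
    """
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        missing = missing_fields(json.loads(line))
        if missing:
            yield number, missing
```

A possible usage is to open one partition file and pass the file object straight to `find_invalid_lines`, treating any yielded entry as a failed check.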

One example piece of evidence:

{
  "datatypeId": "genetic_association",
  "datasourceId": "gwas_credible_sets",
  "targetFromSourceId": "ENSG00000231721",
  "diseaseFromSourceMappedId": "EFO_0004612",
  "resourceScore": 0.05567099619014385,
  "studyLocusId": "-1051305575299142413"
}

Important: the field studyLocusId has not been part of the evidence schema until now!

Important: the configuration of the new datasource needs to be included in the platform ETL config (see PR).

Ingest disease/target evidence built from L2G prediction #3600