opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Arbitrary Gold Standards as input to L2G training #3625

Open project-defiant opened 1 week ago

project-defiant commented 1 week ago

As a developer, I want to use an arbitrary gold standard that follows the L2GGoldStandard schema, so that I can stress test the L2G model with different training sets.

Background

Currently, the LocusToGene step can only be trained with the OTG curation gold standard (gs://genetics_etl_python_playground/input/l2g/gold_standard/curation.json), which is in ndjson format. An example record showing its schema is given below:

```
{
  "association_info": {
    "ancestry": ["EUR"],
    "gwas_catalog_id": "GCST000324",
    "otg_id": "GCST000324_3",
    "neg_log_pval": 23.699,
    "pubmed_id": "19185284"
  },
  "gold_standard_info": {
    "evidence": [
      {
        "class": "expert curated",
        "confidence": "High",
        "curated_by": "Eric Fauman",
        "description": "BCO1 (previously referred to as BCMO1) encodes beta-carotene oxygenase 1 which uses a molecule of oxygen to produce two molecules of retinol from beta-carotene. Enzyme deficiency results in accumulation of beta-carotene.",
        "pubmed_id": "11401432"
      }
    ],
    "gene_id": "ENSG00000135697",
    "highest_confidence": "High"
  },
  "metadata": {
    "date_added": "2019-05-17",
    "reviewed_by": "Ed Mountjoy",
    "set_label": "ProGeM",
    "submitted_by": "Eric Fauman",
    "tags": ["metabolite", "mQTL"]
  },
  "sentinel_variant": {
    "alleles": {
      "alternative": "G",
      "reference": "T"
    },
    "locus_GRCh37": {
      "chromosome": "16",
      "position": 81264597
    },
    "rsid": "rs6564851",
    "locus_GRCh38": {
      "chromosome": "16",
      "position": 81230992
    }
  },
  "trait_info": {
    "ontology": ["HMDB0000561"],
    "reported_trait_name": "Carotenoid and tocopherol levels (beta-carotene)",
    "standard_trait_name": "B-Carotene"
  }
}
```
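To make the record structure concrete, here is a minimal sketch of reading one ndjson line and pulling out a few nested fields. It is not the parsing code used by the LocusToGene step. The function name `parse_curation_record`, the output keys, and the `chromosome_position_ref_alt` variant identifier format are assumptions made for illustration.

```python
import json

# One trimmed ndjson line, built from the example record above.
EXAMPLE_LINE = json.dumps({
    "association_info": {"otg_id": "GCST000324_3"},
    "gold_standard_info": {
        "evidence": [{"class": "expert curated", "confidence": "High"}],
        "gene_id": "ENSG00000135697",
        "highest_confidence": "High",
    },
    "sentinel_variant": {
        "alleles": {"alternative": "G", "reference": "T"},
        "locus_GRCh38": {"chromosome": "16", "position": 81230992},
    },
})


def parse_curation_record(line):
    """Flatten one curation record (hypothetical helper, not the pipeline code)."""
    record = json.loads(line)
    gold = record["gold_standard_info"]
    variant = record["sentinel_variant"]
    locus = variant["locus_GRCh38"]
    alleles = variant["alleles"]
    # Assumed identifier format: chromosome_position_reference_alternative.
    variant_id = "_".join([
        locus["chromosome"],
        str(locus["position"]),
        alleles["reference"],
        alleles["alternative"],
    ])
    return {
        "otg_id": record["association_info"]["otg_id"],
        "gene_id": gold["gene_id"],
        "variant_id": variant_id,
        "highest_confidence": gold["highest_confidence"],
        "evidence_classes": [e["class"] for e in gold["evidence"]],
    }


parsed = parse_curation_record(EXAMPLE_LINE)
print(parsed["variant_id"])  # 16_81230992_T_G
```

Because the file is ndjson, each line can be parsed independently, so the same helper could be mapped over the whole file line by line.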

We use the following gold standard fields for parsing:

L2GGoldStandard preparation from OTG curation gold standard

In order to use a gold standard in the training mode in LocusToGeneStep it has to follow the L2GGoldStandard schema.
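A schema requirement like this could be checked up front, before training starts. The sketch below is only an illustration of that idea: the field names in `ASSUMED_REQUIRED_FIELDS` are placeholders that are not taken from this issue, and the real list should come from the L2GGoldStandard schema definition. `missing_fields` is a hypothetical helper.

```python
# ASSUMPTION: placeholder field names for illustration only.
# Replace them with the fields defined by the actual L2GGoldStandard schema.
ASSUMED_REQUIRED_FIELDS = (
    "studyLocusId",
    "variantId",
    "studyId",
    "geneId",
    "goldStandardSet",
)


def missing_fields(record, required=ASSUMED_REQUIRED_FIELDS):
    """Return the required field names that are absent or null in a record."""
    return [name for name in required if record.get(name) is None]


complete = {
    "studyLocusId": "1",
    "variantId": "16_81230992_T_G",
    "studyId": "GCST000324_3",
    "geneId": "ENSG00000135697",
    "goldStandardSet": "positive",
}
incomplete = {"geneId": "ENSG00000135697", "variantId": None}

print(missing_fields(complete))
print(missing_fields(incomplete, required=("geneId", "variantId")))
```

Returning the list of offending fields, rather than a boolean, makes it easier to report exactly why an arbitrary gold standard was rejected.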

See the code that handles the L2GGoldStandard preparation from the OTG curation gold standard

The OTG curation contains the following primary classes of gold standard positives (the class field), with the number of occurrences of each:

1755 drug
 347 expert curated
  20 functional experimental
 345 functional observational

The gold_standard_negatives class is not represented in the OTG curation. Instead, the steps that parse the OTG curation gold standard into the L2GGoldStandard schema generate the negatives dynamically.
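A class tally like the one above can be produced from the ndjson file with a few lines of Python. This is a sketch under assumptions: `count_evidence_classes` is a hypothetical helper, and it counts every evidence entry, whereas the issue does not say whether its numbers are per record or per evidence entry. The two sample lines are toy data, not real curation records.

```python
import json
from collections import Counter


def count_evidence_classes(lines):
    """Count evidence classes across ndjson lines (one JSON record per line)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        record = json.loads(line)
        for evidence in record["gold_standard_info"]["evidence"]:
            counts[evidence["class"]] += 1
    return counts


# Toy input with the same nesting as the curation records.
sample_lines = [
    '{"gold_standard_info": {"evidence": [{"class": "drug"}]}}',
    '{"gold_standard_info": {"evidence": [{"class": "drug"}, {"class": "expert curated"}]}}',
]

class_counts = count_evidence_classes(sample_lines)
for name, count in sorted(class_counts.items()):
    print(count, name)
```

Running the same tally on an arbitrary gold standard would show quickly whether it contains only positives, in which case negatives still have to be generated before training.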

To use an arbitrary gold standard in the training mode of LocusToGeneStep, the arbitrary training set has to satisfy two conditions:

Tasks

Acceptance tests