As a developer I want to be able to use an arbitrary Gold Standard that follows the L2GGoldStandard schema because I want to stress test the L2G model with different training sets.
Background
Currently, the LocusToGene step can only be trained with the OTG curation gold standard (gs://genetics_etl_python_playground/input/l2g/gold_standard/curation.json), which is in ndjson format. Each record follows the schema shown below:
```
{
  "association_info": {
    "ancestry": [
      "EUR"
    ],
    "gwas_catalog_id": "GCST000324",
    "otg_id": "GCST000324_3",
    "neg_log_pval": 23.699,
    "pubmed_id": "19185284"
  },
  "gold_standard_info": {
    "evidence": [
      {
        "class": "expert curated",
        "confidence": "High",
        "curated_by": "Eric Fauman",
        "description": "BCO1 (previously referred to as BCMO1) encodes beta-carotene oxygenase 1 which uses a molecule of oxygen to produce two molecules of retinol from beta-carotene. Enzyme deficiency results in accumulation of beta-carotene.",
        "pubmed_id": "11401432"
      }
    ],
    "gene_id": "ENSG00000135697",
    "highest_confidence": "High"
  },
  "metadata": {
    "date_added": "2019-05-17",
    "reviewed_by": "Ed Mountjoy",
    "set_label": "ProGeM",
    "submitted_by": "Eric Fauman",
    "tags": [
      "metabolite",
      "mQTL"
    ]
  },
  "sentinel_variant": {
    "alleles": {
      "alternative": "G",
      "reference": "T"
    },
    "locus_GRCh37": {
      "chromosome": "16",
      "position": 81264597
    },
    "rsid": "rs6564851",
    "locus_GRCh38": {
      "chromosome": "16",
      "position": 81230992
    }
  },
  "trait_info": {
    "ontology": [
      "HMDB0000561"
    ],
    "reported_trait_name": "Carotenoid and tocopherol levels (beta-carotene)",
    "standard_trait_name": "B-Carotene"
  }
}
```
We use the following gold standard fields for parsing:
[x] sentinel_variant -> to obtain the variantId(s) that will be used to infer the correct credible sets
[x] studyId
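The parsing of these fields could look roughly like the sketch below. It is only an illustration: the `chromosome_position_reference_alternative` layout of variantId, the use of the GRCh38 locus, and taking studyId from `association_info.gwas_catalog_id` are assumptions, and `parse_curation_record` is a hypothetical helper, not the existing parser.

```python
import json

# One curation record, trimmed to the fields used below (values from the example above).
record_line = json.dumps({
    "association_info": {"gwas_catalog_id": "GCST000324"},
    "gold_standard_info": {"gene_id": "ENSG00000135697"},
    "sentinel_variant": {
        "alleles": {"alternative": "G", "reference": "T"},
        "locus_GRCh38": {"chromosome": "16", "position": 81230992},
    },
})


def parse_curation_record(line: str) -> dict:
    """Extract variant, study and gene identifiers from one ndjson line."""
    record = json.loads(line)
    variant = record["sentinel_variant"]
    locus = variant["locus_GRCh38"]
    alleles = variant["alleles"]
    # Assumption: variantId is chromosome_position_reference_alternative.
    variant_id = "_".join(
        [
            locus["chromosome"],
            str(locus["position"]),
            alleles["reference"],
            alleles["alternative"],
        ]
    )
    return {
        "variantId": variant_id,
        # Assumption: studyId comes from association_info.gwas_catalog_id.
        "studyId": record["association_info"]["gwas_catalog_id"],
        "geneId": record["gold_standard_info"]["gene_id"],
    }


parsed = parse_curation_record(record_line)
print(parsed)
```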
L2GGoldStandard preparation from OTG curation gold standard
To use a gold standard in the training mode of LocusToGeneStep, it has to follow the L2GGoldStandard schema.
The OTG curation contains the primary class gold_standard_positive (class field); the gold_standard_negatives class is not represented there. Existing steps parse the OTG curation gold standard into the L2GGoldStandard schema and generate the negatives dynamically.
To use an arbitrary gold standard in the training mode of LocusToGeneStep, the training set has to satisfy two conditions:
[x] should contain both gold_standard_positive and gold_standard_negative classes
[x] should follow the L2GGoldStandard schema
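A check for these two conditions might be sketched as follows. The column names in `REQUIRED_COLUMNS` (including `goldStandardSet` as the column holding the class) are hypothetical placeholders, since the real L2GGoldStandard schema is not reproduced in this issue, and `validate_gold_standard` is an illustrative helper rather than existing code.

```python
from __future__ import annotations

# Hypothetical column names; the actual L2GGoldStandard schema may differ.
REQUIRED_COLUMNS = {"variantId", "studyId", "geneId", "goldStandardSet"}
REQUIRED_CLASSES = {"gold_standard_positive", "gold_standard_negative"}


def validate_gold_standard(rows: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means both conditions hold."""
    problems = []
    # Condition 2: every row exposes the expected columns.
    for index, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            problems.append(f"row {index}: missing columns {sorted(missing)}")
    # Condition 1: both classes are present somewhere in the set.
    seen = {row.get("goldStandardSet") for row in rows}
    absent = REQUIRED_CLASSES - seen
    if absent:
        problems.append(f"missing classes {sorted(absent)}")
    return problems


positives_only = [
    {
        "variantId": "16_81230992_T_G",
        "studyId": "GCST000324",
        "geneId": "ENSG00000135697",
        "goldStandardSet": "gold_standard_positive",
    }
]
# A made-up negative row, added purely to exercise the check.
with_negative = positives_only + [
    {**positives_only[0], "geneId": "GENE_B", "goldStandardSet": "gold_standard_negative"}
]

print(validate_gold_standard(positives_only))
print(validate_gold_standard(with_negative))
```

Returning a list of problems instead of raising on the first one makes it easier to report everything that is wrong with a new training set in a single run.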
Tasks
[x] Extract the logic that prepares the gold standard so it can take either the OTG Curation or any new arbitrary gold standard, depending on the input schema
[x] Allow reading a dataset in ndjson format (to read the arbitrary gold standard)
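As a rough sketch of these two tasks, the snippet below reads ndjson with the standard library and picks a preparation path from the keys of the first record. The function names and the key sets used to recognise each format are assumptions made for illustration (the L2GGoldStandard column names in particular are hypothetical); the real step may well read the file through Spark instead.

```python
import io
import json


def read_ndjson(handle) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


def detect_gold_standard_format(rows: list) -> str:
    """Guess the input format from the top-level keys of the first record."""
    if not rows:
        raise ValueError("gold standard is empty")
    keys = set(rows[0])
    # Top-level keys taken from the OTG curation example record.
    if {"sentinel_variant", "gold_standard_info"} <= keys:
        return "otg_curation"
    # Hypothetical L2GGoldStandard column names.
    if {"variantId", "geneId", "goldStandardSet"} <= keys:
        return "l2g_gold_standard"
    raise ValueError(f"unrecognised gold standard schema: {sorted(keys)}")


curation = io.StringIO(
    '{"sentinel_variant": {}, "gold_standard_info": {}, "association_info": {}}\n'
)
arbitrary = io.StringIO(
    '{"variantId": "16_81230992_T_G", "geneId": "ENSG00000135697", '
    '"goldStandardSet": "gold_standard_positive"}\n\n'
)

curation_format = detect_gold_standard_format(read_ndjson(curation))
arbitrary_format = detect_gold_standard_format(read_ndjson(arbitrary))
print(curation_format, arbitrary_format)
```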
Acceptance tests
[ ] successful LocusToGene training with OTG Curation after update
[x] successful LocusToGene training with arbitrary gold standard after update