Closed ireneisdoomed closed 2 years ago
New changes to the parser to adapt to the latest schema (`v3`):
- `biomarkerName` is now in root.
- `biomarkers` is now a struct of two arrays: `variant`, `geneExpression`.
- `biomarker.variant.functionalConsequenceId` has been disambiguated for cases where multiple functional consequences and multiple alterations were reported for a single biomarker.
- `biomarker.geneExpression` array.
- `[tamoxifen,letrozole,anastrozole,exemestane,fulvestrant,lhrh Analogues Or Antagonist]`. I should clean it and explode it.
- `confidence` contains two levels of confidence. Split these by comma and explode the evidence strings.
- `biomarkers` at the moment is built solely based on `biomarkerName`. That is, if 2 different alterations are given in the same biomarker, `biomarker` will describe those 2 alterations.

An example:
{
  "biomarkerName": "ARID1A amplification + ANXA1 overexpression",
  "biomarkers": {
    "geneExpression": [
      {
        "id": "GO_0010628",
        "name": "ANXA1:over"
      }
    ],
    "variant": [
      {
        "functionalConsequenceId": "SO_0001563",
        "name": "ARID1A:amp"
      }
    ]
  },
  "confidence": "Early trials",
  "datasourceId": "cancer_genome_interpreter",
  "datatypeId": "affected_pathway",
  "diseaseFromSource": "Breast adenocarcinoma",
  "diseaseFromSourceMappedId": "EFO_0000304",
  "drugFromSource": "Trastuzumab",
  "drugId": "CHEMBL1201585",
  "literature": [
    "27172896"
  ],
  "targetFromSourceId": "ARID1A"
},
{
  "biomarkerName": "ARID1A amplification + ANXA1 overexpression",
  "biomarkers": {
    "geneExpression": [
      {
        "id": "GO_0010628",
        "name": "ANXA1:over"
      }
    ],
    "variant": [
      {
        "functionalConsequenceId": "SO_0001563",
        "name": "ARID1A:amp"
      }
    ]
  },
  "confidence": "Early trials",
  "datasourceId": "cancer_genome_interpreter",
  "datatypeId": "affected_pathway",
  "diseaseFromSource": "Breast adenocarcinoma",
  "diseaseFromSourceMappedId": "EFO_0000304",
  "drugFromSource": "Trastuzumab",
  "drugId": "CHEMBL1201585",
  "literature": [
    "27172896"
  ],
  "targetFromSourceId": "ANXA1"
}
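The two pending clean-up steps listed before the example (cleaning the bracketed drug list and splitting the two-level `confidence` value) could look roughly like the sketch below. It uses plain Python dictionaries rather than the parser's actual code, and the function names `split_list_field` and `explode_evidence` are hypothetical, chosen only for illustration.

```python
from itertools import product
from typing import Dict, List


def split_list_field(raw: str, sep: str = ",") -> List[str]:
    """Drop optional surrounding brackets, split on `sep`, trim whitespace."""
    cleaned = raw.strip().strip("[]")
    return [item.strip() for item in cleaned.split(sep) if item.strip()]


def explode_evidence(record: Dict) -> List[Dict]:
    """Return one evidence dict per (drug, confidence) combination.

    Assumption: both fields are comma-separated strings, and the drug
    field may be wrapped in square brackets as in the example above.
    """
    drugs = split_list_field(record["drugFromSource"])
    confidences = split_list_field(record["confidence"])
    return [
        {**record, "drugFromSource": drug, "confidence": confidence}
        for drug, confidence in product(drugs, confidences)
    ]
```

A record with a single drug and a single confidence level passes through unchanged as a one-element list, so the same function can be applied to every evidence string.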
The evidence file can be found at gs://otar000-evidence_input/CancerBiomarkers/json
I think your example shows a proper representation of the biomarker.
Drug responses mapping to EFO scoped for 21.11 (#1746).
Work completed and included in 21.11
This ticket tracks the whole discussion on how the data has been modelled and parsed to be part of our target-disease evidence data sources as of the 21.09 release.
I'm copying here all the comments issued in the current PR (#89).
This PR processes the Cancer Biomarkers database available at `gs://otar000-evidence_input/CancerBiomarkers/data_files`. The proposed schema has been tracked and can be observed in this spreadsheet: https://docs.google.com/spreadsheets/d/1Mowq7KsGTMtEg3wZpJBNK_UbawHKJeM9d0syT9F9AMc/edit#gid=613866016

- `datasourceId`: `cancer_genome_interpreter` in favour of `cancer_biomarkers`. The latter rather describes the nature of the data.
- `datatypeId`: `affected_pathway`. The data describes how the gene-drug interactions are altered when one or multiple variants are present.
- `diseaseFromSource`
- `diseaseFromSourceMappedId`
- `drugFromSource`: `Drug` (if the specific drug is reported) or `DrugFullName` (mainly the drug full name). Split `Drug` by ';' and explode.
- `drugId`: When `Drug` is given and the mapping is straightforward, this field is joined with the latest drug index.
- `drugResponse`: `biological process` to describe responses to drug. `biomarkers`. The described association is dependent on the presence of the biomarker, but the other relevant entities are also in the root, so the relationship between them can be interpreted without the information encapsulated in `biomarkers`.
- `confidence`
- `literature`
- `urls`: To source the other 25% without a PMID. `niceName` will be "Clinical Trials" and `url` is built using the NCT code.
- `targetFromSourceId`: Every time a biomarker consists of multiple variants that are not independent of each other, the biomarker is reported separating them with a '+'. When this happens, the genes are described under `Gene`, separated by ';'. As we can only build evidence with a single target, these are separated into different evidence strings, but the biomarker will reference both of them. These cases account for 27 distinct biomarkers.
- `biomarkers`: Array of structs that will capture dependent and independent variants, as well as secondary fields to describe the mutation. The proposed fields are encapsulated in a struct so that a conceptual difference can be made when analysing data: `variantId` refers to a disease-causing variant, whereas `biomarkers.variantId` adds the nuance of the biomarker having to be present for the association to occur.
  - `biomarkers.name`
  - `biomarkers.individualMutation`
  - `biomarkers.variantFunctionalConsequenceId`: `BIA`. `biomarkers`, as there are cases where multiple variants are reported, each with a different consequence.
  - `biomarkers.variantId`
  - `biomarkers.variantRsId`
    - [ ] Investigate the use of the upcoming tool PepVEP to map the variantId to rsIDs
  - `biomarkers.variantAminoacidDescriptions`
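
The rules for `targetFromSourceId`, `drugFromSource` and `urls` described in the field list can be illustrated with a small sketch in plain Python. This is not the PR's implementation: the input keys `biomarker_name` and `nct_codes`, the function names, and the ClinicalTrials.gov URL pattern are all assumptions made for the example; only `Gene` and `Drug` are column names mentioned above.

```python
from typing import Dict, List

# Assumed URL pattern for ClinicalTrials.gov records; not confirmed by the PR.
CT_URL = "https://clinicaltrials.gov/ct2/show/{nct}"


def split_field(raw: str, sep: str = ";") -> List[str]:
    """Split a delimited source field and drop empty items."""
    return [item.strip() for item in raw.split(sep) if item.strip()]


def build_urls(nct_codes: List[str]) -> List[Dict[str, str]]:
    """Build one `urls` entry per NCT code, with a fixed niceName."""
    return [
        {"niceName": "Clinical Trials", "url": CT_URL.format(nct=code)}
        for code in nct_codes
    ]


def row_to_evidence(row: Dict[str, str]) -> List[Dict]:
    """Emit one evidence dict per (target, drug) pair.

    Every emitted dict keeps the full biomarker name, so a '+' biomarker
    such as "ARID1A amplification + ANXA1 overexpression" is referenced
    by each of the single-target evidence strings.
    """
    urls = build_urls(split_field(row.get("nct_codes", "")))
    evidence = []
    for target in split_field(row["Gene"]):
        for drug in split_field(row["Drug"]):
            record = {
                "targetFromSourceId": target,
                "drugFromSource": drug,
                "biomarkerName": row["biomarker_name"],
            }
            if urls:  # only attach `urls` when an NCT code is available
                record["urls"] = urls
            evidence.append(record)
    return evidence
```

Called on a row whose `Gene` is `ARID1A;ANXA1`, this yields two evidence dicts that differ only in `targetFromSourceId`, which matches the pair of evidence strings shown in the JSON example earlier in the thread.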