Include cancer biomarkers as an evidence data source

ireneisdoomed commented 3 years ago

This ticket tracks the whole discussion on how the data has been modelled and parsed to be part of our target-disease evidence data sources as of the 21.09 release.

I'm copying here all the comments issued in the current PR (#89).

This PR processes the Cancer Biomarkers database available at gs://otar000-evidence_input/CancerBiomarkers/data_files. The proposed schema has been tracked and can be observed in this spreadsheet: https://docs.google.com/spreadsheets/d/1Mowq7KsGTMtEg3wZpJBNK_UbawHKJeM9d0syT9F9AMc/edit#gid=613866016

datasourceId

[X] cancer_genome_interpreter, in favour of cancer_biomarkers. The latter describes the nature of the data rather than its source.

datatypeId

[X] affected_pathway. The data describes how the gene-drug interactions are altered when one or multiple variants are present.

diseaseFromSource

[X] Split tumor types by ';' and explode

diseaseFromSourceMappedId

[X] Extract EFOs by joining the tumor type with the disease mapping dataset
[ ] This will eventually come via OnToma. The manual curation that cannot be automated will go into the manual disease mapping file, which will be another input to OnToma.
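
A minimal sketch of the split-and-map step in plain Python, in case it helps the discussion. The function name and the dict-based lookup are hypothetical stand-ins for the join against the disease mapping dataset; the one mapping used in the usage note is taken from the example evidence in this thread.

```python
from typing import Dict, List


def explode_and_map_diseases(tumor_types: str, mapping: Dict[str, str]) -> List[dict]:
    """Return one row per tumor type, with its EFO ID when the lookup has one."""
    rows = []
    for name in tumor_types.split(";"):
        name = name.strip()
        if not name:
            # Skip empty fragments left by trailing or doubled separators.
            continue
        rows.append(
            {
                "diseaseFromSource": name,
                # Unmapped tumor types keep a null ID instead of being dropped.
                "diseaseFromSourceMappedId": mapping.get(name),
            }
        )
    return rows
```

For instance, calling it with "Breast adenocarcinoma;Some unmapped tumor" and a lookup holding only Breast adenocarcinoma -> EFO_0000304 gives two rows, the second with a null mapped ID.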

drugFromSource

This can be found in Drug (if the specific drug is reported) or DrugFullName (mainly the drug full name)
[X] Coalesce both fields
[X] Split Drug by ';' and explode
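
The two steps above could look roughly like this in plain Python. This is only a sketch: the function name and the dict-based row are hypothetical, and it assumes the coalesce happens before the split, which the checklist does not state explicitly.

```python
from typing import List


def explode_drugs(row: dict) -> List[dict]:
    """Return one copy of `row` per drug named in the record."""
    # Coalesce: prefer the specific Drug, fall back to DrugFullName.
    raw = row.get("Drug") or row.get("DrugFullName") or ""
    # Split on ';' and drop empty or whitespace-only fragments.
    names = [name.strip() for name in raw.split(";") if name.strip()]
    if not names:
        # Keep the record even when no drug name is available.
        return [{**row, "drugFromSource": None}]
    return [{**row, "drugFromSource": name} for name in names]
```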

drugId

[X] When the specific Drug is given and the mapping is straightforward, this field is joined with the latest drug index

drugResponse

Possible values: 'Resistant', 'Not Responsive', 'Increased Toxicity', 'Increased Toxicity (Ototoxicity)', 'Increased Toxicity (Myelosupression)', 'Responsive', 'Increased Toxicity (Haemolytic Anemia)', 'Increased Toxicity (Hyperbilirubinemia)'
[X] Mapped to EFO.
[x] Open an issue with EFO to: a) create terms in EFO for 'Resistant', 'Not Responsive' and 'Increased Toxicity'; b) have these terms grouped under biological process to describe responses to drugs
[X] This field is at the root level, not inside biomarkers. The described association is dependent on the presence of the biomarker, but the other relevant entities are also in the root so the relationship between them can be interpreted without the information encapsulated in biomarkers.

confidence

Possible values:

+-------------------------------+-----+
|EvidenceLevel                  |count|
+-------------------------------+-----+
|Pre-clinical                   |520  |
|Early trials                   |298  |
|Case report                    |267  |
|FDA guidelines                 |128  |
|European LeukemiaNet guidelines|106  |
|NCCN guidelines                |64   |
|Late trials                    |47   |
|CPIC guidelines                |4    |
|Clinical trials                |2    |
|NCCN/CAP guidelines            |2    |
|Early Trials,Case Report       |2    |
|Late trials,Pre-clinical       |2    |
+-------------------------------+-----+

[X] These will just be displayed but will not play a role for the scoring, as we know this dataset is the result of manual curation

literature

[X] PubMed IDs are the source for 75% of the records.
[X] Split by ';' and collected into an array.
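
A rough sketch of collecting the PubMed IDs into an array. It assumes the source column mixes tokens such as "PMID:27172896" with non-PubMed references; the optional "PMID:" prefix and the function name are assumptions, not something confirmed in this thread.

```python
import re

# Assumed token shape: an optional "PMID:" prefix followed by digits only.
PMID_PATTERN = re.compile(r"^(?:PMID:)?(\d+)$")


def collect_pmids(source):
    """Split a ';'-separated source string and keep only the PubMed IDs."""
    if not source:
        return []
    pmids = []
    for token in source.split(";"):
        match = PMID_PATTERN.match(token.strip())
        if match:
            pmids.append(match.group(1))
    return pmids
```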

urls

To source the other 25% without a PMID.

[X] Extraction of the sources dataset where the nice names and urls are gathered.
[X] Build a different structure whenever the source is a clinical trial (CT): niceName is set to 'Clinical Trials' and url is built from the NCT code.
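
For the clinical trial case, something along these lines could work. Treat it as a sketch: the URL template is an assumption (the thread only says the url is built from the NCT code), and the function name is hypothetical. NCT identifiers are "NCT" followed by eight digits.

```python
import re

NCT_PATTERN = re.compile(r"NCT\d{8}")
# Assumed URL layout; adjust to whatever the parser actually emits.
CT_URL_TEMPLATE = "https://clinicaltrials.gov/ct2/show/{nct}"


def clinical_trial_urls(source):
    """Build one {niceName, url} struct per NCT code found in the source string."""
    return [
        {"niceName": "Clinical Trials", "url": CT_URL_TEMPLATE.format(nct=code)}
        for code in NCT_PATTERN.findall(source or "")
    ]
```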

targetFromSourceId

Every time that a biomarker consists of multiple variants that are not independent of each other, the biomarker is reported separating them with a '+'. When this situation happens, genes will be described under Gene separated by ';'. As we can only build evidence with a single target, these are separated into different evidence strings but the biomarker will reference both of them. These cases account for 27 distinct biomarkers.

[X] Split the Gene column by ';' and explode.
[X] Correct some gene names to the official symbol
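
To illustrate the split: one record with two genes becomes two evidence strings that share the same biomarker name. This is a sketch only; the "Biomarker" column name and the function name are assumptions, and the gene symbol corrections are left out because the thread does not list them.

```python
from typing import List


def explode_targets(record: dict) -> List[dict]:
    """Return one evidence stub per gene, all referencing the full biomarker."""
    genes = [gene.strip() for gene in record["Gene"].split(";") if gene.strip()]
    return [
        {
            # The biomarker name is kept whole, so each target still
            # references every variant that makes up the biomarker.
            "biomarkerName": record["Biomarker"],
            "targetFromSourceId": gene,
        }
        for gene in genes
    ]
```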

biomarkers

Array of structs that will capture dependent and independent variants, as well as secondary fields to describe the mutation. The proposed fields are encapsulated in a struct so that a conceptual difference can be made when analysing data: variantId refers to a disease causing variant, whereas biomarkers.variantId adds the nuance of the biomarker having to be present for the association to occur.

biomarkers.name

[X] Name of the whole string that describes the biomarker - including all variants

biomarkers.individualMutation

[X] When multiple mutations are reported, this field indicates which variant the record points to.
This field is populated only when the functional consequence of the variant is MUT.

biomarkers.variantFunctionalConsequenceId

Possible values: 'MUT', 'CNA', 'FUS', 'EXPR', 'BIA'.
[X] Map the types of alteration to an SO code
MUT: somatic mutations --> SO_0001777
CNA: copy number alterations --> SO_0001563
FUS: fusion genes --> SO_0001882
EXPR: mRNA expression --> SO_0001540
BIA: biallelic inactivation --> not yet a proper term in SO. I've asked for its inclusion.
[ ] Open ticket to SO to find or create a term for BIA.
[X] This field will be under biomarkers, as there are cases where multiple variants are reported, each with a different consequence.
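
The mapping listed above can be written as a simple lookup. This is a sketch, with IDs written with underscores as in the evidence strings; BIA maps to None because no SO term exists yet, and raising on unknown types is a design choice of this example, not necessarily what the parser does.

```python
# Alteration type -> Sequence Ontology ID, as listed in this comment.
ALTERATION_TO_SO = {
    "MUT": "SO_0001777",   # somatic mutations
    "CNA": "SO_0001563",   # copy number alterations
    "FUS": "SO_0001882",   # fusion genes
    "EXPR": "SO_0001540",  # mRNA expression
    "BIA": None,           # biallelic inactivation: no SO term yet
}


def functional_consequence_id(alteration_type: str):
    """Return the SO ID for an alteration type, or None when no term exists."""
    key = alteration_type.strip().upper()
    if key not in ALTERATION_TO_SO:
        # Fail loudly so new alteration types are not silently dropped.
        raise ValueError(f"Unknown alteration type: {alteration_type!r}")
    return ALTERATION_TO_SO[key]
```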

biomarkers.variantId

[x] Convert genomic coordinates to CHROM_POS_REF_ALT notation
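
A sketch of the conversion for the simplest case. It assumes the coordinates arrive as HGVS-like genomic strings such as "chr7:g.140453136A>T" (an assumption about the input format) and only handles single-nucleotide substitutions; anything else returns None so it can be reviewed rather than emitted as an invalid variantId.

```python
import re

# Assumed input shape: chr<chrom>:g.<pos><ref>><alt>, single-base substitutions only.
GDNA_SNV = re.compile(r"^chr(\w+):g\.(\d+)([ACGT])>([ACGT])$")


def to_variant_id(gdna: str):
    """Convert a genomic substitution to CHROM_POS_REF_ALT, or None if unsupported."""
    match = GDNA_SNV.match(gdna.strip())
    if match is None:
        return None
    chrom, pos, ref, alt = match.groups()
    return f"{chrom}_{pos}_{ref}_{alt}"
```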

biomarkers.variantRsId

[ ] Investigate the use of the upcoming tool PepVEP to map the variantId to rsIDs

biomarkers.variantAminoacidDescriptions

[ ] Investigate the use of the upcoming tool PepVEP to extract the changes at the protein level
NOTE: We would want to run this for all variants, not only cancer biomarkers

ireneisdoomed commented 3 years ago

New iteration

New changes to the parser to adapt to the latest schema (v3):

[X] biomarkerName is now in root
[X] biomarkers is now a struct of two arrays: variant, geneExpression
[X] biomarkers.variant.functionalConsequenceId is now disambiguated in cases where multiple functional consequences and multiple alterations were reported for a single biomarker.
[X] Cases where multiple changes to gene expression were reported are now handled: they are split into separate elements of the biomarkers.geneExpression array.
[X] The names of the individual alterations have been cleaned.

TO-DOs

[x] Some evidence strings currently do not pass validation due to some inaccuracies in the parsing of the variantId.
[x] One drugFromSource value is a single unsplit string: [tamoxifen,letrozole,anastrozole,exemestane,fulvestrant,lhrh Analogues Or Antagonist]. I should clean it and explode it.
[x] There are 6 evidence strings in which confidence contains two levels of confidence. Split these by comma and explode the evidence strings.
[X] biomarkers is at the moment built solely from biomarkerName. That is, if 2 different alterations are given in the same biomarker, biomarkers will describe both alterations.
This should be the expected behaviour, although there's the situation where the alterations are found in 2 genes. In that case for target A, we will have biomarker A and biomarker B (and the same goes for target B). @Dsuveges, @tskir what do you think?
My feeling is that this approach is not wrong as the field is biomarker specific, although I'm concerned reporting 2 targets might bring ambiguity.

An example:

{
    "biomarkerName": "ARID1A amplification + ANXA1 overexpression",
    "biomarkers": {
        "geneExpression": [
            {
                "id": "GO_0010628",
                "name": "ANXA1:over"
            }
        ],
        "variant": [
            {
                "functionalConsequenceId": "SO_0001563",
                "name": "ARID1A:amp"
            }
        ]
    },
    "confidence": "Early trials",
    "datasourceId": "cancer_genome_interpreter",
    "datatypeId": "affected_pathway",
    "diseaseFromSource": "Breast adenocarcinoma",
    "diseaseFromSourceMappedId": "EFO_0000304",
    "drugFromSource": "Trastuzumab",
    "drugId": "CHEMBL1201585",
    "literature": [
        "27172896"
    ],
    "targetFromSourceId": "ARID1A"
},
{
    "biomarkerName": "ARID1A amplification + ANXA1 overexpression",
    "biomarkers": {
        "geneExpression": [
            {
                "id": "GO_0010628",
                "name": "ANXA1:over"
            }
        ],
        "variant": [
            {
                "functionalConsequenceId": "SO_0001563",
                "name": "ARID1A:amp"
            }
        ]
    },
    "confidence": "Early trials",
    "datasourceId": "cancer_genome_interpreter",
    "datatypeId": "affected_pathway",
    "diseaseFromSource": "Breast adenocarcinoma",
    "diseaseFromSourceMappedId": "EFO_0000304",
    "drugFromSource": "Trastuzumab",
    "drugId": "CHEMBL1201585",
    "literature": [
        "27172896"
    ],
    "targetFromSourceId": "ANXA1"
}

ireneisdoomed commented 3 years ago

The evidence file can be found at gs://otar000-evidence_input/CancerBiomarkers/json

DSuveges commented 3 years ago

I think your example shows a proper representation of the biomarker.

ireneisdoomed commented 3 years ago

Drug responses mapping to EFO scoped for 21.11 (#1746).

ireneisdoomed commented 2 years ago

Work completed and included in 21.11
