ahwagner commented 6 years ago

Currently, 32% of our associations in 0.8 have no genomic coordinates assigned to at least one feature.

These fall into the following categories:

[ ] Associations harvested from CGI all lack end coordinates, despite ~50% having start coordinates
[ ] Associations describing protein changes and indels (e.g. BRCA2 D806H or KIT W557_K558del or BRAF V487_P492delinsA or KIT S501_A502dup) are >30% of missing features coordinates, and primarily come from OncoKB (1565 associations) and Jax CKB (668 associations). This should be handled by our COSMIC / allele registry lookups, might be a bug.
[ ] 7.2% of associations harvested from MolecularMatch have no feature_names. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).
[ ] Association feature_names of the format <gene> amplification or <gene> loss should have one corresponding feature of representative transcript coordinates.
[ ] Gene fusions (feature_names of the format <gene>-<gene> or <gene>:Fusions) should have two corresponding features of representative transcript coordinates, one from each constituent gene.
[ ] Bucket mutations (e.g. <gene> mutation, <gene> inact mut) should have one corresponding feature of representative transcript coordinates.
[ ] exon mutations (containing a numbered exon in the feature_names) should use the exon coordinates for the provided transcript. If no transcript is provided, use the exon coordinates from the representative transcript.

Following these rules, we go from 68.4% of associations with variant start/end coords to 98.8%.

Many of these items refer to a "representative" transcript. This is annotated in ensembl as a "canonical" transcript.

Tagging as review per discussion with @jgoecks.

@bwalsh would you provide an estimate of the effort needed to make these changes?

I uploaded my WIP analysis workbook to the genie-analysis branch: https://github.com/ohsu-comp-bio/g2p-aggregator/blob/genie-analysis/notebooks/knowledgebase_comparison.ipynb

See the Feature coordinate filtering section for regular expressions that can help in selecting features for annotating with start/end coordinates.

bwalsh commented 6 years ago

@ahwagner @jgoecks @mayfielg - can we discuss these changes during Tuesday's call?

ahwagner commented 6 years ago

@bwalsh sounds good.

bwalsh commented 6 years ago

Notes:

Associations harvested from CGI all lack end coordinates, despite ~50% having start coordinates

Example:

{
  "name": "ABL1:E255K",
  "links": [
    "http://reg.genome.network/refseq/RS000009",
    "http://myvariant.info/v1/variant/chr9:g.133738363G>A?assembly=hg19",
    "http://reg.genome.network/refseq/RS000033",
    "http://www.ncbi.nlm.nih.gov/clinvar/variation/376090",
    "http://reg.genome.network/refseq/RS000057",
    "http://reg.genome.network/refseq/RS002120",
    "http://myvariant.info/v1/variant/chr9:g.130862976G>A?assembly=hg38",
    "http://www.ncbi.nlm.nih.gov/snp/121913448",
    "http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=12573",
    "http://www.ncbi.nlm.nih.gov/clinvar/?term=362969[alleleid]",
    "http://reg.genome.network/allele/CA16602551"
  ],
  "start": 133738363,
  "synonyms": [
    "NC_000009.10:g.132728184G>A",
    "NG_012034.1:g.154096G>A",
    "NC_000009.11:g.133738363G>A",
    "chr9:g.133738363G>A",
    "NC_000009.12:g.130862976G>A",
    "COSM12573",
    "chr9:g.130862976G>A",
    "CM000671.1:g.133738363G>A",
    "CM000671.2:g.130862976G>A"
  ],
  "biomarker_type": "mutant",
  "referenceName": "GRCh37",
  "geneSymbol": "ABL1",
  "alt": "A",
  "ref": "G",
  "chromosome": "9",
  "description": "ABL1:T315A,F317L,F317V,F317I,F317C,F317I,Y253H,E255K,E255V,F359V,F359C,F359I"
}

Rule:

# pseudo code
if feature.start and feature.ref and not feature.end:
  feature.end = feature.start + len(feature.ref) - 1

Associations describing protein changes and indels (e.g. BRCA2 D806H or KIT W557_K558del or BRAF V487_P492delinsA or KIT S501_A502dup) are >30% of missing features coordinates, and primarily come from OncoKB (1565 associations) and Jax CKB (668 associations). This should be handled by our COSMIC / allele registry lookups, might be a bug.

Allele registry does not have a name lookup.
- see http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/landing
How to verify cosmic lookup?

bwalsh commented 6 years ago

from #88

Currently, searches of the type Gene Variant (e.g. BRAF V600E) do a global OR search between terms. Ideally, we could rename the features.name field to the mutation name (without the prepended gene symbol) and by default limit the search to a global AND condition.

Subtasks:

remove prepended gene symbol in features.name, propogate changes to feature_names.keyword. change search behavior to AND by default

bwalsh commented 6 years ago

Association feature_names of the format amplification or loss should have one corresponding feature of representative transcript coordinates.

Using mygene.info and ensembl, we can look up gene location transcriptions fairly easily. Issues:

which transcription(s) to use?
how does 'amplification' or 'loss' enter into the algorithm?

Example:

From cgi

{u'Targeting': u'', u'Alteration': u'AR:amp', u'Source': u'PMID:23589709;PMID:21859989', u'cDNA': u'', u'Primary Tumor type': u'Prostate adenocarcinoma', u'individual_mutation': u'', u'Drug full name': u'AR inhibitor next gens', u'Association': u'Responsive', u'Drug family': u'[AR inhibitor next gen]', u'Biomarker': u'AR amplification', u'Drug': u'[]', u'Curator': u'RDientsmann', u'gDNA': u'', u'Drug status': u'', u'Gene': u'AR', u'transcript': u'', u'strand': u'', u'info': u'', u'Assay type': u'', u'Alteration type': u'CNA', u'region': u'', u'Evidence level': u'Pre-clinical', u'Primary Tumor acronym': u'PRAD', u'Metastatic Tumor Type': u''}

Feature after current processing:

{u'geneSymbol': u'AR', u'referenceName': u'GRCh37', u'biomarker_type': u'amplification', u'name': u'', u'description': u'AR:amp'}

Experiment:

query mygene.info

$ curl -X GET "http://mygene.info/v3/query?q=AR&fields=HGNC%2Censembl.gene" -H "accept: application/json"
"max_score": 370.9837,
"took": 24,
"total": 937,
"hits": [
{
  "HGNC": "644",
  "_id": "367",
  "_score": 370.9837,
  "ensembl": {
    "gene": "ENSG00000169083"
  }
},
{
  "_id": "11835",
  "_score": 306.0449,
  "ensembl": {
    "gene": "ENSMUSG00000046532"
  }
},
....

using the ensembl id, formulate the location of the gene

$ curl -s 'http://grch37.rest.ensembl.org/lookup/id/ENSG00000169083?content-type=application/json;expand=1' | jq '{assembly_name: .assembly_name, chromosome: .seq_region_name, start: .start, end: .end  }'
{
"assembly_name": "GRCh37",
"chromosome": "X",
"start": 66764465,
"end": 66950461
}

Transcripts are also available

exactly how are these transcripts mapped to 'amplification' ?

$ curl -s 'http://grch37.rest.ensembl.org/lookup/id/ENSG00000169083?content-type=application/json;expand=1' | jq '.Transcript'
[
{
"source": "ensembl_havana",
"object_type": "Transcript",
"logic_name": "ensembl_havana_transcript",
"Exon": [
  {
    "object_type": "Exon",
    "version": 7,
    "species": "homo_sapiens",
    "assembly_name": "GRCh37",
    "end": 66766604,
    "seq_region_name": "X",
    "db_type": "core",
    "strand": 1,
    "id": "ENSE00001326500",
    "start": 66764465
  },
  {
    "object_type": "Exon",
    "version": 1,
    "species": "homo_sapiens",
    "assembly_name": "GRCh37",
    "end": 66863249,
    "seq_region_name": "X",
    "db_type": "core",
    "strand": 1,
    "id": "ENSE00003606052",
    "start": 66863098
  },
....

bwalsh commented 6 years ago

Associations describing protein changes

A quick test shows this was a legitimate miss.

$ grep ERBB2 cosmic_lookup_table.tsv | grep S783P | wc -l
       0

Reviewing all the 187 unique gene/protein items these are the only ones that have grep 'hits'

(FGFR3, Y375C)       1
(FGFR2, M538I)       2
(FGFR2, C383R)       2
(FGFR3, K652E)       1
(FGFR3, G372C)       1

grep example:

$ grep FGFR2 cosmic_lookup_table.tsv | grep M538I
FGFR2_ENST00000369056   c.1614G>T   p.M538I 37  10  123258070   123258070   C   A   -
FGFR2_ENST00000457416   c.1614G>T   p.M538I 37  10  123258070   123258070   C   A   -

unit test

def test_misses():
    MISSES = ['FGFR3 Y375C', 'FGFR2 M538I', 'FGFR2 C383R', 'FGFR3 K652E', 'FGFR3 G372C']
    for profile in MISSES:
        gene_index, mut_index, biomarkers, fusions = jax._parse_profile(profile)
        if not (len(gene_index) == len(mut_index) == len(biomarkers)):
            print(
                "ERROR: This molecular profile has been parsed incorrectly!")
            print(json.dumps(
                {"molecular_profile": profile},
                indent=2, sort_keys=True))
        print gene_index, mut_index, biomarkers, fusions
        matches = LOOKUP_TABLE.get_entries(gene_index[0], mut_index[0])
        print matches

>>>

['FGFR3'] ['Y375C'] [''] []
[]
['FGFR2'] ['M538I'] [''] []
[]
['FGFR2'] ['C383R'] [''] []
[]
['FGFR3'] ['K652E'] [''] []
[]
['FGFR3'] ['G372C'] [''] []
[]

@mayfielg, @jgoecks - at least for these 4 examples, the identifiers exist, but are not being returned

ahwagner commented 6 years ago

On selecting canonical transcripts (for GrCh37), while this is mentioned in the Ensembl glossary, the only canonical transcripts I could find are those listed at UCSC. What follows are instructions for selecting those transcripts:

From Ensembl Gene .gtf file, gene symbol -> Gene transcripts:

zgrep 'gene_name "BRAF"' Homo_sapiens.GRCh37.87.gtf.gz | awk '($3 == "transcript")' | perl -pe 's/.*transcript_id "(\w+)".*/$1/' | tee BRAF.txt
ENST00000496384
ENST00000288602
ENST00000479537
ENST00000497784
ENST00000469930

Find UCSC high-quality transcripts:

zgrep -Ff BRAF.txt knownToEnsembl.txt.gz | cut -f1 | tee BRAF.ucsc.txt
uc003vwc.4

If multiple remaining transcripts, filter on the knownIsoform table (schema, file):

zgrep -Ff BRAF.ucsc.txt knownIsoforms.txt.gz | cut -f2 | tee BRAF.canonical.ucsc.txt
uc003vwc.4

Translate to refseq mRNA for mutalyzer using knownToRefseq table:

zgrep -Ff BRAF.canonical.ucsc.txt knownToRefSeq.txt.gz | cut -f2
NM_004333

From there, a query of NM_004333:p.V600E at mutalyzer returns a genomic coordinate!

The last step is to get the most recent version of the transcript before sending to allele registry--will write a response in this thread with those instructions later.

bwalsh commented 6 years ago

Protein lookup via myvariant.info

BRAF V600E example

$ curl -s http://myvariant.info/v1/query?q=BRAF%20V600E  | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt  }'
{
  "referenceName": "GRCh37",
  "chromosome": "7",
  "start": 140534535,
  "end": 140534535,
  "ref": "A",
  "alt": "G"
}

others from our "misses"

$ curl -s http://myvariant.info/v1/query?q=FLT3%20N676D | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt  }'
{
  "referenceName": "GRCh37",
  "chromosome": "13",
  "start": 28644637,
  "end": 28644637,
  "ref": "T",
  "alt": "A"
}

$ curl -s http://myvariant.info/v1/query?q=FGFR3%20G372C  | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt  }'
{
  "referenceName": "GRCh37",
  "chromosome": "4",
  "start": 1803650,
  "end": 1803650,
  "ref": "G",
  "alt": "A"
}

bwalsh commented 6 years ago

@ahwagner : can you take a look at the output in this file. It should correspond to the ppm_re 'found' items from your notebook. ppm_re.json.txt

bwalsh commented 6 years ago

quick call to get gene location

$ curl -s -X GET "http://mygene.info/v3/query?q=AR&fields=genomic_pos_hg19" -H "accept: application/json" | jq .hits[0].genomic_pos_hg19
{
  "chr": "X",
  "end": 66950461,
  "start": 66764465,
  "strand": 1
}

bwalsh commented 6 years ago

ahwagner commented 6 years ago

@bwalsh the quick call you mentioned in https://github.com/ohsu-comp-bio/g2p-aggregator/issues/88#issuecomment-367881406 is sufficient to grab the coordinates for many of the above sets. I have revised the requirements of these sets to take advantage of this API call, avoiding the difficulty of selecting a representative transcript:

[ ] 7.2% of associations harvested from MolecularMatch have no feature_names. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).
[ ] Association feature_names of the format <gene> amplification or <gene> loss should have one corresponding feature of ~representative transcript~ gene coordinates.
[ ] Gene fusions (feature_names of the format <gene>-<gene> or <gene>:Fusions) should have two corresponding features of ~representative transcript~ gene coordinates, one from each constituent gene.
[ ] Bucket mutations (e.g. <gene> mutation, <gene> inact mut) should have one corresponding feature of ~representative transcript~ gene coordinates.
[ ] exon mutations (containing a numbered exon in the feature_names) should use the exon coordinates for the provided transcript. If no transcript is provided, ~use the exon coordinates from the representative transcript~ do not normalize.

bwalsh commented 6 years ago

7.2% of associations harvested from MolecularMatch have no feature_names. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).

Breakdown

all molecularmatch 2085
no feature_names associations 155
no feature_names and no feature.start 13
of those 13 associations without start [BRCA, PDGFR, PORCN, VEGF, VEGFR]
of those 5 genes 4 have no hits with genomic_pos_hg19; 1 [PORCN] have multiple answers for genomic_pos_hg19

Next steps:

@ahwagner: can you confirm you are seeing the same?
bwalsh: add code to handle multiple genomic_pos_hg19 answers. Strategy - skip chromosome *_PATCH

BRCA

$ curl 'http://mygene.info/v3/query?q=BRCA&fields=genomic_pos_hg19'
{
  "max_score": 27.281399,
  "took": 6,
  "total": 1,
  "hits": [
    {
      "_id": "3376065",
      "_score": 27.281399
    }
  ]
}

PDGFR

$ curl 'http://mygene.info/v3/query?q=PDGFR&fields=genomic_pos_hg19'
{
  "max_score": 10.167263,
  "took": 8,
  "total": 156,
  "hits": [
    {
      "_id": "100487523",
      "_score": 10.167263
    }

PORCN

$ curl 'http://mygene.info/v3/query?q=PORCN&fields=genomic_pos_hg19'
{
  "max_score": 446.9968,
  "took": 12,
  "total": 193,
  "hits": [
    {
      "_id": "64840",
      "_score": 446.9968,
      "genomic_pos_hg19": [
        {
          "chr": "HG1436_HG1432_PATCH",
          "end": 48380213,
          "start": 48368361,
          "strand": 1
        },
        {
          "chr": "X",
          "end": 48379202,
          "start": 48367350,
          "strand": 1
        }
      ]
    },

VEGF

$ curl 'http://mygene.info/v3/query?q=VEGF&fields=genomic_pos_hg19'
{
  "max_score": 354.96655,
  "took": 14,
  "total": 580,
  "hits": [
    {
      "_id": "104948932",
      "_score": 354.96655
    },

VEGFR

$ curl 'http://mygene.info/v3/query?q=VEGFR&fields=genomic_pos_hg19'
{
  "max_score": 12.964587,
  "took": 11,
  "total": 40,
  "hits": [
    {
      "_id": "443177",
      "_score": 12.964587
    },

grmayfie commented 6 years ago

@ahwagner Can we close? Please close if appropriate.

ohsu-comp-bio / g2p-aggregator

Add coordinates to genomic features #88

Notes:

exactly how are these transcripts mapped to 'amplification' ?

BRAF V600E example

others from our "misses"

quick call to get gene location

BRCA

PDGFR

PORCN

VEGF

VEGFR