Open ahwagner opened 6 years ago
@ahwagner @jgoecks @mayfielg - can we discuss these changes during Tuesday's call?
@bwalsh sounds good.
Associations harvested from CGI all lack end coordinates, despite ~50% having start coordinates
Example:
{
"name": "ABL1:E255K",
"links": [
"http://reg.genome.network/refseq/RS000009",
"http://myvariant.info/v1/variant/chr9:g.133738363G>A?assembly=hg19",
"http://reg.genome.network/refseq/RS000033",
"http://www.ncbi.nlm.nih.gov/clinvar/variation/376090",
"http://reg.genome.network/refseq/RS000057",
"http://reg.genome.network/refseq/RS002120",
"http://myvariant.info/v1/variant/chr9:g.130862976G>A?assembly=hg38",
"http://www.ncbi.nlm.nih.gov/snp/121913448",
"http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=12573",
"http://www.ncbi.nlm.nih.gov/clinvar/?term=362969[alleleid]",
"http://reg.genome.network/allele/CA16602551"
],
"start": 133738363,
"synonyms": [
"NC_000009.10:g.132728184G>A",
"NG_012034.1:g.154096G>A",
"NC_000009.11:g.133738363G>A",
"chr9:g.133738363G>A",
"NC_000009.12:g.130862976G>A",
"COSM12573",
"chr9:g.130862976G>A",
"CM000671.1:g.133738363G>A",
"CM000671.2:g.130862976G>A"
],
"biomarker_type": "mutant",
"referenceName": "GRCh37",
"geneSymbol": "ABL1",
"alt": "A",
"ref": "G",
"chromosome": "9",
"description": "ABL1:T315A,F317L,F317V,F317I,F317C,F317I,Y253H,E255K,E255V,F359V,F359C,F359I"
}
Rule:
# pseudo code
if feature.start and feature.ref and not feature.end:
feature.end = feature.start + len(feature.ref) - 1
Associations describing protein changes and indels (e.g. BRCA2 D806H or KIT W557_K558del or BRAF V487_P492delinsA or KIT S501_A502dup) are >30% of missing features coordinates, and primarily come from OncoKB (1565 associations) and Jax CKB (668 associations). This should be handled by our COSMIC / allele registry lookups, might be a bug.
Allele registry does not have a name lookup.
How to verify cosmic lookup?
from #88
Currently, searches of the type Gene Variant (e.g. BRAF V600E) do a global OR search between terms. Ideally, we could rename the features.name field to the mutation name (without the prepended gene symbol) and by default limit the search to a global AND condition.
Subtasks:
remove prepended gene symbol in features.name, propogate changes to feature_names.keyword. change search behavior to AND by default
Association feature_names of the format
amplification or loss should have one corresponding feature of representative transcript coordinates.
Using mygene.info and ensembl, we can look up gene location transcriptions fairly easily. Issues:
Example:
From cgi
{u'Targeting': u'', u'Alteration': u'AR:amp', u'Source': u'PMID:23589709;PMID:21859989', u'cDNA': u'', u'Primary Tumor type': u'Prostate adenocarcinoma', u'individual_mutation': u'', u'Drug full name': u'AR inhibitor next gens', u'Association': u'Responsive', u'Drug family': u'[AR inhibitor next gen]', u'Biomarker': u'AR amplification', u'Drug': u'[]', u'Curator': u'RDientsmann', u'gDNA': u'', u'Drug status': u'', u'Gene': u'AR', u'transcript': u'', u'strand': u'', u'info': u'', u'Assay type': u'', u'Alteration type': u'CNA', u'region': u'', u'Evidence level': u'Pre-clinical', u'Primary Tumor acronym': u'PRAD', u'Metastatic Tumor Type': u''}
Feature after current processing:
{u'geneSymbol': u'AR', u'referenceName': u'GRCh37', u'biomarker_type': u'amplification', u'name': u'', u'description': u'AR:amp'}
Experiment:
query mygene.info
$ curl -X GET "http://mygene.info/v3/query?q=AR&fields=HGNC%2Censembl.gene" -H "accept: application/json"
"max_score": 370.9837,
"took": 24,
"total": 937,
"hits": [
{
"HGNC": "644",
"_id": "367",
"_score": 370.9837,
"ensembl": {
"gene": "ENSG00000169083"
}
},
{
"_id": "11835",
"_score": 306.0449,
"ensembl": {
"gene": "ENSMUSG00000046532"
}
},
....
using the ensembl id, formulate the location of the gene
$ curl -s 'http://grch37.rest.ensembl.org/lookup/id/ENSG00000169083?content-type=application/json;expand=1' | jq '{assembly_name: .assembly_name, chromosome: .seq_region_name, start: .start, end: .end }'
{
"assembly_name": "GRCh37",
"chromosome": "X",
"start": 66764465,
"end": 66950461
}
Transcripts are also available
$ curl -s 'http://grch37.rest.ensembl.org/lookup/id/ENSG00000169083?content-type=application/json;expand=1' | jq '.Transcript'
[
{
"source": "ensembl_havana",
"object_type": "Transcript",
"logic_name": "ensembl_havana_transcript",
"Exon": [
{
"object_type": "Exon",
"version": 7,
"species": "homo_sapiens",
"assembly_name": "GRCh37",
"end": 66766604,
"seq_region_name": "X",
"db_type": "core",
"strand": 1,
"id": "ENSE00001326500",
"start": 66764465
},
{
"object_type": "Exon",
"version": 1,
"species": "homo_sapiens",
"assembly_name": "GRCh37",
"end": 66863249,
"seq_region_name": "X",
"db_type": "core",
"strand": 1,
"id": "ENSE00003606052",
"start": 66863098
},
....
Associations describing protein changes
A quick test shows this was a legitimate miss.
$ grep ERBB2 cosmic_lookup_table.tsv | grep S783P | wc -l
0
Reviewing all the 187 unique gene/protein items these are the only ones that have grep 'hits'
(FGFR3, Y375C) 1
(FGFR2, M538I) 2
(FGFR2, C383R) 2
(FGFR3, K652E) 1
(FGFR3, G372C) 1
grep example:
$ grep FGFR2 cosmic_lookup_table.tsv | grep M538I
FGFR2_ENST00000369056 c.1614G>T p.M538I 37 10 123258070 123258070 C A -
FGFR2_ENST00000457416 c.1614G>T p.M538I 37 10 123258070 123258070 C A -
unit test
def test_misses():
MISSES = ['FGFR3 Y375C', 'FGFR2 M538I', 'FGFR2 C383R', 'FGFR3 K652E', 'FGFR3 G372C']
for profile in MISSES:
gene_index, mut_index, biomarkers, fusions = jax._parse_profile(profile)
if not (len(gene_index) == len(mut_index) == len(biomarkers)):
print(
"ERROR: This molecular profile has been parsed incorrectly!")
print(json.dumps(
{"molecular_profile": profile},
indent=2, sort_keys=True))
print gene_index, mut_index, biomarkers, fusions
matches = LOOKUP_TABLE.get_entries(gene_index[0], mut_index[0])
print matches
>>>
['FGFR3'] ['Y375C'] [''] []
[]
['FGFR2'] ['M538I'] [''] []
[]
['FGFR2'] ['C383R'] [''] []
[]
['FGFR3'] ['K652E'] [''] []
[]
['FGFR3'] ['G372C'] [''] []
[]
@mayfielg, @jgoecks - at least for these 4 examples, the identifiers exist, but are not being returned
On selecting canonical transcripts (for GrCh37), while this is mentioned in the Ensembl glossary, the only canonical transcripts I could find are those listed at UCSC. What follows are instructions for selecting those transcripts:
zgrep 'gene_name "BRAF"' Homo_sapiens.GRCh37.87.gtf.gz | awk '($3 == "transcript")' | perl -pe 's/.*transcript_id "(\w+)".*/$1/' | tee BRAF.txt
ENST00000496384
ENST00000288602
ENST00000479537
ENST00000497784
ENST00000469930
Find UCSC high-quality transcripts:
zgrep -Ff BRAF.txt knownToEnsembl.txt.gz | cut -f1 | tee BRAF.ucsc.txt
uc003vwc.4
If multiple remaining transcripts, filter on the knownIsoform table (schema, file):
zgrep -Ff BRAF.ucsc.txt knownIsoforms.txt.gz | cut -f2 | tee BRAF.canonical.ucsc.txt
uc003vwc.4
zgrep -Ff BRAF.canonical.ucsc.txt knownToRefSeq.txt.gz | cut -f2
NM_004333
From there, a query of NM_004333:p.V600E at mutalyzer returns a genomic coordinate!
The last step is to get the most recent version of the transcript before sending to allele registry--will write a response in this thread with those instructions later.
Protein lookup via myvariant.info
$ curl -s http://myvariant.info/v1/query?q=BRAF%20V600E | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt }'
{
"referenceName": "GRCh37",
"chromosome": "7",
"start": 140534535,
"end": 140534535,
"ref": "A",
"alt": "G"
}
$ curl -s http://myvariant.info/v1/query?q=FLT3%20N676D | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt }'
{
"referenceName": "GRCh37",
"chromosome": "13",
"start": 28644637,
"end": 28644637,
"ref": "T",
"alt": "A"
}
$ curl -s http://myvariant.info/v1/query?q=FGFR3%20G372C | jq '.hits[0] | { referenceName: "GRCh37", chromosome: .chrom, start: .hg19.start, end: .hg19.end, ref: .vcf.ref, alt: .vcf.alt }'
{
"referenceName": "GRCh37",
"chromosome": "4",
"start": 1803650,
"end": 1803650,
"ref": "G",
"alt": "A"
}
@ahwagner : can you take a look at the output in this file. It should correspond to the ppm_re 'found' items from your notebook. ppm_re.json.txt
$ curl -s -X GET "http://mygene.info/v3/query?q=AR&fields=genomic_pos_hg19" -H "accept: application/json" | jq .hits[0].genomic_pos_hg19
{
"chr": "X",
"end": 66950461,
"start": 66764465,
"strand": 1
}
Related issue: https://github.com/ohsu-comp-bio/g2p-aggregator/issues/92
@bwalsh the quick call you mentioned in https://github.com/ohsu-comp-bio/g2p-aggregator/issues/88#issuecomment-367881406 is sufficient to grab the coordinates for many of the above sets. I have revised the requirements of these sets to take advantage of this API call, avoiding the difficulty of selecting a representative transcript:
feature_names
. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).feature_names
of the format <gene> amplification
or <gene> loss
should have one corresponding feature of ~representative transcript~ gene coordinates.feature_names
of the format <gene>-<gene>
or <gene>:Fusions
) should have two corresponding features of ~representative transcript~ gene coordinates, one from each constituent gene.<gene> mutation
, <gene> inact mut
) should have one corresponding feature of ~representative transcript~ gene coordinates.feature_names
) should use the exon coordinates for the provided transcript. If no transcript is provided, ~use the exon coordinates from the representative transcript~ do not normalize.7.2% of associations harvested from MolecularMatch have no feature_names. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).
Breakdown
Next steps:
multiple genomic_pos_hg19 answers
. Strategy - skip chromosome *_PATCH $ curl 'http://mygene.info/v3/query?q=BRCA&fields=genomic_pos_hg19'
{
"max_score": 27.281399,
"took": 6,
"total": 1,
"hits": [
{
"_id": "3376065",
"_score": 27.281399
}
]
}
$ curl 'http://mygene.info/v3/query?q=PDGFR&fields=genomic_pos_hg19'
{
"max_score": 10.167263,
"took": 8,
"total": 156,
"hits": [
{
"_id": "100487523",
"_score": 10.167263
}
$ curl 'http://mygene.info/v3/query?q=PORCN&fields=genomic_pos_hg19'
{
"max_score": 446.9968,
"took": 12,
"total": 193,
"hits": [
{
"_id": "64840",
"_score": 446.9968,
"genomic_pos_hg19": [
{
"chr": "HG1436_HG1432_PATCH",
"end": 48380213,
"start": 48368361,
"strand": 1
},
{
"chr": "X",
"end": 48379202,
"start": 48367350,
"strand": 1
}
]
},
$ curl 'http://mygene.info/v3/query?q=VEGF&fields=genomic_pos_hg19'
{
"max_score": 354.96655,
"took": 14,
"total": 580,
"hits": [
{
"_id": "104948932",
"_score": 354.96655
},
$ curl 'http://mygene.info/v3/query?q=VEGFR&fields=genomic_pos_hg19'
{
"max_score": 12.964587,
"took": 11,
"total": 40,
"hits": [
{
"_id": "443177",
"_score": 12.964587
},
@ahwagner Can we close? Please close if appropriate.
Currently, 32% of our associations in 0.8 have no genomic coordinates assigned to at least one feature.
These fall into the following categories:
BRCA2 D806H
orKIT W557_K558del
orBRAF V487_P492delinsA
orKIT S501_A502dup
) are >30% of missing features coordinates, and primarily come from OncoKB (1565 associations) and Jax CKB (668 associations). This should be handled by our COSMIC / allele registry lookups, might be a bug.feature_names
. They all have exactly 1 gene name. Collect the representative transcript start and stop coordinates for the feature (perhaps use mygene.info or ensembl genes for this purpose).feature_names
of the format<gene> amplification
or<gene> loss
should have one corresponding feature of representative transcript coordinates.feature_names
of the format<gene>-<gene>
or<gene>:Fusions
) should have two corresponding features of representative transcript coordinates, one from each constituent gene.<gene> mutation
,<gene> inact mut
) should have one corresponding feature of representative transcript coordinates.feature_names
) should use the exon coordinates for the provided transcript. If no transcript is provided, use the exon coordinates from the representative transcript.Following these rules, we go from 68.4% of associations with variant start/end coords to 98.8%.
Many of these items refer to a "representative" transcript. This is annotated in ensembl as a "canonical" transcript.
Tagging as
review
per discussion with @jgoecks.@bwalsh would you provide an estimate of the effort needed to make these changes?
I uploaded my WIP analysis workbook to the
genie-analysis
branch: https://github.com/ohsu-comp-bio/g2p-aggregator/blob/genie-analysis/notebooks/knowledgebase_comparison.ipynbSee the
Feature coordinate filtering
section for regular expressions that can help in selecting features for annotating with start/end coordinates.