harmonize location information?

bwalsh commented 7 years ago

Problem statement:

How should we harmonize location information? That is, for entries without genomic location specifics, is it possible to retrieve the appropriate fields and append them to the evidence record?

Worked example

Evidence without location information

original from source : https://civic.genome.wustl.edu/events/genes/58/summary/variants/1970/summary#variant

in g2p: https://g2p-ohsu.ddns.net/_plugin/kibana/app/kibana#/doc/associations/associations-new/association?id=AV6bnNSKd2hRurWfSY2g&_g=()

Can we take the gene and variant info and deduce more?

Methodology:

Use the gene and variant info from the source, VHL V130L (c.388G>C), to retrieve hits from ClinVar (esearch).
If there are hits, retrieve the records using the ids in idlist (esummary).

Issues: What should we select from ClinVar? How do we map it to a feature?

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=V130L+%28c.388G%3EC%29&retmode=json

  {
      "header": {
          "type": "esearch",
          "version": "0.3"
      },
      "esearchresult": {
          "count": "1",
          "retmax": "1",
          "retstart": "0",
          "idlist": [
              "2229"
          ],
          "translationset": [
          ],
          "translationstack": [
              {
                  "term": "VHL[All Fields]",
                  "field": "All Fields",
                  "count": "686",
                  "explode": "N"
              },
              {
                  "term": "c0x2e388G0x3eC[All Fields]",
                  "field": "All Fields",
                  "count": "5",
                  "explode": "N"
              },
              "AND",
              "GROUP"
          ],
          "querytranslation": "VHL[All Fields] AND c0x2e388G0x3eC[All Fields]"
      }
  }

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=2229&retmode=json'

{
   "header": {
       "type": "esummary",
       "version": "0.3"
   },
   "result": {
       "uids": [
           "2229"
       ],
       "2229": {
           "uid": "2229",
           "obj_type": "Simple",
           "accession": "",
           "accession_version": "",
           "title": "NM_000551.3(VHL):c.388G&gt;C (p.Val130Leu)",
           "variation_set": [
               {
                   "measure_id": "17268",
                   "variation_xrefs": [
                       {
                           "db_source": "UniProtKB",
                           "db_id": "P40337#VAR_005733"
                       },
                       {
                           "db_source": "OMIM",
                           "db_id": "608537.0021"
                       },
                       {
                           "db_source": "dbSNP",
                           "db_id": "104893830"
                       }
                   ],
                   "variation_name": "NM_000551.3(VHL):c.388G&gt;C (p.Val130Leu)",
                   "cdna_change": "c.388G&gt;C (p.Val130Leu)",
                   "aliases": [
                   ],
                   "variation_loc": [
                       {
                           "status": "current",
                           "assembly_name": "GRCh38",
                           "chr": "3",
                           "band": "3p25;3p25.3",
                           "start": "10146561",
                           "stop": "10146561",
                           "inner_start": "",
                           "inner_stop": "",
                           "outer_start": "",
                           "outer_stop": "",
                           "display_start": "10146561",
                           "display_stop": "10146561",
                           "assembly_acc_ver": "GCF_000001405.33",
                           "annotation_release": "",
                           "alt": "C",
                           "ref": "G"
                       },
                       {
                           "status": "previous",
                           "assembly_name": "GRCh37",
                           "chr": "3",
                           "band": "3p25;3p25.3",
                           "start": "10188245",
                           "stop": "10188245",
                           "inner_start": "",
                           "inner_stop": "",
                           "outer_start": "",
                           "outer_stop": "",
                           "display_start": "10188245",
                           "display_stop": "10188245",
                           "assembly_acc_ver": "GCF_000001405.25",
                           "annotation_release": "",
                           "alt": "C",
                           "ref": "G"
                       }
                   ],
                   "allele_freq_set": [
                   ],
                   "variant_type": "single nucleotide variant"
               }
           ],
           "trait_set": [
               {
                   "trait_xrefs": [
                       {
                           "db_source": "Gene",
                           "db_id": "8056"
                       },
                       {
                           "db_source": "MedGen",
                           "db_id": "C1837915"
                       },
                       {
                           "db_source": "Orphanet",
                           "db_id": "238557"
                       },
                       {
                           "db_source": "OMIM",
                           "db_id": "263400"
                       }
                   ],
                   "trait_name": "Erythrocytosis, familial, 2"
               },
               {
                   "trait_xrefs": [
                       {
                           "db_source": "MedGen",
                           "db_id": "C0019562"
                       },
                       {
                           "db_source": "Orphanet",
                           "db_id": "892"
                       },
                       {
                           "db_source": "OMIM",
                           "db_id": "193300"
                       }
                   ],
                   "trait_name": "Von Hippel-Lindau syndrome"
               },
               {
                   "trait_xrefs": [
                       {
                           "db_source": "MedGen",
                           "db_id": "C0027672"
                       }
                   ],
                   "trait_name": "Hereditary cancer-predisposing syndrome"
               }
           ],
           "supporting_submissions": {
               "scv": [
                   "SCV000053262",
                   "SCV000580968",
                   "SCV000022475",
                   "SCV000264729"
               ],
               "rcv": [
                   "RCV000030586",
                   "RCV000002317",
                   "RCV000492250"
               ]
           },
           "clinical_significance": {
               "description": "Pathogenic",
               "last_evaluated": "2016/08/16 00:00",
               "review_status": "criteria provided, multiple submitters, no conflicts"
           },
           "record_status": "",
           "gene_sort": "VHL",
           "chr_sort": "03",
           "location_sort": "00000000000010146561",
           "variation_set_name": "",
           "variation_set_id": "",
           "genes": [
               {
                   "symbol": "VHL",
                   "geneid": "7428",
                   "strand": "+",
                   "source": "submitted"
               }
           ]
       }
   }
}
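
The two-step lookup above could be scripted roughly as follows. This is a minimal sketch, not the aggregator's implementation: the helper names (`esearch_url`, `esummary_url`, `extract_locations`) are hypothetical, and it assumes the esummary JSON keeps the shape shown above. Parsing is kept separate from the HTTP call so it can be tried against a saved response.

```python
import html
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="clinvar"):
    """Build the esearch URL for a free-text term (hypothetical helper)."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmode": "json"}
    )


def esummary_url(uid, db="clinvar"):
    """Build the esummary URL for one id taken from esearch's idlist."""
    return f"{EUTILS}/esummary.fcgi?" + urlencode(
        {"db": db, "id": uid, "retmode": "json"}
    )


def extract_locations(esummary, uid):
    """Return {assembly_name: location dict} for one ClinVar uid.

    Assumes the response layout shown above:
    result -> uid -> variation_set[] -> variation_loc[].
    """
    record = esummary["result"][uid]
    locations = {}
    for variation in record.get("variation_set", []):
        for loc in variation.get("variation_loc", []):
            locations[loc["assembly_name"]] = {
                # Titles come back HTML-escaped ("&gt;"), so unescape them.
                "name": html.unescape(variation.get("variation_name", "")),
                "chromosome": loc["chr"],
                "start": int(loc["start"]),
                "end": int(loc["stop"]),
                "ref": loc.get("ref"),
                "alt": loc.get("alt"),
            }
    return locations
```

Which assembly to append to the evidence record (GRCh37 or GRCh38) is still the open question; the sketch returns both and leaves that choice to the caller.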

bwalsh commented 7 years ago

@mayfielg @jgoecks @ahwagner Can you take a look and comment?

jgoecks commented 7 years ago

We're currently using COSMIC for variant normalization. Using other sources instead of, or in addition to, COSMIC would be fine as well. COSMIC seemed simple and comprehensive at the time. My suggestion is mygene.info and myvariant.info. The key is often being able to search by HGVS protein change + gene and find the genomic location.
This variant is in COSMIC, so it's a bug that we're not normalizing it. The issue is that the CIViC harvester assumes that CIViC will provide genomic coordinates, but sometimes it does not. I can fix this easily if it's what we want to do.
I'd like to see more about which associations are not being normalized. Specifically, if biomarker type is working, how well are we doing on normalization of SNPs (ignoring fusions, CNVs, overexpression, etc.)? I had some code a while back that suggested we were normalizing >70% of SNPs for most sources.
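
The lookup described here, gene plus protein change mapped to a genomic location, might look like the sketch below. The table columns (`gene`, `protein_change`, `chrom`, `start`, `end`, `ref`, `alt`) and the function names are assumptions for illustration, not the actual layout of the COSMIC lookup used by the harvesters.

```python
import csv
import io


def load_lookup(handle):
    """Index a tab-separated lookup table by (gene, protein change)."""
    index = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["gene"].upper(), row["protein_change"].upper())
        index.setdefault(key, []).append(row)
    return index


def normalize(index, gene, protein_change):
    """Return candidate genomic locations, or [] when nothing matches."""
    # Accept both "V130L" and "p.V130L" spellings of the protein change.
    change = protein_change.strip()
    if change.lower().startswith("p."):
        change = change[2:]
    return index.get((gene.strip().upper(), change.upper()), [])


# Hypothetical one-row table; the coordinates are the GRCh37 values from the
# ClinVar record for VHL V130L quoted earlier in this thread.
SAMPLE = (
    "gene\tprotein_change\tchrom\tstart\tend\tref\talt\n"
    "VHL\tV130L\t3\t10188245\t10188245\tG\tC\n"
)
index = load_lookup(io.StringIO(SAMPLE))
hits = normalize(index, "VHL", "p.V130L")
```

Returning a list rather than a single row keeps ambiguous matches visible, since one protein change can correspond to more than one nucleotide change.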

jgoecks commented 7 years ago

Shirley Li of MolecularMatch pointed out TransVar (https://bitbucket.org/wanding/transvar), which looks like it may meet our needs as well.

malachig commented 7 years ago

Relevant resources for variant identities/aliases that were mentioned on the call:

Allele Registry
TransVar
MyVariantInfo
ClinVar
COSMIC

We use all of these in CIViC.

bwalsh commented 7 years ago

I've added genomic location information to the unharmonized dashboard, where:

unharmonized-features == NOT _exists_:features.start
unharmonized-biomarker_type == NOT _exists_:features.biomarker_type

Both tables filter the associations detail below so you can see the raw data.
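
The dashboard filters are Lucene query strings. For scripted counts, the `features.start` condition could also be expressed in the Elasticsearch query DSL or applied to harvested documents directly. A minimal sketch, assuming each association is a dict with a `source` key and a `features` list; the function names are hypothetical.

```python
from collections import Counter

# Query DSL equivalent of the Lucene filter: NOT _exists_:features.start
UNHARMONIZED_QUERY = {
    "query": {"bool": {"must_not": [{"exists": {"field": "features.start"}}]}}
}


def is_unharmonized(association):
    """True when no feature on the association carries a start coordinate."""
    features = association.get("features") or []
    return not any(f.get("start") is not None for f in features)


def unharmonized_by_source(associations):
    """Count unharmonized associations per source."""
    return Counter(a["source"] for a in associations if is_unharmonized(a))
```

An association counts as harmonized as soon as any one of its features has a start, which mirrors how an `exists` query treats a field inside an array of objects.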

jgoecks commented 7 years ago

Thanks. Quick look at the dashboard says we're closer than we think:

there are ~6500 unique entries that are not normalized
it appears that we're not yet normalizing oncokb biological variants, which I estimate to be 50-60% of unnormalized entries
many are fusions/oncogenic mutations/CNVs that we should not be counting as unnormalized or should be normalizing via gene coordinates

So we should be able to get to 75%+ normalization with some minor improvements.

grmayfie commented 7 years ago

Hello all,

I've been looking into the oncokb portion of this issue, to normalize the genomic location of the biologic variants, and I've come across a separate apparent issue.

When I harvest locally and look at the 'feature' field, I have chrom, start, etc. in several documents where they're missing on dms-dev and on g2p-ohsu.ddns.net.

For example, at dms-dev.compbio.ohsu.edu, PIK3CA N345I shows no genomic locus info:

[screenshot: 2017-09-25, 12:12:46 pm]

Same thing on g2p-ohsu.ddns.net according to the unharmonized dashboard:

[screenshot: 2017-09-25, 12:17:08 pm]

Whereas, my local harvest does have the info.

[screenshot: 2017-09-25, 12:20:35 pm]

I'm thinking that a not insignificant portion of the issue we're seeing may actually just be old data. We should refresh our harvested docs at both sites and then see what the true counts for this issue are.

I'd like to do that first on dms-dev this afternoon, barring any complaints about needing to wait.

Also, somewhat unrelated, the unharmonized-evidence dashboard is not included in the kibana/everything.json file used to transfer the dashboards amongst our different sources, and it would be helpful if it was.

@jgoecks @bwalsh @ahwagner

bwalsh commented 7 years ago

@mayfielg : thanks. I've updated everything.json in v0.5 to include the unharmonized evidence dashboard.

On the issue above, both databases are the same, a fresh re-harvest of all sources.

A quick query shows the same results on the local and deployed servers. Perhaps it is the difference between oncokb and molecularmatch.

PIK3CA AND N345I AND source:oncokb

difference in detail reported by sources...

grmayfie commented 7 years ago

No, I don't think it has to do with oncokb vs molecularmatch.

This is what I'm looking at on my local machine.

bwalsh commented 7 years ago

Perhaps it is the underlying tsv?

$ ls -l oncokb*.tsv
lrwxr-xr-x  1 walsbr  OHSUM01\Domain Users     43 Aug 22 09:12 oncokb_all_actionable_variants.tsv -> oncokb_all_actionable_variants_20170822.tsv
-rw-r--r--  1 walsbr  OHSUM01\Domain Users  55054 Aug 22 09:12 oncokb_all_actionable_variants_20170822.tsv

grmayfie commented 7 years ago

My file of that type is oncokb_all_actionable_variants_20170621.tsv.

So, yes it's possible that there's something funny going on there with different file versions. However, I don't think that's the heart of this issue.

When I pull down the newest from v0.5 and refresh everything (not using a backup, etc.), I get this:

[screenshot: 2017-09-25, 4:34:30 pm]

Which is half as many documents apparently missing genomic location as on g2p-ohsu.ddns.net.

[screenshot: 2017-09-25, 4:35:39 pm]

So clearly we have some sort of issue going on with the harvesting.

I assumed that whatever was on g2p-ohsu.ddns.net was just old and needed to be refreshed. If it has indeed been reharvested very recently with v0.5 code, then I'm concerned we may have a larger issue at play, because I don't know what would explain why we're getting different counts.

Also, note that my local count for all oncokb docs is 4074, i.e. there are ~2000 oncokb docs with genomic location. At g2p-ohsu.ddns.net, there are 4149 oncokb docs, none of which have genomic location.

bwalsh commented 7 years ago

I found the issue: after re-creating cosmic_lookup_table.tsv, the oncokb count improved by ~43%. I have published this on g2p-ohsu.ddns.net.

One curiosity remains. The jax source uses the same cosmic lookup, but the number remains unchanged? @mayfielg can you take a look?

Unharmonized Counts (NOT _exists_:features.start)

source	v0.5_Count	v0.5-9-26_Count	Delta
cgi	789	789	0
civic	311	311	0
jax	6819	6819	0
molecularmatch	493	491	-2
oncokb	4149	2338	-1811
pmkb		0	0
sage	69	69	0
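
As a check on the ~43% figure, the deltas can be recomputed from the two count columns. A minimal sketch using the numbers in the table above; pmkb is left out because its v0.5 count is blank, and the `summarize` helper is only illustrative.

```python
# (v0.5 count, v0.5-9-26 count) per source, copied from the table above.
COUNTS = {
    "cgi": (789, 789),
    "civic": (311, 311),
    "jax": (6819, 6819),
    "molecularmatch": (493, 491),
    "oncokb": (4149, 2338),
    "sage": (69, 69),
}


def summarize(counts):
    """Return {source: (delta, percent change)} for each source."""
    summary = {}
    for source, (before, after) in counts.items():
        delta = after - before
        # Guard against a zero baseline rather than dividing by it.
        percent = 100.0 * delta / before if before else 0.0
        summary[source] = (delta, round(percent, 1))
    return summary


for source, (delta, percent) in summarize(COUNTS).items():
    print(f"{source}\t{delta}\t{percent}%")
```

A negative delta means fewer unharmonized documents, so the oncokb row is the only large improvement and jax is unchanged, as noted above.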

grmayfie commented 6 years ago

Genomic location harmonization is now discussed in other issues, such as #88. Closing this issue as a duplicate.

ohsu-comp-bio / g2p-aggregator