Closed bwalsh closed 6 years ago
@mayfielg @jgoecks @ahwagner Can you take a look and comment?
Shirley Li of MolecularMatch pointed out TransVar (https://bitbucket.org/wanding/transvar), which looks like it may meet our needs as well.
Relevant resources for variant identities/aliases that were mentioned on the call:
Allele Registry, TransVar, MyVariantInfo, ClinVar, COSMIC
We use all of these in CIViC.
I've added genomic location information to the unharmonized dashboard, where:
unharmonized-features == NOT _exists_:features.start
unharmonized-biomarker_type == NOT _exists_:features.biomarker_type
Both tables filter the associations detail below, so you can see the raw data.
Thanks. A quick look at the dashboard says we're closer than we think:
So we should be able to get to 75%+ normalization with some minor improvements.
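The dashboard filters quoted above are Lucene query strings, so the same counts can be reproduced outside Kibana. The following is a minimal sketch, assuming the filter is sent as an Elasticsearch `query_string` query and that documents carry a `source` field suitable for a terms aggregation (the field name and its mapping are assumptions, as is the helper name `missing_field_query`). It only builds the request body; sending it to the associations index is left to whatever client is in use.

```python
import json


def missing_field_query(field, group_by="source"):
    """Build a search body that counts documents lacking `field`.

    `group_by` is an assumed field name; depending on the index mapping
    it may need to be a keyword sub-field instead.
    """
    return {
        "size": 0,  # only the counts are needed, not the documents
        "query": {"query_string": {"query": f"NOT _exists_:{field}"}},
        "aggs": {"by_source": {"terms": {"field": group_by}}},
    }


# Same filter as the unharmonized-features table.
body = missing_field_query("features.start")
print(json.dumps(body, indent=2))
```

Running the same body against a local harvest and a deployed server would give directly comparable per-source counts.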
Hello all,
I've been looking into the oncokb portion of this issue, to normalize the genomic location of the biologic variants, and I've come across a separate apparent issue.
When I harvest locally and look at the 'feature' field, I have chrom, start, etc. in several documents where they're missing on dms-dev and on g2p-ohsu.ddns.net.
For example, at dms-dev.compbio.ohsu.edu, PIK3CA N345I shows no genomic locus info:
Same thing on g2p-ohsu.ddns.net according to the unharmonized dashboard:
Whereas, my local harvest does have the info.
I'm thinking that a significant portion of the problem we're seeing may actually just be old data. We should refresh our harvested docs at both sites and then see what the true counts are.
I'd like to do that first on dms-dev this afternoon, barring any complaints about needing to wait.
Also, somewhat unrelated, the unharmonized-evidence dashboard is not included in the kibana/everything.json file used to transfer the dashboards among our different sources, and it would be helpful if it were.
@jgoecks @bwalsh @ahwagner
@mayfielg : thanks. I've updated everything.json in v0.5 to include the unharmonized evidence dashboard.
On the issue above, both databases are the same, a fresh re-harvest of all sources.
A quick query shows the same results on the local and deployed servers. Perhaps it is the difference between oncokb and molecularmatch.
No, I don't think it has to do with oncokb vs molecularmatch.
This is what I'm looking at on my local machine.
Perhaps it is the underlying tsv?
$ ls -l oncokb*.tsv
lrwxr-xr-x 1 walsbr OHSUM01\Domain Users 43 Aug 22 09:12 oncokb_all_actionable_variants.tsv -> oncokb_all_actionable_variants_20170822.tsv
-rw-r--r-- 1 walsbr OHSUM01\Domain Users 55054 Aug 22 09:12 oncokb_all_actionable_variants_20170822.tsv
My file of that type is oncokb_all_actionable_variants_20170621.tsv.
So yes, it's possible that there's something funny going on there with different file versions. However, I don't think that's the heart of this issue.
When I pull down the newest from v0.5 and refresh everything (not using a backup, etc.), I get this:
Which is half as many documents apparently missing genomic location as on g2p-ohsu.ddns.net.
So clearly we have some sort of issue going on with the harvesting.
I assumed that whatever was on g2p-ohsu.ddns.net was just old and needed to be refreshed. If it has indeed been reharvested very recently with v0.5 code, then I'm concerned we may have a larger issue at play, because I don't know what would explain why we're getting different counts.
Also, note that my local count for all oncokb docs is 4074, i.e. there are ~2000 oncokb docs with genomic location. And at g2p-ohsu.ddns.net, there are 4149 oncokb docs, none of which have genomic location.
I found the issue. After re-creating cosmic_lookup_table.tsv, the oncokb number improved by ~43%. I have published this on g2p-ohsu.ddns.net.
One curiosity remains. The jax source uses the same cosmic lookup, but the number remains unchanged? @mayfielg can you take a look?
| source | v0.5_Count | v0.5-9-26_Count | Delta |
|---|---|---|---|
| cgi | 789 | 789 | 0 |
| civic | 311 | 311 | 0 |
| jax | 6819 | 6819 | 0 |
| molecularmatch | 493 | 491 | -2 |
| oncokb | 4149 | 2338 | -1811 |
| pmkb | 0 | 0 | |
| sage | 69 | 69 | 0 |
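As a check on the arithmetic, the deltas and the ~43% oncokb figure can be recomputed from the two count columns in the table. This is a small sketch using only the numbers quoted above; the function name `delta_report` is illustrative.

```python
# Counts of documents missing genomic location, copied from the table:
# source -> (v0.5_Count, v0.5-9-26_Count)
counts = {
    "cgi": (789, 789),
    "civic": (311, 311),
    "jax": (6819, 6819),
    "molecularmatch": (493, 491),
    "oncokb": (4149, 2338),
    "pmkb": (0, 0),
    "sage": (69, 69),
}


def delta_report(counts):
    """Return source -> (delta, percent change), percent is None when the old count is 0."""
    report = {}
    for source, (before, after) in counts.items():
        delta = after - before
        pct = round(100.0 * delta / before, 1) if before else None
        report[source] = (delta, pct)
    return report


report = delta_report(counts)
for source, (delta, pct) in report.items():
    print(f"{source:15s} {delta:6d} {'' if pct is None else f'{pct}%'}")
```

For oncokb this gives a delta of -1811, which is a 43.6% reduction in documents missing location, while jax stays at 0.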
Genomic location harmonization is now discussed in other issues, such as #88. Closing this issue as a duplicate.
Problem statement:
How to harmonize location information? I.e., for those entries without genomic location specifics, is it possible to retrieve appropriate fields and append them to the evidence record?
Worked example
Evidence without location information
original from source : https://civic.genome.wustl.edu/events/genes/58/summary/variants/1970/summary#variant
in g2p: https://g2p-ohsu.ddns.net/_plugin/kibana/app/kibana#/doc/associations/associations-new/association?id=AV6bnNSKd2hRurWfSY2g&_g=()
Can we take the gene and variant info and deduce more?
Methodology:
Use VHL V130L (c.388G>C) to retrieve hits from the ClinVar idlist.
Issues: What to select from ClinVar? How to map to a feature?
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=V130L+%28c.388G%3EC%29&retmode=json
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=2229&retmode=json'
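The two E-utilities calls above can be chained: esearch returns a list of ClinVar IDs, and each ID is then passed to esummary. The following is a minimal sketch using only the Python standard library. The helper names are illustrative, and it assumes the esearch JSON response carries its hits under `esearchresult.idlist`. Which summary fields to select and how to map them to a feature is left open, as in the issues listed above.

```python
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="clinvar"):
    """URL for an esearch query, e.g. term='V130L (c.388G>C)'."""
    params = urllib.parse.urlencode({"db": db, "term": term, "retmode": "json"})
    return f"{EUTILS}/esearch.fcgi?{params}"


def esummary_url(uid, db="clinvar"):
    """URL for the esummary record of a single ID."""
    params = urllib.parse.urlencode({"db": db, "id": uid, "retmode": "json"})
    return f"{EUTILS}/esummary.fcgi?{params}"


def fetch_json(url):
    """GET a URL and decode the JSON body (makes a network request)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def clinvar_ids(esearch_response):
    """Extract the ID list from a decoded esearch response (assumed layout)."""
    return esearch_response.get("esearchresult", {}).get("idlist", [])


def lookup(term):
    """Search ClinVar for `term` and return the raw summary for each hit."""
    ids = clinvar_ids(fetch_json(esearch_url(term)))
    return {uid: fetch_json(esummary_url(uid)) for uid in ids}
```

Calling `lookup("V130L (c.388G>C)")` issues the same two kinds of requests as the URLs above. Nothing here decides which hit is the right one when several IDs come back, so that selection step still needs a rule.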