knowledgesystems / signal

Somatic Integration of Germline Alterations in cancer
https://www.signaldb.org/
MIT License

Adding Clinvar to genome nexus #104

Closed jjgao closed 3 years ago

jjgao commented 3 years ago

17:g.41276045_41276046del is the most prevalent pathogenic germline mutation, but the links to ClinVar and dbSNP are missing. It does link to gnomAD, which in turn links to dbSNP and ClinVar.

Similarly: https://www.signaldb.org/variant/17:g.41209079_41209080insG

Solution: Add ClinVar to Genome Nexus.

leexgh commented 3 years ago

@jjgao We get ClinVar from MyVariantInfo, but they don't have ClinVar data for this particular variant. I already asked them about the same variant: https://github.com/biothings/myvariant.info/issues/106. I haven't heard back from them yet.

inodb commented 3 years ago

Ensembl's ClinVar data may also have been updated, so we might be able to use that instead of myvariant.info.

leexgh commented 3 years ago

GRCh37 ClinVar version in Ensembl: 12/2019. Source: https://grch37.ensembl.org/info/genome/variation/species/sources_documentation.html

leexgh commented 3 years ago

  1. XML:
    • Pros: Full data.
    • Cons: Too big to parse (15 GB); genomic coordinates are not unique; hard to find a tool for streaming parsing.
  2. VCF:
    • Pros: Easy to parse.
    • Cons: Missing clinical significance data.
  3. API:
    • Pros: Full data.
    • Cons: Access is rate-limited (3 queries per second); still needs a tool to parse XML.
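
On the streaming-parsing concern for the XML option, one approach that may be worth a look is Python's standard-library `xml.etree.ElementTree.iterparse`, which reads a file incrementally. The following is only a minimal sketch: `iter_records`, the `record_tag` parameter, and the element names in the demo are hypothetical placeholders, not the actual ClinVar XML schema.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one element per record without loading the whole file.

    record_tag is a placeholder: set it to whatever element wraps a single
    record in the XML release being parsed.
    """
    # Request "start" events as well, so the first event hands us the root.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            # Drop already-processed records so memory use stays flat.
            root.clear()


# Tiny made-up document, only to show the calling pattern.
demo = io.BytesIO(
    b"<Release><Record id='1'><Sig>a</Sig></Record>"
    b"<Record id='2'><Sig>b</Sig></Record></Release>"
)
for record in iter_records(demo, "Record"):
    print(record.get("id"), record.findtext("Sig"))
```

Each record has to be consumed before the loop advances, because the element is cleared once the generator resumes. This does not address the non-unique genomic coordinates, which would still need separate handling.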

leexgh commented 3 years ago

We decided to use VCF. We should stay consistent with ClinVar's interpretation and link out to their website, so VCF is the better option.
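
A rough sketch of what the VCF route could look like: parse one data line into its fixed columns plus an INFO dictionary, then build a link-out URL. This is not the Genome Nexus implementation. `VcfRecord`, `parse_vcf_line`, and `clinvar_link` are hypothetical names, the sample line is made up, and both the URL pattern and the use of the ID column as a ClinVar Variation ID are assumptions to verify against the actual ClinVar VCF.

```python
from typing import NamedTuple

# Assumed URL pattern for ClinVar variation pages; confirm before relying on it.
CLINVAR_VARIATION_URL = "https://www.ncbi.nlm.nih.gov/clinvar/variation/{}/"


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: str
    info: dict


def parse_vcf_line(line):
    """Parse one tab-separated VCF data line; return None for header lines."""
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, _qual, _filter, info = columns[:8]
    info_fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        # Flag-style INFO entries have no "=", so store them as True.
        info_fields[key] = value if sep else True
    return VcfRecord(chrom, int(pos), variant_id, ref, alt, info_fields)


def clinvar_link(record):
    """Build a link-out URL, assuming the ID column is a ClinVar Variation ID."""
    return CLINVAR_VARIATION_URL.format(record.variant_id)


# Made-up line for illustration only; not a real ClinVar record.
example = parse_vcf_line("17\t100\t12345\tAG\tA\t.\t.\tALLELEID=999;SOMEFLAG")
print(example.chrom, example.pos, clinvar_link(example))
```

Linking out by ID rather than copying interpretation text keeps the displayed information tied to whatever ClinVar currently shows on its own page.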