macarthur-lab / clinvar

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.

Sometimes "symbol" disagrees with primary hgvs gene annotation #31

Closed: simnim closed this issue 6 years ago

simnim commented 7 years ago

Hi there, really glad you guys shared your code and parsed data on GitHub!

This issue is mostly to put something on your radar to improve in the next version: for the following entry, we get conflicting information for the gene symbol.

On ClinVar it's EGFR: https://www.ncbi.nlm.nih.gov/clinvar/variation/45271/

For the hgvs_c we get EGFR: https://www.ncbi.nlm.nih.gov/nuccore/NM_005228

But in clinvar_alleles.single.b37.tsv we get EGFR-AS1:

> paste <(gzcat clinvar_alleles.single.b37.tsv.gz | head -n1 | tr '\t' '\n' | cat -n) <(gzcat clinvar_alleles.single.b37.tsv.gz | grep RCV000038427 | tr '\t' '\n') | column -t -s$'\t'
     1  chrom                  7
     2  pos                    55249063
     3  ref                    G
     4  alt                    A
     5  measureset_type        Variant
     6  measureset_id          45271
     7  rcv                    RCV000038427;RCV000321080
     8  allele_id              54438
     9  symbol                 EGFR-AS1
    10  hgvs_c                 NM_005228.4:c.2361G>A
    11  hgvs_p                 NP_005219.2:p.Gln787=
    12  molecular_consequence  NM_005228.4:c.2361G>A:synonymous variant
    13  clinical_significance  Benign;Likely benign
    14  pathogenic             0
    15  benign                 1
    16  conflicted             0
    17  review_status          criteria provided, multiple submitters, no conflicts
    18  gold_stars             2
    19  all_submitters         Laboratory for Molecular Medicine,Partners HealthCare Personalized Medicine;PreventionGenetics,PreventionGenetics;Illumina Clinical Services Laboratory,Illumina
    20  all_traits             not specified;Not Specified;NOT SPECIFIED;Lung cancer;Lung Cancer
    21  all_pmids              25741868;17409930,23562183,23667368,24627688,24846033,25311215
    22  inheritance_modes
    23  age_of_onset
    24  prevalence
    25  disease_mechanism
    26  origin                 germline;somatic
    27  xrefs                  MedGen:CN169374;Genetic Alliance:Lung+Cancer/4334;Genetics Home Reference:lung-cancer;MedGen:C0684249;OMIM:211980;SNOMED CT:187875007

This can be traced to the following code from parse_clinical_xml.py:

            #find the gene symbol
            current_row['symbol']=''
            genesymbol = measure[i].findall('.//Symbol')
            if genesymbol is not None:
                for symbol in genesymbol:
                    if(symbol.find('ElementValue').attrib.get('Type')=='Preferred'):
                        current_row['symbol']=symbol.find('ElementValue').text;
                        break

Notice that the loop breaks at the first "Preferred" symbol it finds. In this example the XML contains multiple "Preferred" symbols, and the first one (EGFR-AS1) is not the gene named in the HGVS expression:

<ClinVarSet ID="17452916">
...
    <ClinVarAccession Acc="RCV000038427" Version="3" Type="RCV" DateUpdated="2017-01-25"/>
...
    <MeasureSet Type="Variant" ID="45271">
      <Measure Type="single nucleotide variant" ID="54438">
        <Name>
          <ElementValue Type="Preferred">NM_005228.4(EGFR):c.2361G&gt;A (p.Gln787=)</ElementValue>
        </Name>
...
          <Symbol>
            <ElementValue Type="Preferred">EGFR-AS1</ElementValue>
          </Symbol>
...
          <Symbol>
            <ElementValue Type="Preferred">EGFR</ElementValue>
          </Symbol>
...
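
To make the ambiguity visible, here is a minimal sketch that collects every "Preferred" symbol instead of stopping at the first. It is not the repo's code: the function name preferred_symbols and the trimmed SAMPLE string are made up for illustration, and the parent elements that are elided with "..." in the excerpt above are simply left out (the './/Symbol' search matches at any depth anyway).

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for the Measure element quoted above.
# The elements elided with "..." in the excerpt are omitted here.
SAMPLE = """
<Measure Type="single nucleotide variant" ID="54438">
  <Name>
    <ElementValue Type="Preferred">NM_005228.4(EGFR):c.2361G&gt;A (p.Gln787=)</ElementValue>
  </Name>
  <Symbol>
    <ElementValue Type="Preferred">EGFR-AS1</ElementValue>
  </Symbol>
  <Symbol>
    <ElementValue Type="Preferred">EGFR</ElementValue>
  </Symbol>
</Measure>
"""


def preferred_symbols(measure):
    """Return every 'Preferred' symbol under a Measure, in document order."""
    symbols = []
    for symbol in measure.findall('.//Symbol'):
        value = symbol.find('ElementValue')
        # Guard against Symbol elements that have no ElementValue child.
        if value is not None and value.get('Type') == 'Preferred':
            symbols.append(value.text)
    return symbols


measure = ET.fromstring(SAMPLE)
found = preferred_symbols(measure)
print(found)  # ['EGFR-AS1', 'EGFR']
```

With more than one entry in the list, the parser could at least flag the row as ambiguous rather than silently taking whichever symbol comes first.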

Perhaps one could simply use the gene symbol given in parentheses in the text of the first .//Name/ElementValue in the .//Measure.

I'll admit I haven't done an extensive analysis of the best way to extract the gene symbol from the XML, and the suggestion above might be incorrect, but right now the symbol column in your output is not always reliable.
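
For what it's worth, the idea of reading the symbol out of the name could be sketched roughly like this. It is untested against the full ClinVar XML. The helper symbol_from_name and the regex are my own guesses at how it might work, and the regex assumes the name always has the "transcript(SYMBOL):change" shape seen in this one record.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the gene symbol is the parenthesized token directly followed
# by a colon, as in "NM_005228.4(EGFR):c.2361G>A (p.Gln787=)".
SYMBOL_IN_NAME = re.compile(r'\(([^()]+)\):')


def symbol_from_name(measure):
    """Return the symbol embedded in the first Name/ElementValue, or ''."""
    name_value = measure.find('.//Name/ElementValue')
    if name_value is None or not name_value.text:
        return ''
    match = SYMBOL_IN_NAME.search(name_value.text)
    return match.group(1) if match else ''


# Hypothetical one-element sample built from the name quoted in this issue.
measure = ET.fromstring(
    '<Measure><Name><ElementValue Type="Preferred">'
    'NM_005228.4(EGFR):c.2361G&gt;A (p.Gln787=)'
    '</ElementValue></Name></Measure>'
)
print(symbol_from_name(measure))  # EGFR
```

Requiring the colon after the closing parenthesis keeps the trailing "(p.Gln787=)" from being picked up as a symbol. Names that don't follow this shape would come back empty and need some other fallback.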

Also, check out https://www.ncbi.nlm.nih.gov/clinvar/variation/41400 (the first one I noticed via grep) for an example where the XML has no "Preferred" gene symbol at all, but the symbol given in parentheses in the name seems to check out.

XiaoleiZ commented 6 years ago

The new release fixed this.