Hi there,
Really glad you guys shared your code and parsed data on GitHub!
Mostly this issue is just to put something on your radar to improve for the next version: for one entry (ClinVar variation 45271, details further down in this post) we get conflicting information for the gene symbol.
This can be traced to the following code from parse_clinical_xml.py:
    #find the gene symbol
    current_row['symbol']=''
    genesymbol = measure[i].findall('.//Symbol')
    if genesymbol is not None:
        for symbol in genesymbol:
            if(symbol.find('ElementValue').attrib.get('Type')=='Preferred'):
                current_row['symbol']=symbol.find('ElementValue').text;
                break
Notice how we break after the first match, but this entry has multiple "Preferred" symbols in the XML, in a confusing order, so whichever one happens to come first wins.
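To make the ambiguity visible, here is a minimal sketch that collects every "Preferred" symbol instead of stopping at the first one. The helper name `preferred_symbols` and the sample XML are my own assumptions for illustration; the XML is heavily simplified and not copied from the actual ClinVar record.

```python
import xml.etree.ElementTree as ET


def preferred_symbols(measure):
    """Return all 'Preferred' gene symbols under a Measure, in document order."""
    symbols = []
    for symbol in measure.findall('.//Symbol'):
        value = symbol.find('ElementValue')
        # Skip Symbol elements without an ElementValue child or without text.
        if value is not None and value.attrib.get('Type') == 'Preferred' and value.text:
            symbols.append(value.text)
    return symbols


# Hypothetical, simplified XML (assumption): two Preferred symbols, wrong one first.
SAMPLE = """
<Measure>
  <MeasureRelationship>
    <Symbol><ElementValue Type="Preferred">EGFR-AS1</ElementValue></Symbol>
  </MeasureRelationship>
  <MeasureRelationship>
    <Symbol><ElementValue Type="Preferred">EGFR</ElementValue></Symbol>
  </MeasureRelationship>
</Measure>
""".strip()

measure = ET.fromstring(SAMPLE)
print(preferred_symbols(measure))  # ['EGFR-AS1', 'EGFR']
```

With more than one result the parser could at least flag the row as ambiguous rather than silently taking the first hit.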
Perhaps one might want to simply use the gene symbol given in parentheses in the text of the first .//Name/ElementValue in the .//Measure.
I'll admit I haven't done an extensive analysis on the best choice for extracting the gene symbol from the xml, and the solution I just tossed your way ^ might be incorrect, but right now the symbol column in your output is not always reliable.
Also, check out https://www.ncbi.nlm.nih.gov/clinvar/variation/41400 (the first one I noticed via grep) for an example where the XML has no "Preferred" gene symbol at all, but the symbol given in parentheses seems to check out.
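A rough sketch of the parentheses idea, in case it helps. It assumes the name text follows the usual pattern of a transcript accession, then the gene symbol in parentheses, then a colon; the helper name `symbol_from_name`, the regex, and the sample XML are all hypothetical and I have not checked them against the full ClinVar dump.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: names look like "NM_005228.3(EGFR):c.2573T>G (p.Leu858Arg)",
# so the symbol is the parenthesised token directly before a colon.
NAME_SYMBOL = re.compile(r'\(([^()]+)\):')


def symbol_from_name(measure):
    """Return the symbol from the first Name/ElementValue, or '' if none is found."""
    value = measure.find('.//Name/ElementValue')
    if value is None or not value.text:
        return ''
    match = NAME_SYMBOL.search(value.text)
    return match.group(1) if match else ''


# Hypothetical, simplified XML (assumption), not copied from a real record.
SAMPLE = (
    '<Measure><Name>'
    '<ElementValue Type="Preferred">'
    'NM_005228.3(EGFR):c.2573T&gt;G (p.Leu858Arg)'
    '</ElementValue>'
    '</Name></Measure>'
)

measure = ET.fromstring(SAMPLE)
print(symbol_from_name(measure))  # EGFR
```

This does not depend on any `Symbol` element being marked "Preferred", so it would also cover records like 41400. Names that don't follow the accession(SYMBOL): pattern fall through to an empty string, same as the current default.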
The entry in question: on ClinVar it's EGFR (https://www.ncbi.nlm.nih.gov/clinvar/variation/45271/), and for the hgvs_c we also get EGFR (https://www.ncbi.nlm.nih.gov/nuccore/NM_005228), but in clinvar_alleles.single.b37.tsv we get EGFR-AS1.