macarthur-lab / clinvar

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.

Clinical Significance Order Lost in Allele Grouping #48

Closed raymond301 closed 6 years ago

raymond301 commented 7 years ago

The per-submitter clinical significance is lost in group_by_allele.py.

For example, https://www.ncbi.nlm.nih.gov/clinvar/variation/92428/ has 4 submitters: 1 calls it Likely Benign and 3 call it Benign.

clinvar_allele_trait_pairs.single.tsv.gz | grep 39073 | grep 53676401 | less: has 3 lines, with the clinical significances in the correct order.

But clinvar_alleles_grouped.single.tsv.gz | grep 39073 | grep 53676401 | less: reduces it to "benign;likely benign", i.e. 2 entries for 4 submitters. The order is lost.

However, submitters_ordered is still correct: EGL Genetic Diagnostics,Eurofins Clinical Diagnostics;GeneReviews;Illumina Clinical Services Laboratory,Illumina;Center for Pediatric Genomic Medicine,Children's Mercy Hospital and Clinics
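
A minimal sketch of the difference being described, not the actual code in group_by_allele.py. The submitter names and the assignment of calls to submitters are hypothetical placeholders, and both function names are made up for illustration. Collapsing the significances into a sorted set produces "benign;likely benign", while an order-preserving join keeps one entry per submitter.

```python
# Hypothetical per-submission rows (submitter, clinical significance), in file order.
# These are placeholders, not the real records for this variant.
rows = [
    ("submitter_A", "Benign"),
    ("submitter_B", "Likely benign"),
    ("submitter_C", "Benign"),
    ("submitter_D", "Benign"),
]


def group_deduplicated(rows):
    """Collapse significances into a sorted set: order and multiplicity are lost."""
    return ";".join(sorted({sig.lower() for _, sig in rows}))


def group_ordered(rows):
    """Keep one significance per submitter, aligned with the submitter order."""
    submitters = ";".join(name for name, _ in rows)
    significances = ";".join(sig.lower() for _, sig in rows)
    return submitters, significances


dedup = group_deduplicated(rows)
submitters_ordered, significance_ordered = group_ordered(rows)

print(dedup)
print(submitters_ordered)
print(significance_ordered)
```

With the ordered form, the i-th significance belongs to the i-th submitter, so the two columns can be split on ";" and zipped back together.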

XiaoleiZ commented 7 years ago

If you are looking for all the reported significances, you can look at the columns pathogenic, likely_pathogenic, uncertain_significance, likely_benign and benign. Each one records the number of submissions with that clinical significance.

If you are looking for a date-ordered list of clinical significances, the current pipeline does not support that.
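
As a rough illustration of how such count columns relate to the individual submissions (a sketch only, not the pipeline's code): the function name is hypothetical, and the underscore-separated column names are an assumption based on the list above.

```python
from collections import Counter

# Assumed column names, following the list above.
COLUMNS = [
    "pathogenic",
    "likely_pathogenic",
    "uncertain_significance",
    "likely_benign",
    "benign",
]


def count_significances(significances):
    """Count submissions per clinical significance, keyed by column name."""
    counts = Counter(s.strip().lower().replace(" ", "_") for s in significances)
    return {col: counts.get(col, 0) for col in COLUMNS}


# The example from this issue: 1 Likely Benign call and 3 Benign calls.
counts = count_significances(["Benign", "Likely Benign", "Benign", "Benign"])
print(counts)
```

The counts recover how many submitters made each call, but not which submitter made which call, so they do not replace an ordered per-submitter column.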

kristjaneerik commented 7 years ago

https://github.com/macarthur-lab/clinvar/pull/33 was meant to address this. I thought you had incorporated my fixes, @XiaoleiZ?

See e.g. https://github.com/macarthur-lab/clinvar/pull/33/files#diff-7e4b0936672060588ac6388eac4f2992

XiaoleiZ commented 7 years ago

Thanks for pointing this out, @kristjaneerik. I found that I did not include this part. But the order would not be preserved this way. We should add the time information in the XML parsing step.

kristjaneerik commented 7 years ago

If you look at the rest of the diff, I did that too, e.g. https://github.com/macarthur-lab/clinvar/pull/33/files#diff-850079ba25065febf15fcf8c34207f57L135

raymond301 commented 7 years ago

Is this being worked on, i.e. putting the ordered fields back into the result set?

kristjaneerik commented 7 years ago

Unfortunately I don't have the time to pick this up again, but the code is all there in my PR #33. It was basically good to go; I just didn't have time to do a thorough comparison of the results to make sure no bugs were introduced.

bw2 commented 6 years ago

PR #33 has been merged so I'm closing this issue.

kristjaneerik commented 6 years ago

@bw2 yep, #33 was merged, but it looks like @XiaoleiZ reverted the changes that fixed this bug in #41. See e.g. group_by_allele.py in https://github.com/macarthur-lab/clinvar/commit/03aa390d4c79176ca6ed36b65eea638edee5eb05#diff-7e4b0936672060588ac6388eac4f2992L75