macarthur-lab / clinvar

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.
Other
122 stars 55 forks source link

[WIP] Propagating ordered fields #33

Closed kristjaneerik closed 7 years ago

kristjaneerik commented 7 years ago

Overview

28 added two really useful columns, clinical_significance_ordered and review_status_ordered to the XML parser, but those columns didn't propagate through the rest of the pipeline. This PR fixes that and adds two more related columns, submitters_ordered and dates_ordered.

Additional changes

I added a few other changes as well:

Testing

I first ran master locally.

This gave a diff with the original output directory in master, presumably due to slight differences in the reference genome used; I used ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz and ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz. Additionally, I used vcf-concat gnomad.genomes.r2.0.1.sites.coding.autosomes.vcf.gz gnomad.genomes.r2.0.1.sites.coding.X.vcf.gz to generate the gnomad.genomes.r2.0.1.sites.coding.autosomes_and_X.vcf.gz input file for -GG. When I looked into the diffs, it seemed they were all due to soft-masking in the genome version I used. Something I saw was that for some of the multi alleles, one of the multiple entries got dropped out due to softmasking while the other did not, e.g. in a compound het. Perhaps if one variant is dropped, the other should be as well?

I then ran this PR and compared the results. The stats files did not change, except for the column numbers.

I also changed test_group_by_allele.py, because I didn't quite understand the examples that were used there -- they aren't the same chrom/ref/pos/alt when you look at the HGVS name. I would rather use completely fake data that is more succinct or completely real data than real data that's forced to be wrong for the sake of the test. I've changed the tests to use programmatic fake data that is hopefully more easier to follow. I added a single TODO note there about pathogenic/benign/conflicted/gold_stars -- as it stands, the grouping ignores those fields, but conceivably the different entries could have different values such that these fields would have to be re-evaluated. That would mean the test and test data would have to change, hence the note there. Thoughts?

Data

I did not include the new tables in this PR due to the diff to master I observed. Could you document what versions/sources of input files were used in the README? I can then run the PR with the right reference genomes and add in the new May tables.

bw2 commented 7 years ago

@kristjaneerik thanks for the PR + writeup. I agree the _ordered columns are a useful addition.

To your question about references, I was running the pipeline using non-masked versions of hg19 and hg38 from UCSC (http://hgdownload.cse.ucsc.edu/downloads.html#human)

Right now, when I merge your branch with master, I'm running into the error below when I run

python2.7 master.py --b37-genome ~/hg19.fa --b38-genome ~/hg38.fa -E ~/ExAC.r1.sites.vep.vcf.gz -GG ~/gnomad.genomes.r2.0.1.sites.coding.autosomes_and_X.vcf.gz -GE ~/gnomad.exomes.r2.0.1.sites.vcf.gz
[May 11 15:42:27]: --> Exec 3.44: Rscript join_data.R variant_summary.txt.gz output_tmp/clinvar_alleles_grouped.single.b37.tsv.gz output_tmp/tmp.2017-05-11_12.40.31.clinvar_alleles_combined.single.b37.tsv.gz
[May 11 15:42:27]:         Output (last mod N/A): output_tmp/clinvar_alleles_combined.single.b37.tsv.gz  [doesn't exist yet]
[May 11 15:43:02]:         [1] 282106     27
[May 11 15:43:33]:         [1] 633182     30
[May 11 15:43:38]:         Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol,  :
[May 11 15:43:38]:           unimplemented type 'list' in 'EncodeElement'
[May 11 15:43:38]:         Calls: write.table
[May 11 15:43:52]:         $`write.table(combined, output_table, sep = "\\t", row.names = F, col.names = `
[May 11 15:43:52]:         <environment: 0x2f521f58>
[May 11 15:43:52]:
[May 11 15:43:52]:         attr(,"error.message")
[May 11 15:43:52]:         [1] "Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol,  : \n  unimplemented type 'list' in 'EncodeElement'\nCalls: write.table\n"
[May 11 15:43:52]:         attr(,"class")
[May 11 15:43:52]:         [1] "dump.frames"
[May 11 15:43:52]:     Finished 3.44. Running time: 0:01:25.116371 sec.

@XiaoleiZ has tracked down the cause of the error - which is that the “ClinicalSignificance" field is set to “-" for many variants (even though they have a clinical significance listed on the clinvar website). The "-" causes the join_data.R error. We're planning to use a previous version of the variant_summary file until the data is fixed.

Ps. I'd like to replace the join_data.R script with a python equivalent at some point to get rid of the R dependency.

kristjaneerik commented 7 years ago

@bw2 I agree with your plan, it was something that I was considering as well -- so I rewrote join_data.R in Python & pandas :)

I added some things that I had been thinking about for a while, e.g. dealing with things like "no known pathogenicity", which should be classified as benign but was be considered pathogenic by join_data.R.

I'm setting this to a "WIP" PR, since I haven't run the full pipeline with this new script.

James-CG commented 7 years ago

Is this likely to be merged any time soon? Failing that, is it useable in its current state? We have found this pipeline extremely useful and would like to be able to use it with the most up to date clinvar TSVs.

XiaoleiZ commented 7 years ago

Thank you for your interest in our tool. Currently we haven't fixed the pipeline due to problems in using the latest variant_summary.txt file. We have contacted ClinVar team on it and would update the pipeline soon hopefully. Right now if you want to use the pipeline smoothly, please use the variant_summary file on March 2017 release (downloadable from ClinVar FTP) and add the option -s /path/to/March_version.variant_summary.txt.gz when running the master.py script.

James-CG commented 7 years ago

Ah. What were these issues? I am having problems with variants that show in the XML but are not in the TSV. For example allele ID 153767 is in the XML but not TSV.

bw2 commented 7 years ago

@James-CG the issue you're seeing is only on the branch and not in master, right?

James-CG commented 7 years ago

Ah, sorry. I am having some problems with the ClinVar source files, and wanted to check if they were the same ones you were having, and why you recommended not using the latest variant_summary.tsv. I am using a fork of master that I've changed to parse out additional variables.

I'm interested in the issues @XiaoleiZ mentioned with the Clinvar source files, it just happened this thread brought them up.

bw2 commented 7 years ago

@kristjaneerik you put down this PR as work-in-progress. Have you had a chance to test it out? Should we move toward merging it in?

kristjaneerik commented 7 years ago

@bw2 I believe I simply wanted to do more thorough testing: running the full pipeline with the two reference genomes and comparing the outputs to previous versions. If that checks out it's good to go. Sorry about the delays, will try to get to this soon!

XiaoleiZ commented 7 years ago

I have a new PR based yours and also try to fix most of issues reported by others and the paper referees' reports. https://github.com/macarthur-lab/clinvar/pull/41

kristjaneerik commented 7 years ago

Wonderful, thank you!