Closed kristjaneerik closed 7 years ago
@kristjaneerik thanks for the PR + writeup. I agree the _ordered columns are a useful addition.
To your question about references, I was running the pipeline using non-masked versions of hg19 and hg38 from UCSC (http://hgdownload.cse.ucsc.edu/downloads.html#human)
Right now, when I merge your branch with master, I'm running into the error below when I run
python2.7 master.py --b37-genome ~/hg19.fa --b38-genome ~/hg38.fa -E ~/ExAC.r1.sites.vep.vcf.gz -GG ~/gnomad.genomes.r2.0.1.sites.coding.autosomes_and_X.vcf.gz -GE ~/gnomad.exomes.r2.0.1.sites.vcf.gz
[May 11 15:42:27]: --> Exec 3.44: Rscript join_data.R variant_summary.txt.gz output_tmp/clinvar_alleles_grouped.single.b37.tsv.gz output_tmp/tmp.2017-05-11_12.40.31.clinvar_alleles_combined.single.b37.tsv.gz
[May 11 15:42:27]: Output (last mod N/A): output_tmp/clinvar_alleles_combined.single.b37.tsv.gz [doesn't exist yet]
[May 11 15:43:02]: [1] 282106 27
[May 11 15:43:33]: [1] 633182 30
[May 11 15:43:38]: Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
[May 11 15:43:38]: unimplemented type 'list' in 'EncodeElement'
[May 11 15:43:38]: Calls: write.table
[May 11 15:43:52]: $`write.table(combined, output_table, sep = "\\t", row.names = F, col.names = `
[May 11 15:43:52]: <environment: 0x2f521f58>
[May 11 15:43:52]:
[May 11 15:43:52]: attr(,"error.message")
[May 11 15:43:52]: [1] "Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, : \n unimplemented type 'list' in 'EncodeElement'\nCalls: write.table\n"
[May 11 15:43:52]: attr(,"class")
[May 11 15:43:52]: [1] "dump.frames"
[May 11 15:43:52]: Finished 3.44. Running time: 0:01:25.116371 sec.
@XiaoleiZ has tracked down the cause of the error - which is that the “ClinicalSignificance" field is set to “-" for many variants (even though they have a clinical significance listed on the clinvar website). The "-" causes the join_data.R error. We're planning to use a previous version of the variant_summary file until the data is fixed.
Ps. I'd like to replace the join_data.R script with a python equivalent at some point to get rid of the R dependency.
@bw2 I agree with your plan, it was something that I was considering as well -- so I rewrote join_data.R in Python & pandas :)
I added some things that I had been thinking about for a while, e.g. dealing with things like "no known pathogenicity", which should be classified as benign but was be considered pathogenic by join_data.R.
I'm setting this to a "WIP" PR, since I haven't run the full pipeline with this new script.
Is this likely to be merged any time soon? Failing that, is it useable in its current state? We have found this pipeline extremely useful and would like to be able to use it with the most up to date clinvar TSVs.
Thank you for your interest in our tool. Currently we haven't fixed the pipeline due to problems in using the latest variant_summary.txt file. We have contacted ClinVar team on it and would update the pipeline soon hopefully. Right now if you want to use the pipeline smoothly, please use the variant_summary file on March 2017 release (downloadable from ClinVar FTP) and add the option
-s /path/to/March_version.variant_summary.txt.gz
when running the master.py
script.
Ah. What were these issues? I am having problems with variants that show in the XML but are not in the TSV. For example allele ID 153767 is in the XML but not TSV.
@James-CG the issue you're seeing is only on the branch and not in master, right?
Ah, sorry. I am having some problems with the ClinVar source files, and wanted to check if they were the same ones you were having, and why you recommended not using the latest variant_summary.tsv. I am using a fork of master that I've changed to parse out additional variables.
I'm interested in the issues @XiaoleiZ mentioned with the Clinvar source files, it just happened this thread brought them up.
@kristjaneerik you put down this PR as work-in-progress. Have you had a chance to test it out? Should we move toward merging it in?
@bw2 I believe I simply wanted to do more thorough testing: running the full pipeline with the two reference genomes and comparing the outputs to previous versions. If that checks out it's good to go. Sorry about the delays, will try to get to this soon!
I have a new PR based yours and also try to fix most of issues reported by others and the paper referees' reports. https://github.com/macarthur-lab/clinvar/pull/41
Wonderful, thank you!
Overview
28 added two really useful columns,
clinical_significance_ordered
andreview_status_ordered
to the XML parser, but those columns didn't propagate through the rest of the pipeline. This PR fixes that and adds two more related columns,submitters_ordered
anddates_ordered
.Additional changes
I added a few other changes as well:
add_gnomad_fields.py
won't crash when it reaches a chromosome Y entrygunzip -c
instead ofzcat
--output-prefix
flag that defaults to../output/
, but can be used to generate output directories specific to e.g. a ClinVar version. I use it e.g.--output-prefix '../output-2017-03/'
. Just a convenience thing to reduce somemv
operations.test_group_by_allele.py
, see belowTesting
I first ran
master
locally.This gave a diff with the original
output
directory inmaster
, presumably due to slight differences in the reference genome used; I used ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz and ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz. Additionally, I usedvcf-concat gnomad.genomes.r2.0.1.sites.coding.autosomes.vcf.gz gnomad.genomes.r2.0.1.sites.coding.X.vcf.gz
to generate thegnomad.genomes.r2.0.1.sites.coding.autosomes_and_X.vcf.gz
input file for-GG
. When I looked into the diffs, it seemed they were all due to soft-masking in the genome version I used. Something I saw was that for some of themulti
alleles, one of the multiple entries got dropped out due to softmasking while the other did not, e.g. in a compound het. Perhaps if one variant is dropped, the other should be as well?I then ran this PR and compared the results. The stats files did not change, except for the column numbers.
I also changed
test_group_by_allele.py
, because I didn't quite understand the examples that were used there -- they aren't the same chrom/ref/pos/alt when you look at the HGVS name. I would rather use completely fake data that is more succinct or completely real data than real data that's forced to be wrong for the sake of the test. I've changed the tests to use programmatic fake data that is hopefully more easier to follow. I added a single TODO note there about pathogenic/benign/conflicted/gold_stars -- as it stands, the grouping ignores those fields, but conceivably the different entries could have different values such that these fields would have to be re-evaluated. That would mean the test and test data would have to change, hence the note there. Thoughts?Data
I did not include the new tables in this PR due to the diff to
master
I observed. Could you document what versions/sources of input files were used in the README? I can then run the PR with the right reference genomes and add in the new May tables.