Open jasper1918 opened 5 years ago
Hi Jasper,
From a philosophical perspective I think you may have a good point. But I'm going to vote against it for now. Part of the reason is that I'd have to learn to use a new library and may require a lot of rewriting of code to make that happen. So it sounds like a high effort but low benefit option.
A more important issue is that this would create a dependency on an external library: htslib. vcfR
currently uses zlib
to work with 'standard gzip'. zlib
is very popular so it will be on most machines including those at CRAN. Even though I agree that bcftools
and htslib
are well accepted in the bioinformatics community once you step outside that community I doubt anyone will know what you're talking about. Because of this I think your suggestion will create an external dependency that will not build on the CRAN computers and that means they will stop hosting vcfR
. I see a lot of value in having packages on CRAN, so I see that as a unfavorable choice.
Can you provide an example of why this is important? Personally, I do not need this feature. Although, my goal is usually to work in R. If your goal is to create VCF files for processing with other software there may be an issue. But I've found that 'standard gzip' is sufficient for my needs. Do you see something I don't? Thanks!
Hey Brian- Thanks for the nice overview.
My use case is meant to be an Rscript run from a command line as part of a workflow. I do need VCFs in bgzip format to be compatible with the universe of other tools in the bioinformatics community. I do agree that not every use case requires a bgzip/indexed VCF but I don't know of any tool that takes a gzip VCF. Some of the most common next steps in a pipeline post VCF filtering would be annotation or aggregation. These steps do require a bgzip and index file.
The workaround is simple enough. I can unzip and bgzip in a single stream. Also understand if you have other development priorities. In case you would like to implement, I looked on CRAN for latest packages that have bgzf.c and found https://github.com/cran/seqminer/tree/master/src. My preference would be to have vcfR write bgzip and index because of full compatibility and just as useful as gzip since it can be decoded with the standard library.
Thanks again for making vcfR available and for considering this feature
Hi Jasper, thanks for sticking with me on this. And thanks for pointing out seqminer, I was not aware of it. I like to 'think out loud' here, so here goes. A potential solution is to use bgzip which is a part of htslib.
> library(vcfR)
***** *** vcfR *** *****
This is vcfR 1.8.0
browseVignettes('vcfR') # Documentation
citation('vcfR') # Citation
***** ***** ***** *****
> data("vcfR_test")
> vcfR_test
***** Object of Class vcfR *****
3 samples
1 CHROMs
5 variants
Object size: 0 Mb
0 percent missing data
***** ***** *****
> write.vcf(vcfR_test, file = "myData.vcf.gz")
> library(vcfR)
> data("vcfR_test")
> vcfR_test
***** Object of Class vcfR *****
3 samples
1 CHROMs
5 variants
Object size: 0 Mb
0 percent missing data
***** ***** *****
> write.vcf(vcfR_test, file = "myData.vcf.gz")
> system("gunzip myData.vcf.gz")
> system("~/bin/htslib-1.9/bgzip -i myData.vcf")
> list.files()
[1] "myData.vcf.gz" "myData.vcf.gz.gzi" "vcfR.html" "vcfR.Rmd" "vcfR.Rproj"
>
This accomplishes our goal. But there are several reasons this appears undesirable.
vcfR::write.vcf()
currently only writes to *vcf.gz (zlib gz) so we have to uncompress the file and then recompress with bgzip
(which appears to only work with uncompressed VCF) which may present a bottle neckSo I think we're at "I like the idea, but I do not know how to implement it." Perhaps the path would be to reach out to the authors of seqminer for some guidance. Although, my quick look at the package makes me think they are mostly about reading the data in and not out. We could also reach out to the samtools team to see if we can get guidance from them.
Excellent package! I have prototyped a few solutions to use this to apply ML classification of called variants and do filtering.
This is more of a feature request. Just wanted to add that VCFs should block gzipped(bgzf) and not standard gzip. The former is necessary to index the VCF using bcftools and is widely considered a standard. https://samtools.github.io/bcftools/bgzf-aes-encryption.html
Thanks for all your efforts in making vcfR available.