knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

VCF files should be block-gzipped and not gzipped #124

Open jasper1918 opened 5 years ago

jasper1918 commented 5 years ago

Excellent package! I have prototyped a few solutions to use this to apply ML classification of called variants and do filtering.

This is more of a feature request. Just wanted to add that VCFs should block gzipped(bgzf) and not standard gzip. The former is necessary to index the VCF using bcftools and is widely considered a standard. https://samtools.github.io/bcftools/bgzf-aes-encryption.html

Thanks for all your efforts in making vcfR available.

knausb commented 5 years ago

Hi Jasper,

From a philosophical perspective I think you may have a good point. But I'm going to vote against it for now. Part of the reason is that I'd have to learn to use a new library and may require a lot of rewriting of code to make that happen. So it sounds like a high effort but low benefit option.

A more important issue is that this would create a dependency on an external library: htslib. vcfR currently uses zlib to work with 'standard gzip'. zlib is very popular so it will be on most machines including those at CRAN. Even though I agree that bcftools and htslib are well accepted in the bioinformatics community once you step outside that community I doubt anyone will know what you're talking about. Because of this I think your suggestion will create an external dependency that will not build on the CRAN computers and that means they will stop hosting vcfR. I see a lot of value in having packages on CRAN, so I see that as a unfavorable choice.

Can you provide an example of why this is important? Personally, I do not need this feature. Although, my goal is usually to work in R. If your goal is to create VCF files for processing with other software there may be an issue. But I've found that 'standard gzip' is sufficient for my needs. Do you see something I don't? Thanks!

jasper1918 commented 5 years ago

Hey Brian- Thanks for the nice overview.

My use case is meant to be an Rscript run from a command line as part of a workflow. I do need VCFs in bgzip format to be compatible with the universe of other tools in the bioinformatics community. I do agree that not every use case requires a bgzip/indexed VCF but I don't know of any tool that takes a gzip VCF. Some of the most common next steps in a pipeline post VCF filtering would be annotation or aggregation. These steps do require a bgzip and index file.

The workaround is simple enough. I can unzip and bgzip in a single stream. Also understand if you have other development priorities. In case you would like to implement, I looked on CRAN for latest packages that have bgzf.c and found https://github.com/cran/seqminer/tree/master/src. My preference would be to have vcfR write bgzip and index because of full compatibility and just as useful as gzip since it can be decoded with the standard library.

Thanks again for making vcfR available and for considering this feature

knausb commented 5 years ago

Hi Jasper, thanks for sticking with me on this. And thanks for pointing out seqminer, I was not aware of it. I like to 'think out loud' here, so here goes. A potential solution is to use bgzip which is a part of htslib.

> library(vcfR)

   *****       ***   vcfR   ***       *****
   This is vcfR 1.8.0 
     browseVignettes('vcfR') # Documentation
     citation('vcfR') # Citation
   *****       *****      *****       *****

> data("vcfR_test")
> vcfR_test
***** Object of Class vcfR *****
3 samples
1 CHROMs
5 variants
Object size: 0 Mb
0 percent missing data
*****        *****         *****
> write.vcf(vcfR_test, file = "myData.vcf.gz")
> library(vcfR)
> data("vcfR_test")
> vcfR_test
***** Object of Class vcfR *****
3 samples
1 CHROMs
5 variants
Object size: 0 Mb
0 percent missing data
*****        *****         *****
> write.vcf(vcfR_test, file = "myData.vcf.gz")
> system("gunzip myData.vcf.gz")
> system("~/bin/htslib-1.9/bgzip -i myData.vcf")
> list.files()
[1] "myData.vcf.gz"     "myData.vcf.gz.gzi" "vcfR.html"         "vcfR.Rmd"          "vcfR.Rproj"       
> 

This accomplishes our goal. But there are several reasons this appears undesirable.

  1. It creates intermediate files
  2. vcfR::write.vcf() currently only writes to *vcf.gz (zlib gz) so we have to uncompress the file and then recompress with bgzip (which appears to only work with uncompressed VCF) which may present a bottle neck

So I think we're at "I like the idea, but I do not know how to implement it." Perhaps the path would be to reach out to the authors of seqminer for some guidance. Although, my quick look at the package makes me think they are mostly about reading the data in and not out. We could also reach out to the samtools team to see if we can get guidance from them.

knausb commented 5 years ago

I think I've decided that seqminer is not a solution here. It appears to read in portions of a VCF file. See ?seqminer::readVCFToMatrixByRange. In vcfR we're used to reading it all in. Or subsetting by rows. I just found Rhtslib which may be a viable path.