Open romanzenka opened 5 years ago
Hello, I'd say this is relevant to #124 as well. Here is the documentation for tabix. I believe that bgzip -i
creates the same index that tabix does. Could you validate that for me?
@knausb I do not think bgzip -i
creates the same index. This is what source code of bgzip says:
fprintf(fp, " -i, --index compress and create BGZF index\n");
fprintf(fp, " -I, --index-name FILE name of BGZF index file [file.gz.gzi]\n");
The thing is, I do not see any input parameters for bgzip
that would let you specify chromosome and position columns to base the index on. So I believe the BGZF index file
must be something else - probably a map of all block offsets and their starts in the original file. In other words - this would let you seek if you know byte offsets to read from, but it won't let you make a query "chromosome 2, positions 10,000-20,000" like Tabix does. Would need to dig further to confirm.
The index created by bgzip
is *.gzi
while tabix
creates *.tbi
and the tabix index is a larger file. So they are not the same.
Yeah. Well, we need the tabix index - how to go from genomic coordinates to a chunk of data.
Thanks for the clarification in the indexes. I think you're correct. I found Rhtslib which I do not know how to use. But it looks like it may be a solution.
Ah, I just contributed a pull request to samtools/htsjdk#1248 which is a Java version of htslib. I do not have experience with the C version, but if it is anything like the java one, that's definitely a way to go.
Except I looked at Rhtslib and it is literally nothing but a wrapper around htslib. It has no interface, I believe you have to write C code to link to it. So it's not super useful for a casual R developer.
Thanks for the validation, that helps! I see Rhtslib at 1.7.0 while htslib is at 1.9.0. Does that matter? According to the documentation it sounds like they would be amenable to to incrementing Rhtslib.
"not super useful for a casual R developer" yeah, it looks like there's going to be a learning curve. But I think its the the way to go. File IO is a bottle neck and one of the things that inspired me to create vcfR. vcfR currently uses Rcpp
to get that performance. And I want to believe all of these libraries will play nicely together. But it might take me a bit to figure it all out.
@knausb So my understanding is that you wrote vcfR
using C code you made from scratch. I do not see anything wrong with your approach - htslib
is however more of a building block for many kinds of genomic I/O beyond VCF. It might be wise to join forces with them, vcfR
becoming a comfy wrapper around htslib
, inheriting all the new fetures - that said, I do not know enough about their C code quality at the moment. It is quite possible your code is better, because it is more specialized and more oriented towards the R integration. That's something that would have to be benchmarked and figured out.
This package might be useful for interfacing with tabix: https://github.com/zhanxw/seqminer
This is similar to issue #107.
We would like an interface where one can specify chromosome + region, and obtains only the parts of the VCF that correspond to this query. This would most likely mean supporting tabix indexed vcf files. Reading the whole VCF is simply not going to happen.
The current
skip
andnrows
parameters are not useful to us. To know those values, we would first have to index the vcf somehow - exactly the job for Tabix.