knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

Support Tabix #125

Open romanzenka opened 5 years ago

romanzenka commented 5 years ago

This is similar to issue #107.

We would like an interface where one can specify chromosome + region, and obtains only the parts of the VCF that correspond to this query. This would most likely mean supporting tabix indexed vcf files. Reading the whole VCF is simply not going to happen.

The current skip and nrows parameters are not useful to us. To know those values, we would first have to index the vcf somehow - exactly the job for Tabix.

knausb commented 5 years ago

Hello, I'd say this is relevant to #124 as well. Here is the documentation for tabix. I believe that bgzip -i creates the same index that tabix does. Could you validate that for me?

romanzenka commented 5 years ago

@knausb I do not think bgzip -i creates the same index. This is what source code of bgzip says:

fprintf(fp, "   -i, --index                compress and create BGZF index\n");
fprintf(fp, "   -I, --index-name FILE      name of BGZF index file [file.gz.gzi]\n");

The thing is, I do not see any input parameters for bgzip that would let you specify chromosome and position columns to base the index on. So I believe the BGZF index file must be something else - probably a map of all block offsets and their starts in the original file. In other words - this would let you seek if you know byte offsets to read from, but it won't let you make a query "chromosome 2, positions 10,000-20,000" like Tabix does. Would need to dig further to confirm.

knausb commented 5 years ago

The index created by bgzip is *.gzi while tabix creates *.tbi and the tabix index is a larger file. So they are not the same.

romanzenka commented 5 years ago

Yeah. Well, we need the tabix index - how to go from genomic coordinates to a chunk of data.

knausb commented 5 years ago

Thanks for the clarification in the indexes. I think you're correct. I found Rhtslib which I do not know how to use. But it looks like it may be a solution.

romanzenka commented 5 years ago

Ah, I just contributed a pull request to samtools/htsjdk#1248 which is a Java version of htslib. I do not have experience with the C version, but if it is anything like the java one, that's definitely a way to go.

Except I looked at Rhtslib and it is literally nothing but a wrapper around htslib. It has no interface, I believe you have to write C code to link to it. So it's not super useful for a casual R developer.

knausb commented 5 years ago

Thanks for the validation, that helps! I see Rhtslib at 1.7.0 while htslib is at 1.9.0. Does that matter? According to the documentation it sounds like they would be amenable to to incrementing Rhtslib.

"not super useful for a casual R developer" yeah, it looks like there's going to be a learning curve. But I think its the the way to go. File IO is a bottle neck and one of the things that inspired me to create vcfR. vcfR currently uses Rcpp to get that performance. And I want to believe all of these libraries will play nicely together. But it might take me a bit to figure it all out.

romanzenka commented 5 years ago

@knausb So my understanding is that you wrote vcfR using C code you made from scratch. I do not see anything wrong with your approach - htslib is however more of a building block for many kinds of genomic I/O beyond VCF. It might be wise to join forces with them, vcfR becoming a comfy wrapper around htslib, inheriting all the new fetures - that said, I do not know enough about their C code quality at the moment. It is quite possible your code is better, because it is more specialized and more oriented towards the R integration. That's something that would have to be benchmarked and figured out.

RichardJActon commented 4 years ago

This package might be useful for interfacing with tabix: https://github.com/zhanxw/seqminer