samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

How to read bcf file ? #1537

Open bioinfornatics opened 3 years ago

bioinfornatics commented 3 years ago

Dear,

Firstly thanks for your amazing work on this library.

I try to read a bcffile through VCFFileReader but I have this error:

Unable to parse header with error: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:, for input source: file:///somewhere/build/resources/test/dataset1.bcf

This file is valid and I am able to read it with pysam in python

$ bcftools  view -h /somewhere/build/resources/test/dataset1.bcf | head | grep fileformat
##fileformat=VCFv4.1

I use htsjdk: 2.23.0 java: 8+

I need to increase the speed process due to slowness of python and is GIL limitation which do not allow the use of multithreading to read the file through a random access way

Thanks for your help

cmnbroad commented 3 years ago

@bioinfornatics Oh - can you run od -c -N 30 <file> and include the results here. You may be hitting this embarrassing bug which we've never fixed https://github.com/samtools/htsjdk/issues/946.

bioinfornatics commented 3 years ago

@cmnbroad thanks I was trying to get the magic number as it is done into BCFVersion.java with hexdump -C but I had any idea to do next :-)

So the result of od -c -N 30

0000000 037 213  \b 004  \0  \0  \0  \0  \0 377 006  \0   B   C 002  \0
0000020 250 003 255 224 313 213 023   I 034 307 333   L 226 245
cmnbroad commented 3 years ago

Yes, that looks like a gzip header. Sadly, htsjdk lacks support for gzipped BCF, as mentioned above.

lindenb commented 3 years ago

@bioinfornatics you can try to use a VCFIteratorBuilder instead of a VCFFileReader . As far as I remember, the GZ compression was first tested. (But you cannot do random-access)

bioinfornatics commented 3 years ago

thanks @lindenb and @cmnbroad

Arf … I though using the index file (or/and the gzip block) to read the bcf file in // by region.