samtools / htslib

C library for high-throughput sequencing data formats

samtools index on uncompressed data #388

Open jkbonfield opened 8 years ago

jkbonfield commented 8 years ago

Looking into the bgzf code, I see that it supports reading totally uncompressed data (e.g. from zcat foo.bam > bar.bam), apparently because uncompressed BCF takes this form.

Curious as to how well this works, I tried samtools index on such a BAM file and it completed without complaint. However, samtools view dies when attempting to use the resulting index:

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
[main_samview] retrieval of region "2:100000000-100001000" failed due to truncated file or corrupt BAM index file

samtools idxstats works, though, and gives output identical to that from the compressed BAM's index.

It's a rather esoteric and therefore low-priority case, but we should make the indexing code choke at creation time rather than at usage time. (Unless it's actually meant to work and fakes up 64k uncompressed blocks just for the purpose of indexing and virtual offsets?)
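
For reference, here is a minimal sketch of how BGZF virtual offsets are packed according to the SAM/BAM specification: the upper 48 bits hold the file offset of the start of a block and the lower 16 bits hold the offset within that block's uncompressed data. The function names are illustrative and are not htslib API. In particular, fake_voffset is purely hypothetical: it shows what "faking up 64k uncompressed blocks" could look like, and is not a claim about what the indexing code actually does.

```python
# Sketch of BGZF virtual offset packing (per the SAM/BAM specification).
# All function names are illustrative; none of them are htslib API.

BLOCK_SIZE = 1 << 16  # 64 KiB, the maximum uncompressed BGZF block size


def make_voffset(coffset: int, uoffset: int) -> int:
    """Pack a block start (coffset) and within-block offset (uoffset)."""
    if not 0 <= coffset < (1 << 48):
        raise ValueError("coffset must fit in 48 bits")
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uoffset must fit in 16 bits")
    return (coffset << 16) | uoffset


def split_voffset(voffset: int) -> tuple:
    """Unpack a virtual offset into (coffset, uoffset)."""
    return voffset >> 16, voffset & 0xFFFF


def fake_voffset(pos: int) -> int:
    """HYPOTHETICAL: map a plain byte position in an uncompressed stream
    onto imaginary 64 KiB blocks so that it fits the virtual offset layout."""
    block_start = pos - (pos % BLOCK_SIZE)
    return make_voffset(block_start, pos % BLOCK_SIZE)
```

With a scheme like this, a reader would have to know that the file is uncompressed and interpret the "block start" as a plain file position rather than seeking to it and inflating a block, which is consistent with the index being created without complaint but failing at retrieval time.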

jkbonfield commented 8 years ago

Similarly, running samtools index on a BAM file that is gzipped rather than BGZF-compressed causes it to get stuck in an infinite loop inside inflate_gzip_block.

This is a more general issue than indexing, though: I can't samtools view this file either. Is this an unsupported and dead format variant that we can cull? It's clearly not in use for BAM. What about BCF or tabix?
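
To make the three variants discussed here concrete, below is a rough sketch of how they can be told apart from the first bytes of a file. It relies only on facts from the gzip and SAM/BAM specifications: a raw BAM stream starts with the magic "BAM\1", gzip members start with 1f 8b, and a BGZF block is a gzip member whose extra field carries a "BC" subfield of length 2. The 28-byte empty block whose absence triggers the "EOF marker is absent" warning is included too. The function names are my own and this is not how htslib implements its detection.

```python
import struct

# The 28-byte empty BGZF block used as the end-of-file marker (SAM/BAM spec).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)


def classify_header(head: bytes) -> str:
    """Classify a file by its leading bytes (illustrative helper, not htslib)."""
    if head[:4] == b"BAM\x01":
        return "uncompressed BAM"
    if head[:2] != b"\x1f\x8b":
        return "unknown"
    # From here on it is a gzip member; BGZF needs deflate (CM=8) and FEXTRA.
    if len(head) < 12 or head[2] != 8 or not (head[3] & 0x04):
        return "plain gzip"
    (xlen,) = struct.unpack_from("<H", head, 10)
    extra = head[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return "BGZF"
        pos += 4 + slen
    return "plain gzip"


def has_eof_marker(data: bytes) -> bool:
    """True if the data ends with the empty BGZF EOF block."""
    return data.endswith(BGZF_EOF)
```

A check along these lines at open time would let the indexer reject plain gzip and raw input up front instead of looping or producing an index that cannot be used.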