mattgodbolt / zindex

Create an index on a compressed text file
BSD 2-Clause "Simplified" License
620 stars, 37 forks

bgzip support #18

Open slowkow opened 8 years ago

slowkow commented 8 years ago

Would it be possible to support files compressed with bgzip? Here's the link to the source code. This would be very valuable for bioinformaticians.

Right now, here's what I get:

zindex test3.gz -v --regex '\trs([0-9]+)' --skip-first 5 --numeric --unique

Opening database test3.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 MiB
Indexing...
Progress: 18 bytes of 129.16 MiB (0.00%)
Index reading complete
Flushing
Done
Closing database

It works after I convert from bgzip to gzip:

zcat test3.gz | gzip > test4.gz
zindex test4.gz -v --regex '\trs([0-9]+)' --skip-first 5 --numeric --unique

Warning: Rebuilding existing index test4.gz.zindex
Opening database test4.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 MiB
Indexing...
Progress: 10 bytes of 123.81 MiB (0.00%)
Progress: 85.41 MiB of 123.81 MiB (68.98%)
Index reading complete
Flushing
Done
Closing database

mattgodbolt commented 8 years ago

I'd happily accept a patch to support this file format, but without clear documentation on what the file format is, plus a good way to "fast forward" and store partial decompression information, it may be very difficult.

schelhorn commented 7 years ago

I'd value support for this as well; the BGZF file format is gunzip compatible and the specs are here. The tabix index is published here.

mattgodbolt commented 7 years ago

Thanks for the +1. I'll see what I can do. Time for zindex/zq is seriously limited at the moment.

lonphan commented 7 years ago

+1 for bgzip.

mattgodbolt commented 7 years ago

Just trying to understand this a bit more. It seems like:

I'm not quite sure how zindex would fit into this? Perhaps someone here can share an example file and use case of queries?

At the very least, zindex should support concatenated gzip files (which are spec-compliant), even if it doesn't use the tabix format in any way. There might then be an option to drop the need for the compression buffers in the zindex indices, which would make them smaller.
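
To illustrate the concatenated-gzip point: a gzip file may contain several members back to back, and a reader that stops after the first member sees only a fraction of the data. This may be what the "18 bytes" progress line in the original report reflects, though that is a guess. The sketch below uses only Python's standard `zlib` and `gzip` modules; the helper name `decompress_members` and the sample payloads are made up for illustration and have nothing to do with zindex's own code.

```python
import gzip
import zlib


def decompress_members(data: bytes) -> list[bytes]:
    """Decompress every gzip member in `data`, one payload per member."""
    payloads = []
    while data:
        # wbits=31 tells zlib to expect gzip framing (header plus CRC32/ISIZE).
        d = zlib.decompressobj(wbits=31)
        payloads.append(d.decompress(data) + d.flush())
        # Whatever follows the end of this member is the start of the next one.
        data = d.unused_data
    return payloads


# Two independently compressed members, concatenated into one "file".
first = gzip.compress(b"header line\n")
second = gzip.compress(b"rs123\tA\tG\n")
combined = first + second

# A single decompressor stops at the end of the first member.
only_first = zlib.decompressobj(wbits=31).decompress(combined)

# Restarting at each member boundary recovers everything.
members = decompress_members(combined)
```

Here `only_first` holds just the first payload, while `members` holds both, which is the behaviour an indexer needs in order to walk the whole file.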

mattgodbolt commented 7 years ago

OK: zindex now supports what I believe is the bgzip format, though without understanding any of its tables etc. As bgzip is just concatenated gzip files (with extra trailer info), it should "just work". @slowkow and/or @schelhorn, can you give it a go please? Again, this doesn't use or understand the tabix part.
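
For readers wondering what the extra per-block information looks like: as far as I understand the BGZF description in the SAM specification, each block is an ordinary gzip member whose header carries a `BC` extra subfield holding the total block size minus one. The sketch below is written under that assumption and should be checked against the spec. It builds two such blocks by hand and then finds the block boundaries by reading headers only. The helpers `make_bgzf_block` and `block_offsets` are hypothetical names, not part of zindex, bgzip, or htslib.

```python
import struct
import zlib


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF-style block (assumed layout; payload must be small)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no framing
    cdata = comp.compress(payload) + comp.flush()
    # 12-byte fixed header + 6-byte BC subfield + data + 8-byte trailer.
    total = 12 + 6 + len(cdata) + 8
    # ID1, ID2, CM=8, FLG=4 (FEXTRA), MTIME, XFL, OS, XLEN=6
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # Subfield 'B','C', length 2, value = total block size minus one.
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, total - 1)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer


def block_offsets(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, size) for each block without decompressing anything."""
    offsets = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 2] != b"\x1f\x8b":
            raise ValueError(f"no gzip magic at offset {pos}")
        xlen = struct.unpack_from("<H", data, pos + 10)[0]
        extra = data[pos + 12:pos + 12 + xlen]
        size = None
        i = 0
        while i + 4 <= len(extra):
            si1, si2, slen = struct.unpack_from("<BBH", extra, i)
            if (si1, si2) == (ord("B"), ord("C")) and slen == 2:
                size = struct.unpack_from("<H", extra, i + 4)[0] + 1
                break
            i += 4 + slen
        if size is None:
            raise ValueError(f"no BC subfield at offset {pos}")
        offsets.append((pos, size))
        pos += size
    return offsets


blocks = [make_bgzf_block(b"line one\n"), make_bgzf_block(b"line two\n")]
data = b"".join(blocks)
offsets = block_offsets(data)

# Each block decompresses on its own, with no state from earlier blocks.
start, size = offsets[1]
second_payload = zlib.decompress(data[start:start + size], wbits=31)
```

Because every block starts from a clean compressor state, an index could in principle store only block offsets instead of saved decompression windows, which is the size saving suggested earlier in the thread.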