pymzml / pymzML

pymzML - an interface between Python and mzML Mass spectrometry Files
https://pymzml.readthedocs.io/en/latest/
MIT License

Question: Blocked gzip (bgzf) vs igzip #89

Closed hroest closed 6 years ago

hroest commented 6 years ago

I am trying to understand the difference between blocked gzip [1], [2] used in genomics and the igzip format used here. Is the implementation the same but named differently, or is this a completely different implementation? Will the tools developed for bgzf also work on igzip-compressed mzML files? Given that the format is pretty common in genomics, I wonder whether it would make sense to support it as well.

  1. http://www.htslib.org/doc/bgzip.html
  2. https://blastedbio.blogspot.ca/2011/11/bgzf-blocked-bigger-better-gzip.html
fu commented 6 years ago

Hi Hannes,

It is not quite the same, for two major reasons. a) Our index is not an additional file. We were using distributed filesystems, and the allocation block is so large compared to the index file size that a separate index file was really a waste of disk space. I am aware of alternatives that solve that problem, but we had that problem in house at the time :)

b) Block size is not limited to 64 kb (and an average MS1 on a newer machine is, as you most probably know, 100k+). The blocks themselves can be variable in size, as defined by the user during indexing.

The spirit of igzip is data-centric: it does not involve file-position bookkeeping by the user. In other words, with the alternative solutions, an interface has to be created that maps the chunk of data one is interested in to file positions and then concatenates the right blocks. igzip stores the data as the data blocks one is interested in and removes the need to tinker with file positions. The index can be any string.
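
To make the idea concrete, here is a minimal sketch of the general technique described above: each record is compressed as its own gzip member of arbitrary size, and a string-keyed index is stored inside the same file. This is not the actual igzip on-disk layout and not pymzML's API. The function names `write_indexed` and `read_record`, the JSON index, and the 16-byte trailer are all assumptions made up for illustration.

```python
import gzip
import io
import json
import struct

# Hypothetical trailer layout (an assumption, not the igzip format):
# two little-endian uint64 values, index offset and index length.
TRAILER = struct.Struct("<QQ")


def write_indexed(records, fileobj):
    """Write each record as an independent gzip member, then append an index.

    `records` maps arbitrary string keys to bytes payloads. Blocks are as
    large as the payload requires; there is no fixed block-size limit.
    """
    index = {}
    for key, payload in records.items():
        start = fileobj.tell()
        fileobj.write(gzip.compress(payload))
        index[key] = [start, fileobj.tell() - start]  # offset, compressed length

    # The index lives in the same file, so no companion index file is needed.
    index_bytes = json.dumps(index).encode("utf-8")
    index_start = fileobj.tell()
    fileobj.write(index_bytes)
    fileobj.write(TRAILER.pack(index_start, len(index_bytes)))


def read_record(fileobj, key):
    """Look up a record by its string key and return the decompressed bytes."""
    fileobj.seek(-TRAILER.size, io.SEEK_END)
    index_start, index_len = TRAILER.unpack(fileobj.read(TRAILER.size))
    fileobj.seek(index_start)
    index = json.loads(fileobj.read(index_len).decode("utf-8"))

    offset, length = index[key]
    fileobj.seek(offset)
    return gzip.decompress(fileobj.read(length))


if __name__ == "__main__":
    buf = io.BytesIO()
    write_indexed(
        {
            "spectrum_1": b"<spectrum id='1'/>",
            # A payload well above 64 kb still ends up in a single block.
            "TIC": b"<chromatogram id='TIC'/>" * 5000,
        },
        buf,
    )
    print(len(read_record(buf, "TIC")))
```

The caller only ever deals with keys such as `"TIC"`; offsets stay internal to the reader. In this sketch the trailing index is written as raw bytes, so the file as a whole is not a clean gzip stream. How igzip itself stores its index is not spelled out in this thread.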

Hope that helps

Cheers

.c