moinakg / pcompress

A Parallelized Data Deduplication and Compression utility
http://moinakg.github.com/pcompress/
GNU Lesser General Public License v3.0

Feature request: random seeks #9

Open epitron opened 11 years ago

epitron commented 11 years ago

Hey again! :)

The pcompress format is really sweet. I like that you have metadata blocks!

Have you considered offering the ability to decompress and output an arbitrary sequence of bytes from the original (uncompressed) file? (Essentially, seek(pos), read(length).)

Being able to randomly read a compressed file would be quite handy, if the user creates an index of what's inside it.

The algorithm for seeking could be very simple -- I'm not too concerned about performance. I just don't want to have to read 5 gigabytes of data to get one block from the middle of the file. I also don't want to have to figure out what byte range each compressed block covers.

Would it be possible to store the (original uncompressed) byte range of each compressed block in .pz's metadata?
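To make the request concrete, here is a minimal sketch of how a seek(pos), read(length) call could be mapped onto blocks once the uncompressed size of each block is known. It is not pcompress code: the function name `blocks_for_range` and the plain list of block sizes are assumptions made for illustration only.

```python
from bisect import bisect_right


def blocks_for_range(block_sizes, pos, length):
    """Map an uncompressed byte range onto the blocks that cover it.

    block_sizes: uncompressed size of each block, in file order
                 (hypothetical input; a real index would come from the file).
    Returns a list of (block_index, offset_in_block, nbytes) tuples.
    """
    # Cumulative uncompressed start offset of every block.
    starts = []
    total = 0
    for size in block_sizes:
        starts.append(total)
        total += size

    # Clamp the requested range to the end of the data.
    end = min(pos + length, total)
    pieces = []
    if pos < 0 or pos >= end:
        return pieces

    # Binary search: the last block whose start offset is <= pos.
    i = bisect_right(starts, pos) - 1
    while pos < end:
        offset = pos - starts[i]
        nbytes = min(block_sizes[i] - offset, end - pos)
        pieces.append((i, offset, nbytes))
        pos += nbytes
        i += 1
    return pieces
```

With such a mapping, only the listed blocks would need to be decompressed, and the leading bytes of the first block and trailing bytes of the last one discarded.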

P.S.: I don't know if you've seen this, but pixz (https://github.com/vasi/pixz) is really cool. It indexes the tarballs it creates, so that you can list or read individual files instantly. :)

moinakg commented 11 years ago

Thanks for the suggestion. Yes, I have considered this, and it is quite easy to insert an index at the end of the file. It is also easy to create the index on the fly just by reading the headers and skipping the data blocks. Each compressed block header already has complete information about the compressed and uncompressed sizes; without it, decompressing the blocks would be impossible.

So it is possible to provide the feature of an arbitrary byte range extraction. Thanks for the pixz link. I will check it out and look at implementing this feature.
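The header-scanning idea described above could look roughly like the sketch below. The on-disk layout is an assumption, since the real pcompress format is not documented in this thread: each block is taken to be a 16-byte header holding two big-endian 64-bit integers (compressed size, then uncompressed size) followed by the compressed payload. The names `HEADER`, `build_index`, and `make_block` are hypothetical.

```python
import io
import struct

# Hypothetical block header: compressed size, uncompressed size (big-endian u64).
# This is NOT the real pcompress layout; it only illustrates the technique.
HEADER = struct.Struct(">QQ")


def build_index(stream):
    """Scan block headers and skip payloads, without decompressing anything.

    Returns a list of tuples:
    (uncompressed_start, uncompressed_size, payload_offset, compressed_size)
    """
    index = []
    uncompressed_pos = 0
    while True:
        raw = stream.read(HEADER.size)
        if not raw:
            break  # clean end of file
        if len(raw) < HEADER.size:
            raise ValueError("truncated block header")
        comp_size, orig_size = HEADER.unpack(raw)
        payload_offset = stream.tell()
        index.append((uncompressed_pos, orig_size, payload_offset, comp_size))
        uncompressed_pos += orig_size
        # Jump over the compressed payload to the next header.
        stream.seek(comp_size, io.SEEK_CUR)
    return index


def make_block(payload, orig_size):
    """Build one fake block in the assumed layout, for demonstration only."""
    return HEADER.pack(len(payload), orig_size) + payload
```

Because only headers are read, the cost of building the index grows with the number of blocks rather than with the size of the file. The resulting list could also be appended to the end of the file so that later reads skip the scan.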

epitron commented 11 years ago

Fantastic!

Maybe I'll poke around and see if I can add the seek/read feature myself.

moinakg commented 11 years ago

Great. Feel free to email me if you need clarifications. I have been too lazy to document the file format; I need to get around to doing that.