ning / compress

High-performance, streaming/chunking Java LZF codec, compatible with standard C LZF package

Parallel LZF #14

Closed (whoschek closed this 10 years ago)

whoschek commented 12 years ago

Hi Tatu,

I think using wide instructions via Unsafe looks very promising for decompression. I was wondering to what extent this might also be applied to compression. Compression is becoming more and more the bottleneck.

I am also thinking it might be good to use the plentiful cores available on today's production systems. Here we typically use servers with 12 cores each (2 sockets with 6 cores each). It's a bit of a pity to see only one of 12 cores being used for compression. When the 20-core Intel CPUs (2 sockets with 10 cores each) become cheaper, folks will switch to those. Using many cores for compression might work very well. Decompression is already so fast that it might not benefit as much.

Here we are already using pigz, a parallel gzip implementation (really a drop-in replacement for gzip), for some things like backup: http://www.zlib.net/pigz/. With many cores, the I/O subsystem becomes the bottleneck even with gzip compression, which, as you know, is computationally much more demanding than LZF compression.

Here is an idea for your project. Perhaps ning compress could have one reader thread, one writer thread, and N worker threads. The reader feeds a queue of buffers. The worker threads grab a buffer from the queue, compress (or decompress) it, and hand an output buffer to another queue, where the writer thread grabs it, serializes the order in which output buffers are written (e.g. via a PriorityQueue), and writes the logically next buffer to the output stream once it becomes available. Each buffer could contain one LZF chunk. To keep the CPUs busy, the reader might maintain a read-ahead of 2 buffers per worker thread. Once an output buffer is written, it is recycled via a buffer pool or queue (no GC pressure); input buffers are recycled the same way. Perhaps N=1 could be special-cased to use no threads at all.
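The reader/worker/writer pipeline described above can be sketched roughly as follows. This is only an illustration, not ning-compress code: the class and method names are hypothetical, `java.util.zip.Deflater` stands in for an LZF chunk encoder, and in-order draining of `Future`s replaces the explicit PriorityQueue writer and buffer recycling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.Deflater;

// Hypothetical sketch: compress independent chunks on N worker threads,
// then emit results in the original order (the "writer" role).
public class ParallelChunkCompressor {

    public static byte[][] compressChunks(byte[][] chunks, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            // Submit one task per chunk; the futures list preserves
            // submission order, which serializes the output order.
            List<Future<byte[]>> futures = new ArrayList<>();
            for (byte[] chunk : chunks) {
                futures.add(pool.submit(() -> deflate(chunk)));
            }
            byte[][] out = new byte[chunks.length][];
            for (int i = 0; i < out.length; i++) {
                out[i] = futures.get(i).get(); // blocks until chunk i is done
            }
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for an LZF chunk encoder; Deflater is used only so the
    // sketch is runnable without any external dependency.
    static byte[] deflate(byte[] in) {
        Deflater d = new Deflater();
        d.setInput(in);
        d.finish();
        byte[] buf = new byte[in.length + 64];
        int n = d.deflate(buf);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        byte[][] chunks = new byte[4][];
        for (int i = 0; i < 4; i++) {
            chunks[i] = new byte[8192];
            Arrays.fill(chunks[i], (byte) ('a' + i));
        }
        byte[][] out = compressChunks(chunks, 2);
        System.out.println(out.length + " chunks compressed"); // prints "4 chunks compressed"
    }
}
```

A real implementation would also bound the input queue (the 2-buffers-per-worker read-ahead) and recycle buffers, which plain futures do not capture.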

The pigz C source code available at http://www.zlib.net/pigz/ contains good comments in this direction.

Regards, Wolfgang.

cowtowncoder commented 12 years ago

Interesting. Yes, this sounds like a very interesting idea. Thank you for suggesting it. I had actually seen a reference to this, but had forgotten to read more.

Btw, I was unable to use Unsafe tricks for single-threaded compression so far -- I tried, but in the end failed. :-)

Another somewhat related thing I have been toying with (the idea, that is, not an implementation) is the ability to define a "blindly splittable" format, meaning one would be able to find a block boundary starting from an arbitrary point in a file. This is theoretically possible, with some caveats: for one, the compressor must guarantee that a specific byte sequence never occurs inside compressed data (easiest to guarantee for a sequence that would always be compressed, like a run of 4 or more instances of the same byte), and for another, all blocks must be compressed (otherwise "raw" data could contain such a sequence). But that warrants another issue, obviously.
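The resync idea above can be sketched as a simple forward scan: if the compressor guarantees some byte sequence never appears inside compressed data, a splitter can start at an arbitrary offset and scan for the next occurrence of that sequence to find a block boundary. The four-byte 0xFF marker here is a hypothetical choice for illustration, not part of the LZF format.

```java
// Hypothetical sketch of "blind" resynchronization: find the next block
// boundary marker at or after a given offset. Assumes the (made-up)
// invariant that four consecutive 0xFF bytes only ever appear as a
// boundary marker, never inside compressed block payloads.
public class SyncScanner {
    static final byte[] MARKER = { (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF };

    /** Returns the index of the next marker at or after 'from', or -1 if none. */
    public static int nextBoundary(byte[] data, int from) {
        outer:
        for (int i = Math.max(from, 0); i + MARKER.length <= data.length; i++) {
            for (int j = 0; j < MARKER.length; j++) {
                if (data[i + j] != MARKER[j]) continue outer; // mismatch: slide forward
            }
            return i; // full marker matched
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] stream = new byte[32];
        for (int i = 10; i < 14; i++) stream[i] = (byte) 0xFF; // marker at offset 10
        System.out.println(nextBoundary(stream, 3)); // prints 10
    }
}
```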

whoschek commented 12 years ago

Interesting. I once wrote a byte stuffing codec that might be usable for such a "blindly splittable" format. I still haven't figured out how to attach code to this issue tracking system, but I can email you the two small (BSD-licensed) Java files if you're interested.
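For readers unfamiliar with the technique: byte stuffing reserves a marker byte by escaping every occurrence of it in the payload, so the marker can safely be used as a block delimiter. This is a generic minimal sketch, not Wolfgang's actual codec; the marker and escape values are arbitrary choices for illustration.

```java
import java.io.ByteArrayOutputStream;

// Generic byte-stuffing sketch (not the codec mentioned in the thread):
// MARKER never appears in encoded output, because occurrences of MARKER
// and ESCAPE in the payload are replaced by two-byte escape sequences.
public class ByteStuffer {
    static final byte MARKER = (byte) 0xFF;
    static final byte ESCAPE = (byte) 0xFE;

    public static byte[] encode(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : in) {
            if (b == MARKER)      { out.write(ESCAPE); out.write(0x01); } // escaped marker
            else if (b == ESCAPE) { out.write(ESCAPE); out.write(0x00); } // escaped escape
            else                  { out.write(b); }
        }
        return out.toByteArray();
    }

    public static byte[] decode(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == ESCAPE) {
                out.write(in[++i] == 0x01 ? MARKER : ESCAPE); // undo escape sequence
            } else {
                out.write(in[i]);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] orig = { 1, (byte) 0xFF, 2, (byte) 0xFE };
        byte[] roundTrip = decode(encode(orig));
        System.out.println(java.util.Arrays.equals(orig, roundTrip)); // prints true
    }
}
```

The cost is a worst-case 2x expansion on marker-heavy data; pairing it with a sequence the compressor already never emits (as discussed above) avoids that overhead.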


cowtowncoder commented 12 years ago

Cool, thanks. I'll have a look. For what it's worth, it looks like the Snappy format (alas!) might actually work with a simple sequence of 4 zero bytes... for LZF, a change or two might be needed. But it, too, has a couple of unused values for the first byte of each sequence.

javabean commented 11 years ago

Hi all,

I may have a go at writing a parallel version of LZF if no one has started working on one yet. If street creds are required: I have implemented a parallel gzip compressor in Java (similar to pigz); hope this is enough! :-) (Ping me for the URL; I am not writing here for advertisement.) Tatu, would you be interested in such a contribution?

cowtowncoder commented 11 years ago

I would be absolutely thrilled to get such a contribution! Please let me know if you need help with block-level handling or such. And obviously you can add accessors if/as necessary.

I would also be interested in a link to the project, if that's ok; maybe tweet it to 'cowtowncoder'? I am also fine with adding a link in this issue, unless you'd rather not.

Finally: this package has a small gzip wrapper, so if you have improvements to that, those would be welcome too. But it was mostly added for my own use (I handle smallish payloads with gzip, larger ones with LZF).

cowtowncoder commented 10 years ago

Will be in 0.9.9.