thomasmoelhave / tpie

Templated Portable I/O Environment

Snappy compression for serialization_sorter.h ??? #186

Open hendrikmuhs opened 9 years ago

hendrikmuhs commented 9 years ago

Hi,

I am using serialization_sorter.h to sort huge amounts of key-value data (strings, variable length).

Is it possible, and do you think it makes sense, to implement Snappy compression for it? What would be the best place?

I would think here: https://github.com/thomasmoelhave/tpie/blob/master/tpie/serialization_stream.h

I also considered compressing at least the values myself in serialize and unserialize, but as my values are only about 50-400 characters long, compressing these short strings separately would not be very effective.

I think block-wise compression would make more sense.
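To make the block-wise idea concrete, here is a minimal sketch, not TPIE code. `BlockWriter`, `Compressor` and `read_records` are hypothetical names invented for illustration, and the compressor is left pluggable so that a Snappy wrapper could be dropped in. Records are length-prefixed and collected into one buffer, and the compressor only ever sees a whole block, so redundancy across many short values becomes visible to it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical compressor signature: raw block in, bytes to store out.
// A real build could wrap a Snappy call here; that part is an assumption.
using Compressor = std::function<std::string(const std::string&)>;

// Hypothetical helper (not part of TPIE): collects length-prefixed records
// and emits one compressed block once block_size raw bytes have accumulated.
class BlockWriter {
public:
    BlockWriter(std::size_t block_size, Compressor compress)
        : block_size_(block_size), compress_(std::move(compress)) {
        assert(block_size_ > 0);
    }

    // Append one record as [length:u32][bytes]; flush when the block is full.
    void write(const std::string& record) {
        std::uint32_t len = static_cast<std::uint32_t>(record.size());
        char prefix[sizeof len];
        std::memcpy(prefix, &len, sizeof len);
        raw_.append(prefix, sizeof len);
        raw_.append(record);
        if (raw_.size() >= block_size_) flush();
    }

    // Compress whatever is buffered as a single block.
    void flush() {
        if (raw_.empty()) return;
        blocks_.push_back(compress_(raw_));
        raw_.clear();
    }

    const std::vector<std::string>& blocks() const { return blocks_; }

private:
    std::size_t block_size_;
    Compressor compress_;
    std::string raw_;                  // uncompressed records of the current block
    std::vector<std::string> blocks_;  // stands in for the output file
};

// Decode the records of one block that has already been decompressed.
std::vector<std::string> read_records(const std::string& raw) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos + sizeof(std::uint32_t) <= raw.size()) {
        std::uint32_t len;
        std::memcpy(&len, raw.data() + pos, sizeof len);
        pos += sizeof len;
        out.push_back(raw.substr(pos, len));
        pos += len;
    }
    return out;
}
```

With an identity function as the compressor, the sketch can be exercised without any compression library. Swapping in a real compressor changes only the callback, not the record framing.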

(I would implement it myself and send you a PR)

antialize commented 9 years ago

It would definitely make sense to compress the blocks instead of compressing the individual text strings. If @mortal has time, perhaps he can tell us what the best approach would be. If you want to implement this, that is good; we can probably allocate some time for @svendcsvendsen to help you.

svendcs commented 9 years ago

Using Snappy for compression in the serialization_sorter definitely makes a lot of sense for situations like this. @mortal implemented the serialization code and knows the most about it; however, I'll definitely be available if you need help with the implementation.

Mortal commented 9 years ago

Actually, block-wise compression makes more sense for serialization streams than ordinary streams, since serialization streams do not support seek.

The four stream classes serialization{_reverse,}{_reader,_writer} derive from bits::serialization_{reader,writer}_base, and the two base classes implement read_block and write_block, which the stream classes use more or less as a black box.

Compressed serialization streams should ideally be implemented to use the compressor thread, passing in read and write requests which support both forward and backward reading -- exactly what the serialization_reverse_reader needs.

Perhaps process_read_request and process_write_request are a good place to start learning how the compressed streams work.
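As a rough illustration of why variable-size compressed blocks can still be read in both directions, here is a sketch of one possible framing. It is an assumption made for illustration, not necessarily the format TPIE's compressed streams use, and `append_block`, `read_forward` and `read_backward` are hypothetical names. Each block stores its payload size both before and after the payload, so a reader at either end of a block can find the other end without an index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical framing (an assumption, not TPIE's documented format):
//   [size:u32][payload bytes][size:u32]
// The std::string "file" stands in for the on-disk stream.

// Append one (already compressed) payload with a size header and footer.
void append_block(std::string& file, const std::string& payload) {
    std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    char buf[sizeof n];
    std::memcpy(buf, &n, sizeof n);
    file.append(buf, sizeof n);
    file.append(payload);
    file.append(buf, sizeof n);
}

// Read the block that starts at pos and advance pos past it.
std::string read_forward(const std::string& file, std::size_t& pos) {
    std::uint32_t n;
    assert(pos + sizeof n <= file.size());
    std::memcpy(&n, file.data() + pos, sizeof n);
    assert(pos + 2 * sizeof n + n <= file.size());
    std::string payload = file.substr(pos + sizeof n, n);
    pos += 2 * sizeof n + n;
    return payload;
}

// Read the block that ends at pos and move pos back to its start.
std::string read_backward(const std::string& file, std::size_t& pos) {
    std::uint32_t n;
    assert(pos >= sizeof n && pos <= file.size());
    std::memcpy(&n, file.data() + pos - sizeof n, sizeof n);  // footer
    assert(pos >= 2 * sizeof n + n);
    std::size_t start = pos - 2 * sizeof n - n;
    std::string payload = file.substr(start + sizeof n, n);
    pos = start;
    return payload;
}
```

The cost is four extra bytes per block. In exchange, a reverse reader can start at the end of the file and walk block by block toward the beginning, decompressing each payload as it goes.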