hendrikmuhs opened 9 years ago
It would definitely make sense to compress the blocks instead of compressing the individual text strings. If @mortal has time, perhaps he can tell us what the best approach would be. If you want to implement this yourself, that would be great; we can probably allocate some time for @svendcsvendsen to help you.
Using Snappy for compression in the serialization_sorter definitely makes a lot of sense for situations like this. @mortal implemented the serialization code and knows it best, but I'll definitely be available if you need help with the implementation.
Actually, block-wise compression makes more sense for serialization streams than ordinary streams, since serialization streams do not support seek.
The four stream classes serialization{_reverse,}{_reader,_writer} are derived from bits::serialization_{reader,writer}_base, and the two base classes implement read_block and write_block, which the stream classes use more or less as a black box.
Compressed serialization streams should ideally be implemented to use the compressor thread, passing in read and write requests which support both forward and backward reading -- exactly what the serialization_reverse_reader needs.
Perhaps process_read_request and process_write_request are a good place to start learning how the compressed streams work.
Hi,
I am using serialization_sorter.h to sort huge amounts of key-value data (strings, variable length).
Is it possible and do you think it makes sense to implement snappy compression for it? What would be the best place?
I would think here: https://github.com/thomasmoelhave/tpie/blob/master/tpie/serialization_stream.h
I also considered compressing at least the values myself in serialize and unserialize, but since my values are only around 50-400 characters, compressing such short strings separately will not be very effective.
I think block-wise compression would make more sense.
(I would implement it myself and send you a PR)