mourisl / Lighter

Fast and memory-efficient sequencing error corrector
GNU General Public License v3.0

Optional level-1 gzip output? #14

Closed: lh3 closed this issue 9 years ago

lh3 commented 9 years ago

Given compressed input, lighter only uses one CPU core during the error correction phase even if a large -t is specified. I guess this is because writing compressed fastq is much slower than the multi-threaded error correction. It would be good to let the user specify the zlib compression level of the output. The default is level 6. Level 1 is several times faster without much loss in compression ratio.
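The level 1 versus level 6 trade-off described above can be checked with a small experiment. The following is only a sketch in Python using the standard `zlib` module, not Lighter's code; `make_fastq` and `compare_levels` are hypothetical helpers, and the synthetic reads (random bases, constant quality string) are an assumption, so the timings and sizes on real data will differ.

```python
import random
import time
import zlib


def make_fastq(n_reads=2000, read_len=100, seed=0):
    """Build synthetic FASTQ text (random bases, constant qualities)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{'I' * read_len}\n")
    return "".join(records).encode("ascii")


def compare_levels(data, levels=(1, 6)):
    """Return {level: (seconds, compressed_bytes)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Sanity check: the compressed stream must round-trip.
        assert zlib.decompress(compressed) == data
        results[level] = (elapsed, len(compressed))
    return results


if __name__ == "__main__":
    data = make_fastq()
    for level, (seconds, size) in compare_levels(data).items():
        print(f"level {level}: {seconds:.4f}s, {size} of {len(data)} bytes")
```

Running it on a sample of your own reads rather than synthetic ones gives a more realistic picture of how much compression ratio is given up for the speed gain.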

In the long term, an even better solution is to put file I/O on a separate thread, such that lighter could write the compressed output while correcting reads. Quite a few high-performance tools (e.g. kmc and jellyfish, I believe) are using this trick. It is very effective.
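The separate I/O thread idea can be sketched as a producer/consumer pair: the correction code pushes finished batches onto a bounded queue, and a dedicated writer thread compresses and writes them. This is an illustrative Python sketch, not how lighter, kmc, or jellyfish implement it; `correct_batch`, `write_batches`, `run_pipeline`, and the `max_pending` parameter are hypothetical names, and the correction step is a placeholder that returns its input unchanged.

```python
import gzip
import queue
import threading


def write_batches(path, batches, compresslevel=1):
    """Drain the queue and gzip-write each chunk until a None sentinel arrives."""
    with gzip.open(path, "wb", compresslevel=compresslevel) as out:
        while True:
            chunk = batches.get()
            if chunk is None:
                break
            out.write(chunk)


def correct_batch(batch):
    """Hypothetical stand-in for error correction; returns the batch unchanged."""
    return batch


def run_pipeline(input_batches, path, compresslevel=1, max_pending=4):
    """Correct batches on the calling thread while a writer thread compresses output."""
    # The bounded queue stops the producer from running far ahead of the writer.
    batches = queue.Queue(maxsize=max_pending)
    writer = threading.Thread(
        target=write_batches, args=(path, batches, compresslevel)
    )
    writer.start()
    try:
        for batch in input_batches:
            batches.put(correct_batch(batch))
    finally:
        batches.put(None)  # Sentinel: tell the writer there is no more data.
        writer.join()
```

The sketch does not handle a writer that fails partway through (the producer would block on a full queue); a real implementation would need to propagate that error back.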

mourisl commented 9 years ago

Thank you very much! I didn't notice the compression level at all! I've added the option and set the default compression level to 1. I've been working on optimizing the parallelization these days and have also implemented the method you suggested above. But there seems to be no obvious speedup when the number of threads is large on my machine. Nevertheless, I've uploaded it and will try other implementations.

mourisl commented 9 years ago

I just uploaded a better implementation, and it improves the speed, especially for gzip'ed files. Please let me know whether it works. Thanks.

lh3 commented 9 years ago

This is much better. On my data, wall-clock time is reduced from 23h to 5h on compressed input (3h15m on uncompressed input). I'll close the issue.