marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Compression level revisit #808

Open rhpvorderman opened 1 month ago

rhpvorderman commented 1 month ago

Things have changed since #425:

Running the following command:

```
/usr/bin/time cutadapt --compression-level X -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o ramdisk/out_r1.fastq.gz -p ramdisk/out_r2.fastq.gz ~/test/5millionreads_R1.fastq.gz ~/test/5millionreads_R2.fastq.gz && wc -c ramdisk/*.fastq.gz
```

| Compression level | runtime (s) | filesize (MiB) |
|---|---|---|
| 5 (default) | 78.4 | 693 |
| 4 | 69.1 | 710 |
| 3 | 55.5 | 740 |
| 2 | 36.6 | 781 |
| 1 | 36.2 | 781 |
| 0 (no compression in gzip container) | 31.8 | 3405 |
| None (no gzip) | 31.0 | 3405 |
Relative to compression level 1:

| Compression level | runtime | filesize |
|---|---|---|
| 5 (default) | 2.17 | 0.89 |
| 4 | 1.91 | 0.91 |
| 3 | 1.53 | 0.95 |
| 2 | 1.01 | 1.00 |
| 1 | 1.00 | 1.00 |
| 0 (no compression in gzip container) | 0.88 | 4.36 |
| None (no gzip) | 0.86 | 4.36 |
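
For anyone who wants to reproduce the shape of these numbers without a full cutadapt run, a minimal sketch is to time stdlib `zlib` at each level on the same buffer. This is only an approximation: cutadapt writes through xopen, which can use faster backends such as ISA-L, so absolute timings and sizes will differ, and the input path below is a placeholder.

```python
# Rough sketch of the measurement behind the table above: compress the same
# buffer with stdlib zlib at several levels, report wall time and output size.
# The FASTQ path is a placeholder; any large text file will do.
import time
import zlib


def benchmark(data: bytes, levels=range(0, 6)) -> None:
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        print(f"level {level}: {elapsed:6.2f} s, "
              f"{len(compressed) / len(data):.2%} of original size")


if __name__ == "__main__":
    with open("5millionreads_R1.fastq", "rb") as f:  # placeholder input
        benchmark(f.read())
```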

Current defaults:

marcelm commented 1 month ago

I’ve recently dealt with an issue in strobealign that made me a bit more sensitive to the relative overhead introduced by using compressed files. It turned out that decompressing (not even compressing) the input FASTQ was preventing us from using more than ~20 threads at a time. Someone contributed a PR that switches to ISA-L for decompression and does the decompression in a separate thread. This now allows us to saturate 128 cores.
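
In Python terms the general pattern looks roughly like the sketch below: a background thread does the decompression and hands chunks to the consumer through a queue. This is illustrative only, not the actual strobealign/PR code; it falls back to stdlib gzip when python-isal (the ISA-L bindings) is not installed, and the path is a placeholder.

```python
# Sketch: decompress in a separate thread, hand chunks to the consumer via a
# queue. Not the actual strobealign code, just the general pattern.
import queue
import threading

try:
    from isal import igzip as gzip_impl  # ISA-L-backed drop-in, if python-isal is installed
except ImportError:
    import gzip as gzip_impl


def reader_thread(path: str, chunks: "queue.Queue[bytes]", chunk_size: int = 1 << 20) -> None:
    with gzip_impl.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunks.put(chunk)
    chunks.put(b"")  # sentinel: end of stream


def consume(path: str) -> int:
    chunks: "queue.Queue[bytes]" = queue.Queue(maxsize=8)
    t = threading.Thread(target=reader_thread, args=(path, chunks), daemon=True)
    t.start()
    total = 0
    while True:
        chunk = chunks.get()
        if not chunk:
            break
        total += len(chunk)  # real code would parse/align the reads here
    t.join()
    return total


if __name__ == "__main__":
    print(consume("reads_R1.fastq.gz"))  # placeholder path
```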

So I’m inclined to agree the default compression level can be reduced further. What’s your suggestion?

(My view is, or maybe was, still a bit colored by the disk space quota limits I hit regularly. I guess I kind of want to help other people avoid those. But then I also see people storing totally uncompressed FASTQ and even SAM files ...)

rhpvorderman commented 1 month ago

My gut feeling is to use about 10% of the compute time for compression and to compress as well as possible within that budget. Using less than 10% of the compute time hardly makes a difference in the overall runtime; using more seems wasteful to me. It seems ISA-L's zlib compression manages that at around ~12% of the compute time while still producing a fairly small result, so it sort of hits the sweet spot for me. I always use the -Z flag. But I am probably one of the most biased guys on the internet when it comes to compression, so don't take my word for it ;-).
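
As a rough check of that figure against the table above, one can treat the "None (no gzip)" run as a compression-free baseline and estimate the share of each run spent compressing. This is only a back-of-the-envelope sketch and ignores pipelining effects.

```python
# Back-of-the-envelope from the table above: estimate the share of total
# runtime spent on output compression, using the "None (no gzip)" run as the
# compression-free baseline.
RUNTIMES = {  # seconds, copied from the table
    "5 (default)": 78.4,
    "4": 69.1,
    "3": 55.5,
    "2": 36.6,
    "1": 36.2,
    "0 (no compression)": 31.8,
    "none (no gzip)": 31.0,
}

baseline = RUNTIMES["none (no gzip)"]
for level, runtime in RUNTIMES.items():
    share = (runtime - baseline) / runtime
    print(f"level {level}: ~{share:.0%} of runtime spent compressing")
```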

The problem with gzip is that decompression can hardly be multithreaded. Other formats are a bit better at this. On the other hand, 1 GB/s decompression is quite fast already.

> Someone contributed a PR that switches to ISA-L for decompression and does the decompression in a separate thread. This now allows us to saturate 128 cores.

Nice, for paired-end data that gives you a 2 GB/s input stream, right? That's a lot of data to run local alignment on. Do you use any vectorized libraries for the Smith-Waterman already?

> (My view is, or maybe was, still a bit colored by the disk space quota limits I hit regularly. I guess I kind of want to help other people avoid those. But then I also see people storing totally uncompressed FASTQ and even SAM files ...)

I can relate. Running out of disk space happens frequently at our institute too. But the 10% extra compression of gzip level 5 compared to level 1 just isn't cutting it in that case: if I need to cut a whole WGS run into 4 batches to make sure I don't run into disk space issues, 10% is not helpful. 50% better compression (files that are 66% of the size) helps a lot, because then I can run just 3 batches. In the 4-batch case I'd rather lose the 10% of extra disk space if it means my jobs finish a lot faster, because then I can finish the project faster. I concede that this viewpoint is very much coloured by my use case.
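
To make the batching argument concrete, here is a toy calculation with made-up quota and run-size numbers (they are not from the table above): a 10% smaller file rarely changes the number of batches needed, while files at ~66% of the size can drop a whole batch.

```python
# Toy illustration of the batching argument; the quota and run size are
# hypothetical numbers, not measurements.
import math

QUOTA_GIB = 2700        # hypothetical disk quota per batch
RUN_OUTPUT_GIB = 10000  # hypothetical total output of one WGS run at gzip level 1

for label, relative_size in [
    ("gzip level 1", 1.00),
    ("gzip level 5 (~10% smaller)", 0.90),
    ("stronger codec (~66% of size)", 0.66),
]:
    batches = math.ceil(RUN_OUTPUT_GIB * relative_size / QUOTA_GIB)
    print(f"{label}: {batches} batches")
```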