merenlab / illumina-utils

A library and collection of scripts to work with Illumina paired-end data (for CASAVA 1.7+ pipeline).
GNU General Public License v2.0
89 stars 31 forks

Rapid multithreaded merging option #25

Closed by semiller10 4 years ago

semiller10 commented 4 years ago

This is a major addition to iu-merge-pairs that adds an option called --rapid-cores. This parameter takes an integer argument representing the number of cores to use in multiprocessing.

I made this change to speed up read merging for a workflow that I am building in Anvi'o. The new parameter is currently tailored to that workflow and only implements merging with zero mismatches in the overlapping region. Rapid mode alone makes the task about two orders of magnitude faster. Multithreading gives a further large speedup that scales linearly. Together, these bring the runtime for a large dataset of ~15 million 75 bp paired-end reads from ~1 day (1 thread) to 2.5 minutes (16 threads).

My multithreading approach generalizes to chunking any large input file for Python multithreading. The chunk-finding method seeks to evenly distributed positions in the file, one per available core, and returns information on the beginning and end of each chunk. Each read-merging worker then opens the file and seeks to the starting position of its chunk. I found that detecting the end of a chunk by calling tell on every line is much slower than comparing each line against the chunk-ending line string identified by the chunking method, and calling tell only when that string matches, to confirm that the position really ends the chunk and that the line is not a duplicate.
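The chunking idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `find_chunk_starts` and `read_chunk` are hypothetical names, the input is assumed to be an uncompressed FASTQ file with exactly four lines per record, and the file is opened in binary mode so that chunk ends can be found by comparing byte offsets directly, which is a simplification of the line-string check described above.

```python
import os


def find_chunk_starts(path, num_chunks):
    """Return increasing byte offsets at which FASTQ records begin.

    Offsets are spread roughly evenly through the file, one per chunk.
    Assumes four-line FASTQ records (an assumption of this sketch).
    """
    size = os.path.getsize(path)
    starts = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            f.seek(size * i // num_chunks)
            f.readline()  # discard the line we landed in, which is probably partial
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break  # end of file: no further chunk boundary here
                if line.startswith(b"@"):
                    # A header is followed two lines later by the '+' line.
                    # A quality line that happens to start with '@' is not.
                    f.readline()
                    if f.readline().startswith(b"+"):
                        if pos > starts[-1]:
                            starts.append(pos)
                        break
                    # Not a header: step forward by one line and keep looking.
                    f.seek(pos)
                    f.readline()
    return starts


def read_chunk(path, start, end):
    """Yield four-line records whose first byte lies in [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            record = [f.readline().rstrip(b"\r\n") for _ in range(4)]
            if not record[0]:
                break
            yield record
```

Each worker would receive one `(start, end)` pair, with the file size serving as the end of the last chunk. For paired-end input, the two files would also need chunk boundaries that fall on the same record numbers, which this sketch does not handle.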

The merging workers read each FASTQ sequence block (for read 1 and read 2) into memory and process it. The new merging method for (partial and optionally full) overlap with zero mismatches is highly optimized and heavily tested. Lots of debugging was needed to match merging results from rapid and normal modes, and to report proper statistics, but rapid merging is now validated with the various available command line options and large input files.
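To illustrate what merging with zero mismatches means in the partial-overlap case, here is a minimal sketch. It is not the optimized method in this PR: `reverse_complement`, `merge_exact`, and the `min_overlap` default are hypothetical, and the full-overlap case with trailing adapter sequence is not covered.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_exact(read1, read2, min_overlap=15):
    """Merge a read pair if the end of read 1 exactly matches the start
    of the reverse-complemented read 2; return None otherwise.

    The longest exact overlap is tried first. min_overlap is an
    illustrative threshold, not a value taken from iu-merge-pairs.
    """
    r2 = reverse_complement(read2)
    max_overlap = min(len(read1), len(r2))
    for overlap in range(max_overlap, min_overlap - 1, -1):
        if read1[-overlap:] == r2[:overlap]:
            return read1 + r2[overlap:]
    return None
```

Because the comparison is a plain string equality test on slices, no alignment or distance computation is needed, which is one plausible reason an exact-match mode can be much faster than a mode that tolerates mismatches.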

A small but notable change in iu-merge-pairs was the replacement of the --trim-suffix option, recently added by PR https://github.com/merenlab/illumina-utils/pull/24#issue-369977014, with --untrimmed-suffix. Now, by default, merging with full overlap (--marker-gene-stringent) removes trailing adapter sequences, and requires prompting by --untrimmed-suffix to retain these sequences in the merge.

The new multithreading capabilities can be easily extended to other read merging methods (such as the Levenshtein distance minimization of the normal mode), and can hopefully be useful in speeding up Anvi'o workflows, which rely on this program.

meren commented 4 years ago

Thank you for this, @semiller10! I thought --untrimmed-suffix could be --skip-suffix-trimming and --rapid-cores could be called --num-threads :) What do you think?

semiller10 commented 4 years ago

The new options are renamed --num-threads and --skip-suffix-trimming.

All command line options except --compute-qual-dicts now work with multithreading.

Progress tracking is accurate with multithreading.

The default distance metric used for merging in both single- and multithreaded modes is now Hamming rather than Levenshtein, as indel errors are very rare in Illumina reads. This provides a 3x speedup.
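For reference, the difference between the two metrics can be sketched as below. These are generic textbook implementations with hypothetical function names, not the code that iu-merge-pairs calls. Hamming distance counts mismatched positions in a single pass over two equal-length strings, while Levenshtein distance also allows insertions and deletions and needs a dynamic-programming table.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn a into b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

When two overlapping regions differ only by substitutions, both metrics give the same count. They diverge when one sequence is shifted relative to the other: `"ACGTACGT"` and `"CGTACGTA"` differ at every position under Hamming distance but are only two edits apart under Levenshtein distance.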