Open peterjc opened 5 years ago
As a bonus, FASTP v0.19.8 onwards can also do read merging, so it could replace Pear / Flash too:
It works, proof of principle based on using:
def run_fastp(left_in, right_in, pair_out, adapters=None, debug=False, cpu=0):
    """Run FASTP to do quality trimming and merge overlapping pairs.

    The input FASTQ files may be gzipped.
    """
    cmd = [
        "fastp",
        "-i",
        left_in,
        "-I",
        right_in,
        "-m",
        "--merged_out",
        pair_out,
        "--detect_adapter_for_pe",
    ]
    # Quoting FASTP documentation:
    #
    # The most widely used adapter is the Illumina TruSeq adapters.
    # If your data is from the TruSeq library, you can add
    # --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
    # --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    # to your command lines, or enable auto detection for PE data
    # by specifing detect_adapter_for_pe.
    #
    if cpu:
        # -w, --thread
        # worker thread number, default is 2 (int [=2])
        cmd += ["-w", str(cpu)]
    return run(cmd, debug=debug)
This isn't any faster on my desktop, run time seems about the same.
Turning off the adapter trimming (-A) appears to cut the prepare-read time in half, and yet gave the same output (at least at the -a 100 abundance threshold) as with it enabled.
Switching to --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA and --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT (i.e. hard coding for TruSeq, much like how we invoke Trimmomatic) is equally fast, and gives practically the same output once again (in just one sample, the top read abundance of over 20k rose by one).
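For anyone wanting to repeat the comparison, the three adapter setups tried above could be selected with a small helper along these lines. This is only a sketch: the function name `adapter_args` and the mode labels "auto", "off" and "truseq" are made up for illustration, while the flags and adapter sequences are the ones quoted above.

```python
# TruSeq adapter sequences as quoted from the FASTP documentation above.
TRUSEQ_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
TRUSEQ_R2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"


def adapter_args(mode="auto"):
    """Return the FASTP adapter arguments for one of the three setups tried.

    The function name and mode labels are hypothetical, for illustration only.
    """
    if mode == "auto":
        # Auto-detect adapters for paired-end data (the first experiment).
        return ["--detect_adapter_for_pe"]
    if mode == "off":
        # Disable adapter trimming entirely.
        return ["-A"]
    if mode == "truseq":
        # Hard-code the TruSeq adapters.
        return [
            f"--adapter_sequence={TRUSEQ_R1}",
            f"--adapter_sequence_r2={TRUSEQ_R2}",
        ]
    raise ValueError(f"Unknown adapter mode: {mode!r}")
```

The returned list would replace the hard-coded "--detect_adapter_for_pe" entry when building the command.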
Attempting to match the apparently more relaxed merging rules of Flash makes a slight difference to the output (+/- one on the read abundances), but that's all. No change in speed:
# FASTP default minimum overlap length is 30,
# Flash default is -m 10
"--overlap_len_require",
"10",
# FASTP default overlap diff percentage is 20,
# Flash default is -x 25
"--overlap_diff_percent_limit",
"25",
I logged this issue before dropping trimmomatic and just using flash directly on the raw reads (#314). We still call cutadapt afterwards, but now do demultiplexing of multiple amplicon primers at that point.
As far as I can see, FASTP does not cover the primer detection, trimming and demultiplexing - but it could replace (trimmomatic and) flash.
It would be worth repeating the above experiments, comparing raw data given to flash (the current pipeline) against raw data given to FASTP.
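A rough way to do the speed side of that comparison would be to time each command a few times and take the fastest run. The sketch below is generic: `time_command` is a hypothetical helper, and the actual flash and FASTP command lines are deliberately not filled in here and would need to be supplied.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run cmd several times, return the fastest wall-clock time in seconds.

    Hypothetical helper; cmd is a list of arguments as passed to subprocess.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the tool fails.
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; swap in the real
    # flash and fastp argument lists to compare them.
    demo = [sys.executable, "-c", "pass"]
    print(f"fastest of 3 runs: {time_command(demo):.3f}s")
```

Taking the minimum rather than the mean reduces the influence of disk caching and other background noise on a desktop machine.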
In addition to possibly being faster, FASTP is still under active development https://github.com/OpenGene/fastp/graphs/code-frequency (recent activity in 2021 after a quiet few years), whereas Flash is stable at v1.2.11 from August 2014 at http://ccb.jhu.edu/software/FLASH/ or https://sourceforge.net/projects/flashpage/files/
This isn't currently a bottleneck, but if time allows we might want to look at FASTP?