peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License

Look at FASTP in place of (trimmomatic and) flash? #92

Open peterjc opened 5 years ago

peterjc commented 5 years ago

This isn't currently a bottleneck, but if time allows we might want to look at FASTP?

peterjc commented 5 years ago

As a bonus, FASTP v0.19.8 onwards can also do read merging, so it could replace Pear / Flash too:

https://github.com/OpenGene/fastp#merge-paired-end-reads

peterjc commented 4 years ago

It works. Proof of principle, based on using:

def run_fastp(left_in, right_in, pair_out, adapters=None, debug=False, cpu=0):
    """Run FASTP to do quality trimming and merge overlapping pairs.

    The input FASTQ files may be gzipped.
    """
    cmd = [
        "fastp",
        "-i",
        left_in,
        "-I",
        right_in,
        "-m",
        "--merged_out",
        pair_out,
        "--detect_adapter_for_pe",
    ]
    # Quoting FASTP documentation:
    #
    #     The most widely used adapter is the Illumina TruSeq adapters.
    #     If your data is from the TruSeq library, you can add
    #     --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
    #     --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    #     to your command lines, or enable auto detection for PE data
    #     by specifying detect_adapter_for_pe.
    #
    if cpu:
        # -w, --thread
        # worker thread number, default is 2 (int [=2])
        cmd += ["-w", str(cpu)]
    return run(cmd, debug=debug)
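
The function above calls a `run` helper that is not shown in this comment. A minimal stand-in, assuming it simply wraps `subprocess.run` and raises on failure (the body below is an illustrative guess, not the project's actual helper), might look like this:

```python
import subprocess
import sys


def run(cmd, debug=False):
    """Run a command given as a list of strings, raising on failure.

    Hypothetical stand-in for the helper used by run_fastp above;
    the real helper may behave differently.
    """
    if debug:
        sys.stderr.write("Calling command: %s\n" % " ".join(cmd))
    # check=True raises CalledProcessError on a non-zero exit status;
    # capture_output=True collects stdout/stderr instead of printing them.
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing the command as a list (rather than a single string) avoids shell quoting problems with file names.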

This isn't any faster on my desktop; the run time seems about the same.

Turning off the adapter trimming (-A) appears to cut the prepare-read time in half, and yet gives the same output (at least at the -a 100 abundance threshold) as with trimming enabled.

peterjc commented 4 years ago

Switching to --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA and --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT (i.e. hard coding for TruSeq, much like how we invoke Trimmomatic) is equally fast, and gives practically the same output once again (in just one sample, the top read abundance of over 20k rose by one).
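
The three adapter settings tried so far (auto-detection, trimming disabled with -A, and hard-coded TruSeq sequences) could be switched from one place. This is only a sketch: the mode names `detect`, `off` and `truseq` and both function names are made up for illustration, while the FASTP flags are the ones quoted in this thread.

```python
# TruSeq adapter sequences as quoted from the FASTP documentation above.
TRUSEQ_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
TRUSEQ_R2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"


def fastp_adapter_args(mode):
    """Return FASTP adapter arguments for a (hypothetical) mode name."""
    if mode == "detect":
        return ["--detect_adapter_for_pe"]
    if mode == "off":
        # -A disables adapter trimming
        return ["-A"]
    if mode == "truseq":
        return [
            "--adapter_sequence=" + TRUSEQ_R1,
            "--adapter_sequence_r2=" + TRUSEQ_R2,
        ]
    raise ValueError("Unknown adapter mode: %r" % mode)


def fastp_merge_cmd(left_in, right_in, pair_out, adapter_mode="truseq", cpu=0):
    """Build (but do not run) a FASTP merge command as a list of strings."""
    cmd = ["fastp", "-i", left_in, "-I", right_in, "-m", "--merged_out", pair_out]
    cmd += fastp_adapter_args(adapter_mode)
    if cpu:
        # -w sets the worker thread number
        cmd += ["-w", str(cpu)]
    return cmd
```

Building the command separately from running it makes it easy to log or compare the exact arguments used for each timing run.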

peterjc commented 4 years ago

Attempting to match the apparently more relaxed merging rules of Flash makes a slight difference to the output (+/- one on the read abundances), but that's all. No change in speed:

        # FASTP default minimum overlap length is 30,
        # Flash default is -m 10
        "--overlap_len_require",
        "10",
        # FASTP default overlap diff percentage limit is 20,
        # Flash default is -x 0.25 (i.e. 25%)
        "--overlap_diff_percent_limit",
        "25",
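
As a sketch of how these two settings could be kept separate from the rest of the command, the snippet below holds them in a mapping and expands them into arguments. The names `FLASH_LIKE_OVERLAP` and `overlap_args` are hypothetical; the flags and values are those from the snippet above.

```python
# Merge thresholds chosen to mimic Flash's defaults, per the comments above.
FLASH_LIKE_OVERLAP = {
    "--overlap_len_require": 10,  # FASTP default is 30
    "--overlap_diff_percent_limit": 25,  # FASTP default is 20
}


def overlap_args(settings):
    """Expand a flag-to-value mapping into a flat list of string arguments."""
    args = []
    for flag, value in settings.items():
        args += [flag, str(value)]
    return args
```

The resulting list can be appended to the FASTP command list; passing an empty mapping leaves FASTP on its own defaults.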

peterjc commented 2 years ago

I logged this issue before dropping trimmomatic and just using flash directly on the raw reads (#314). We still call cutadapt afterwards, but now do demultiplexing of multiple amplicon primers at that point.

As far as I can see, FASTP does not cover the primer detection, trimming and demultiplexing - but it could replace (trimmomatic and) flash.

It would be worth repeating the above experiments, giving raw data to flash (the current pipeline) versus raw data to FASTP.
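
A rough way to set up that comparison is sketched below. The function names, file names and the best-of-three timing choice are all assumptions for illustration; the Flash options used (-d output directory, -o output prefix, -t threads) and the FASTP options are the tools' documented flags. It only measures wall-clock time, so the merged outputs would still need comparing separately.

```python
import subprocess
import time


def flash_cmd(left_in, right_in, out_dir, prefix, cpu=0):
    """Build a Flash command for merging raw paired reads."""
    cmd = ["flash", "-d", out_dir, "-o", prefix]
    if cpu:
        cmd += ["-t", str(cpu)]
    return cmd + [left_in, right_in]


def fastp_cmd(left_in, right_in, merged_out, cpu=0):
    """Build a FASTP command for trimming and merging raw paired reads."""
    cmd = ["fastp", "-i", left_in, "-I", right_in, "-m", "--merged_out", merged_out]
    if cpu:
        cmd += ["-w", str(cpu)]
    return cmd


def time_command(cmd, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard tool output so terminal I/O does not affect the timing.
        subprocess.run(
            cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        timings.append(time.perf_counter() - start)
    return min(timings)
```

For example, `time_command(flash_cmd("R1.fastq.gz", "R2.fastq.gz", "out", "sample", cpu=4))` and the matching `fastp_cmd` call could be run on the same sample with the same thread count.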

In addition to possibly being faster, FASTP is still under active development https://github.com/OpenGene/fastp/graphs/code-frequency (recent activity in 2021 after a quiet few years), whereas Flash is stable at v1.2.11 from August 2014 at http://ccb.jhu.edu/software/FLASH/ or https://sourceforge.net/projects/flashpage/files/