mauranolab / mapping

2 stars 1 forks source link

Revamp trimming #186

Open mattmaurano opened 4 years ago

mattmaurano commented 4 years ago

Andre, can you please summarize our email thread here regarding possible change to trimmer in the dnase pipeline?

Include the alternative packages, the criteria I outlined, and your answers to those questions

TODO: Memory usage?

Other options not investigating further: bbduk https://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-5-337 https://github.com/MikkelSchubert/adapterremoval

andremrsantos commented 4 years ago

We were evaluating a possible change to the trimming strategy in the dnase pipeline. When looking into our options we considered the following requirements:

  1. Can trim Next-seq polyG issue so it can replace filterNextSeqReadsForPolyG.py?
  2. Does it auto-detect quality score scheme (PHRED32 or PHRED64)?
  3. Does it include Illumina adapter removal? How much can we control it?
  4. Does it allow to limit read length (cropping)?
  5. Does it have multicore/multithread support?
  6. Does it produce an exit error code?

Looking into the options, we elected two tools which I have experience and familiarity: cutadapt and fastp.

cutadapt

Cutadapt is a python package (v.2.9 now available with python3/3.8.2 module) with broad support to adapter trimming. Regarding our criteria:

  1. Yes, it has --nextseq-trim <qual> flag which works as a quality trimmer removing G bases and those bellow the quality cutoff starting from the read 3' tail.
  2. It doesn't detect the quality scheme. It allows us to set the quality scheme with --quality-base, but the doesn't change the output scheme which may be required.
  3. It supports adapter removal with various control options, like the error rate with --error-rate and minimum match length with --overlap.
  4. It does support read cropping with --length.
  5. It does with -j
  6. Yes, it does.

see a command example:

## For paired mapping
cutadapt -j ${NSLOTS} --quiet -Z \
    --adapter file:${illuminaAdapters} -A file:${illuminaAdapters} --error-rate 0.1 --overlap 1 \
    --nextseq-trim 0 \
    --length 36 \
    --minimum-length 27 \
    --output $TMPDIR/${sample1}.fastq.gz --paired-output $TMPDIR/${sample2}.fastq.gz \
    ${readsFq} ${reads2fq}

 ## For single-end mapping
cutadapt -j ${NSLOTS} --quiet -Z \
    --adapter file:${illuminaAdapters} --error-rate 0.1 --overlap 1 \
    --nextseq-trim 0 \
    --length 36 \
    --minimum-length 27 \
    --output $TMPDIR/${sample1}.fastq.gz \
    ${readsFq}

fastp

Fastp is a C/C++ program and is processes FastQ files with high performance.

  1. Yes, it has --trim_poly_g flag which trims polyG tails from the reads. Different from cutadapt, it allows some mismatches but requires at least 8 G bases to make any cut. Fastp actually auto-detect Next-seq reads and apply this option by default.
  2. It also doesn't auto-detect quality scheme. It allows setting the scheme to PHRED64 with --phred64 flag, but it changes the output scheme to PHRED32.
  3. It supports adapter removal but with no control options.
  4. It does support read cropping with --max_len1 and --max_len2.
  5. It does with --thread
  6. Yes, it does.

see a command example:

## paired mapping
fastp --thread ${NSLOTS} -z 1 -Q \
    --json $TMPDIR/${sample1}.trim.json --html $TMPDIR/${sample1}.trim.html \
    --adapter_fasta ${illuminaAdapters} \
    --trim_poly_g \
    --max_len1 36 --max_len2 36 \
    --length_required 27 \
    --unpaired1 $TMPDIR/${sample1}.unpaired.fastq.gz --unpaired2 $TMPDIR/${sample2}.unpaired.fastq.gz \
    --out1 $TMPDIR/${sample1}.fastq.gz --out2 $TMPDIR/${sample2}.fastq.gz \
    --in1 ${readsFq} --in2 ${reads2fq}

## single-end mapping
fastp fastp --thread ${NSLOTS} -z 1 -Q \
    --json $TMPDIR/${sample1}.trim.json --html $TMPDIR/${sample1}.trim.html \
    --adapter_fasta ${illuminaAdapters} --trim_poly_g \
    --max_len1 36 \
    --length_required 27 \
    --unpaired1 $TMPDIR/${sample1}.unpaired.fastq.gz \
    --out1 $TMPDIR/${sample1}.fastq.gz \
    --in1 ${readsFq}

PolyG trimming details

It is important to notice that trimmers like cutadapt and fastp differ from the current protocol.

The current trimming protocol uses filterNextSeqReadsForPolyG.py to filter reads with 75% or more G bases. Trimmers like cutadapt and fastp instead look into the tail (3' end) of the reads and trims polyG tails from the reads.

cutadapt differs from fastp because it uses a quality trimmer to remove polyG tail, while fastp looks for an actual polyG sequence (allowing some mismatches) with a settable minimum size. For example the trimming of the following read:

## Original read
@ example
ACTTAGAATGACAGTTACCTgcgggggggg
+
++++++++++++++++++++++++++++++

## cutadapt trimming --nextseq-trim 0
## since it has a mismatch, it will does not trim the last two base
@ example
ACTTAGAATGACAGTTACCTgc
+
++++++++++++++++++++++

## cutadapt --nextseq-trim 10
## if we allow it to consider base quality, cutadapt overtrims
@ example
ACTTA
+
+++++

## fastp trimming
@ example
ACTTAGAATGACAGTTACCT
+
++++++++++++++++++++