Open mattmaurano opened 4 years ago
We were evaluating a possible change to the trimming strategy in the dnase pipeline. When looking into our options we considered the following requirements:
filterNextSeqReadsForPolyG.py
? Looking into the options, we elected two tools which I have experience and familiarity: cutadapt and fastp.
Cutadapt is a python package (v.2.9 now available with python3/3.8.2 module) with broad support to adapter trimming. Regarding our criteria:
--nextseq-trim <qual>
flag which works as a quality trimmer removing G bases and those bellow the quality cutoff starting from the read 3' tail.--quality-base
, but the doesn't change the output scheme which may be required.--error-rate
and minimum match length with --overlap
.--length
.-j
see a command example:
## For paired mapping
cutadapt -j ${NSLOTS} --quiet -Z \
--adapter file:${illuminaAdapters} -A file:${illuminaAdapters} --error-rate 0.1 --overlap 1 \
--nextseq-trim 0 \
--length 36 \
--minimum-length 27 \
--output $TMPDIR/${sample1}.fastq.gz --paired-output $TMPDIR/${sample2}.fastq.gz \
${readsFq} ${reads2fq}
## For single-end mapping
cutadapt -j ${NSLOTS} --quiet -Z \
--adapter file:${illuminaAdapters} --error-rate 0.1 --overlap 1 \
--nextseq-trim 0 \
--length 36 \
--minimum-length 27 \
--output $TMPDIR/${sample1}.fastq.gz \
${readsFq}
Fastp is a C/C++ program and is processes FastQ files with high performance.
--trim_poly_g
flag which trims polyG tails from the reads. Different from cutadapt, it allows some mismatches but requires at least 8 G bases to make any cut. Fastp actually auto-detect Next-seq reads and apply this option by default.--phred64
flag, but it changes the output scheme to PHRED32.--max_len1
and --max_len2
.--thread
see a command example:
## paired mapping
fastp --thread ${NSLOTS} -z 1 -Q \
--json $TMPDIR/${sample1}.trim.json --html $TMPDIR/${sample1}.trim.html \
--adapter_fasta ${illuminaAdapters} \
--trim_poly_g \
--max_len1 36 --max_len2 36 \
--length_required 27 \
--unpaired1 $TMPDIR/${sample1}.unpaired.fastq.gz --unpaired2 $TMPDIR/${sample2}.unpaired.fastq.gz \
--out1 $TMPDIR/${sample1}.fastq.gz --out2 $TMPDIR/${sample2}.fastq.gz \
--in1 ${readsFq} --in2 ${reads2fq}
## single-end mapping
fastp fastp --thread ${NSLOTS} -z 1 -Q \
--json $TMPDIR/${sample1}.trim.json --html $TMPDIR/${sample1}.trim.html \
--adapter_fasta ${illuminaAdapters} --trim_poly_g \
--max_len1 36 \
--length_required 27 \
--unpaired1 $TMPDIR/${sample1}.unpaired.fastq.gz \
--out1 $TMPDIR/${sample1}.fastq.gz \
--in1 ${readsFq}
It is important to notice that trimmers like cutadapt and fastp differ from the current protocol.
The current trimming protocol uses filterNextSeqReadsForPolyG.py
to filter reads with 75% or more G bases. Trimmers like cutadapt and fastp instead look into the tail (3' end) of the reads and trims polyG tails from the reads.
cutadapt differs from fastp because it uses a quality trimmer to remove polyG tail, while fastp looks for an actual polyG sequence (allowing some mismatches) with a settable minimum size. For example the trimming of the following read:
## Original read
@ example
ACTTAGAATGACAGTTACCTgcgggggggg
+
++++++++++++++++++++++++++++++
## cutadapt trimming --nextseq-trim 0
## since it has a mismatch, it will does not trim the last two base
@ example
ACTTAGAATGACAGTTACCTgc
+
++++++++++++++++++++++
## cutadapt --nextseq-trim 10
## if we allow it to consider base quality, cutadapt overtrims
@ example
ACTTA
+
+++++
## fastp trimming
@ example
ACTTAGAATGACAGTTACCT
+
++++++++++++++++++++
Andre, can you please summarize our email thread here regarding possible change to trimmer in the dnase pipeline?
Include the alternative packages, the criteria I outlined, and your answers to those questions
TODO: Memory usage?
Other options not investigating further: bbduk https://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-5-337 https://github.com/MikkelSchubert/adapterremoval