jdidion / atropos

An NGS read trimming tool that is specific, sensitive, and speedy. (production)

Auto trim feature #67

Closed wckdouglas closed 6 years ago

wckdouglas commented 6 years ago

Hi @jdidion, this pull request fixes issue #60 and, I think, part of issue #65.

Briefly, I added an auto-trim modifier (trim.modifiers.AutoAdapterCutter) that retains only the overlapping parts of paired-end reads when --aligner insert is used and no adapter sequences are given. The modifier uses --insert-match-error-rate, --insert-max-rmp and --minimum-length to control the trimming behavior.
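For illustration only, the overlap trimming described above can be sketched in plain Python. This is not the code in trim.modifiers.AutoAdapterCutter: the names `revcomp` and `trim_to_overlap` and their parameters are hypothetical, the random-match probability check behind --insert-max-rmp is left out, and the sketch only covers inserts that are no longer than the reads.

```python
# Hypothetical sketch, not the atropos implementation.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_to_overlap(read1, read2, min_length=10, max_error_rate=0.0):
    """Keep only the part of each read that is covered by its mate.

    When the insert is shorter than the reads, the first k bases of read1
    are the reverse complement of the first k bases of read2, and anything
    past position k is adapter. Try the longest k first; if no k of at
    least min_length passes the error-rate threshold, return the pair
    unmodified.
    """
    longest = min(len(read1), len(read2))
    for k in range(longest, min_length - 1, -1):
        candidate = revcomp(read2[:k])
        mismatches = sum(a != b for a, b in zip(read1[:k], candidate))
        if mismatches <= max_error_rate * k:
            return read1[:k], read2[:k]
    return read1, read2


# read1 from tests/data/paired.{1,2}.fastq, exact matching only
print(trim_to_overlap("TTATTTGTCTCCAGCTTAGACATATCGCCT",
                      "GCTGGAGACAAATAACAGTGGAGTAGTTTT"))
# prints ('TTATTTGTCTCCAGC', 'GCTGGAGACAAATAA')
```

Searching from the longest candidate overlap downward is a choice made for this sketch; a real implementation would also have to weigh how likely a short overlap is to match by chance.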

Usage:

Example 1:

$ atropos trim \
    -pe1 tests/data/paired.1.fastq \
    -pe2 tests/data/paired.2.fastq \
    --aligner insert \
    --interleaved-out - --quiet
@read1/1 some text
TTATTTGTCTCCAGC
+
##HHHHHHHHHHHHH
@read1/2 other text
GCTGGAGACAAATAA
+
HHHHHHHHHHHHHHH
@read2/1
CAACAGGCCACATTAGACATATCGGATGGT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@read2/2
TGTGGCCTGTTGCAGTGGAGTAACTCCAGC
+
###HHHHHHHHHHHHHHHHHHHHHHHHHHH
@read3/1
CCAACTTGATATTAATAACA
+
HHHHHHHHHHHHHHHHHHHH
@read3/2
TGTTATTAATATCAAGTTGG
+
#HHHHHHHHHHHHHHHHHHH
@read4/1
GACAGGCCGTTTGAATGTTGACGGGATG
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@read4/2
CATCCCGTCAACATTCAAACGGCCTGTC
+
HH##########################

Example 2:

# tweaking --minimum-length and --insert-max-rmp to trim off
# TTAGACATATCGGATGGT from read2/1 and CAGTGGAGTAACTCCAGC from read2/2
$ atropos trim \
    -pe1 tests/data/paired.1.fastq \
    -pe2 tests/data/paired.2.fastq \
    --aligner insert \
    --interleaved-out - \
    --quiet \
    --minimum-length 10 \
    --insert-max-rmp 1
@read1/1 some text
TTATTTGTCTCCAGC
+
##HHHHHHHHHHHHH
@read1/2 other text
GCTGGAGACAAATAA
+
HHHHHHHHHHHHHHH
@read2/1
CAACAGGCCACA
+
HHHHHHHHHHHH
@read2/2
TGTGGCCTGTTG
+
###HHHHHHHHH
@read3/1
CCAACTTGATATTAATAACA
+
HHHHHHHHHHHHHHHHHHHH
@read3/2
TGTTATTAATATCAAGTTGG
+
#HHHHHHHHHHHHHHHHHHH
@read4/1
GACAGGCCGTTTGAATGTTGACGGGATG
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@read4/2
CATCCCGTCAACATTCAAACGGCCTGTC
+
HH##########################

Input:

$ seqtk mergepe \
    tests/data/paired.1.fastq \
    tests/data/paired.2.fastq
@read1/1 some text
TTATTTGTCTCCAGCTTAGACATATCGCCT
+
##HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@read1/2 other text
GCTGGAGACAAATAACAGTGGAGTAGTTTT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@read2/1
CAACAGGCCACATTAGACATATCGGATGGT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@read2/2
TGTGGCCTGTTGCAGTGGAGTAACTCCAGC
+
###HHHHHHHHHHHHHHHHHHHHHHHHHHH
@read3/1
CCAACTTGATATTAATAACATTAGACA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHH
@read3/2
TGTTATTAATATCAAGTTGGCAGTG
+
#HHHHHHHHHHHHHHHHHHHHHHHH
@read4/1
GACAGGCCGTTTGAATGTTGACGGGATGTT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@read4/2
CATCCCGTCAACATTCAAACGGCCTGTCCA
+
HH############################

Several things:

  1. In the current implementation, read pairs that do not pass the insert-match filters (the random-match probability is too high, the insert-match error rate is above the threshold, or the reads themselves are too short) are returned unmodified. Such pairs are false negatives that still contain adapter sequence (like read2 in Example 1). Is there a better way to handle them?
  2. I'm not sure whether the overlap filter should be controlled by --minimum-length or --overlap.
  3. The current implementation does not support collecting the trimmed-off sequences for adapter detection.
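On point 3, one possible way to collect trimmed-off sequences is to keep the 3' tail that each trimmed read lost and tally the most frequent tail prefix as a candidate adapter start. The sketch below is only an assumption about how this could look and is not part of this pull request; `trimmed_tails` and `most_common_tail_prefix` are hypothetical names, and the reads are the /1 mates from the input and the Example 2 output.

```python
# Hypothetical sketch, not part of atropos or of this pull request.
from collections import Counter


def trimmed_tails(pairs):
    """Yield the 3' tail removed from each (original, trimmed) read."""
    for original, trimmed in pairs:
        tail = original[len(trimmed):]
        if tail:
            yield tail


def most_common_tail_prefix(tails, length):
    """Return (prefix, count) for the most frequent tail prefix, or None."""
    counts = Counter(t[:length] for t in tails if len(t) >= length)
    return counts.most_common(1)[0] if counts else None


pairs = [
    ("TTATTTGTCTCCAGCTTAGACATATCGCCT", "TTATTTGTCTCCAGC"),       # read1/1
    ("CAACAGGCCACATTAGACATATCGGATGGT", "CAACAGGCCACA"),          # read2/1
    ("CCAACTTGATATTAATAACATTAGACA", "CCAACTTGATATTAATAACA"),     # read3/1
    ("GACAGGCCGTTTGAATGTTGACGGGATGTT", "GACAGGCCGTTTGAATGTTGACGGGATG"),  # read4/1
]
tails = list(trimmed_tails(pairs))
print(most_common_tail_prefix(tails, 7))
# prints ('TTAGACA', 3)
```

Tails shorter than the chosen prefix length (the 2-base tail of read4/1 here) are skipped, so very short remnants do not count toward the tally.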

wckdouglas commented 6 years ago

I realize this may not be the optimal approach to address the problem, so I will close it for now.