Hi, thanks for looking so detailed into cutadapt's output :-)
I can explain the result of seq1 and seq3. The reported number of errors for
seq2 may be a bug, I'll look into that further.
The only sensible match that cutadapt could report for seq1 is
GGAATAGGTCTGCGGCTG TCGTATGCCTCCTACTCCTTC AA
(Where the middle is the part of the read that matches the adapter.)
But the alignment between
TCGTATGCCTCCTACTCCTTC (bases in read) and
=========XX==X==X===X
TCGTATGCCGTCTTCTGCTTG (adapter)
has five errors. Since at most 4 errors are allowed for matches between 20 and
21 bp, it's not reported.
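To make the arithmetic concrete, here is a minimal Python sketch of that check.
It assumes the number of allowed errors is computed as floor(match length *
maximum error rate), which is consistent with the numbers quoted in this report
(a 20% maximum error rate); the variable names are mine, not cutadapt's
internals.

from math import floor

read_part = "TCGTATGCCTCCTACTCCTTC"   # bases of the read that align to the adapter
adapter   = "TCGTATGCCGTCTTCTGCTTG"   # the adapter sequence

errors = sum(1 for r, a in zip(read_part, adapter) if r != a)
allowed = floor(len(adapter) * 0.2)   # assumed: floor(match length * max error rate)

print(errors, allowed, errors <= allowed)   # 5 4 False -> the match is not reported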
It seems you expected that cutadapt would match a prefix of the adapter
sequence to the read since you indicated that 'TCGTATGCCTCCT' could be a match.
Cutadapt allows only two types of matches (for 3' adapters): Either the adapter
occurs in full somewhere within the read or a prefix of the adapter matches a
suffix of the read (overlap alignment). This models what happens in reality:
Either the adapter is added in full to the DNA fragment or it is not. If the
read is long enough, we see the whole adapter. If the read is not long enough,
we only see the adapter's beginning.
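Purely as an illustration of these two match types (using exact matching only;
cutadapt's real algorithm is error-tolerant, as described below), here is a
sketch in Python. The function name and the example read are made up for this
illustration.

def find_3prime_match(read: str, adapter: str):
    # Type 1: the full adapter occurs somewhere within the read.
    pos = read.find(adapter)
    if pos >= 0:
        return pos, len(adapter)
    # Type 2: a prefix of the adapter matches a suffix of the read.
    for length in range(min(len(adapter), len(read)) - 1, 0, -1):
        if read.endswith(adapter[:length]):
            return len(read) - length, length
    return None

# Hypothetical read that ends in the first nine bases of the adapter:
print(find_3prime_match("GGAATAGGTCTGCGGCTGTCGTATGCC", "TCGTATGCCGTCTTCTGCTTG"))
# -> (18, 9): the match starts at position 18 and covers 9 adapter bases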
Regarding seq3, the match you indicate would have three errors:
TCGTATGCCTTGTA (end of read)
=========X=X=X
TCGTATGCCGTCTTCTGCTTG (adapter)
The match has a length of 14 bp (counting adapter bases). As you see under the
"No. of allowed errors:" heading in the status report, only 2 errors are
allowed for matches of length 14 ("10-14 bp: 2"). Thus, this match is ignored.
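The brackets in that status report can be reproduced with the same assumed
formula, floor(match length * maximum error rate), again with the 20% rate used
here:

from math import floor

max_error_rate = 0.2
for length in (9, 10, 14, 15, 20, 21):
    print(length, floor(length * max_error_rate))
# 9 -> 1, 10 -> 2, 14 -> 2, 15 -> 3, 20 -> 4, 21 -> 4
# hence "10-14 bp: 2" and at most 4 allowed errors for 20-21 bp matches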
Please tell me whether you think cutadapt's alignment algorithm should behave
differently.
I'll see whether I can figure out what went wrong with seq2.
Original comment by marcel.m...@tu-dortmund.de
on 13 Sep 2014 at 9:49
Ok, the case of seq2 is quite an interesting one ...
I've reproduced the issue with a shorter version of your sequences: the read
TCGTATGCCCTCC and the adapter TCGTATGCCGTCTTC.
Here is the alignment that you would expect:
TCGTATGCCCTCC (read)
=========X==X
TCGTATGCCGTCTTC (adapter)
But it turns out that the actual alignment that is found is this one:
TCGTATGCCGTCTTC
=========X==XX=
TCGTATGCCCTC--C
Since it has length 15 and contains three errors, the error rate is actually
3/15=20%, so it's within the threshold.
To understand why that alignment was chosen, one needs to know that cutadapt
doesn't really care about the actual number of errors as long as the alignment
has no more errors than the maximum error rate allows (20% in your case). The
alignment algorithm doesn't try to find an alignment with few errors, but
instead it tries to find one with many matches. And in this case, the first
alignment has 11 matches, and the second one has 12.
This is sometimes a bit surprising, as you have noticed, but is consistent with
the way I wanted the algorithm to behave.
As an explanation, the problem lies in the maximum error rate: I want users to
be able to specify that as a parameter because it is quite easy to understand.
The problem then is that you cannot optimize for 'as few errors as possible'
since an alignment in which the read and the adapter do not overlap at all is
always optimal (since it has zero errors). Typically, this is solved by using
alignment scores, which I didn't want since they aren't as easy to understand.
Instead, I chose to optimize the number of matches (but still under the
constraint that the actual error rate may not go above the specified maximum
error rate). If you are really, really interested (or bored), you could read
the chapter about cutadapt in my PhD thesis, available at
http://hdl.handle.net/2003/31824 , where this optimization problem and the
algorithm are described in more detail. (I'm not offended if you don't read it.)
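Put differently, the selection criterion can be sketched like this (a
simplified illustration using the two seq2 alignments from above; the real
algorithm finds candidate alignments by dynamic programming rather than taking
them from a hand-written list):

# Each candidate: (label, matches, errors, alignment length)
candidates = [
    ("expected alignment", 11, 2, 13),
    ("reported alignment", 12, 3, 15),
]

def best_alignment(candidates, max_error_rate):
    # Keep only alignments whose error rate stays within the maximum,
    # then pick the one with the most matches.
    admissible = [c for c in candidates if c[2] / c[3] <= max_error_rate]
    return max(admissible, key=lambda c: c[1], default=None)

print(best_alignment(candidates, 0.2))   # ('reported alignment', 12, 3, 15)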
One other important point to note is that this peculiarity doesn't influence
the trimming result: Try running cutadapt with the maximum error rate reduced to
0.16. This prevents the second alignment with its 3 errors from being found, and
the reported number of errors in the info file drops to 2.
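The arithmetic behind that, using the sketch above: 3/15 = 20% exceeds 0.16,
while 2/13 is about 15.4% and does not, so only the 2-error alignment remains
admissible.

print(best_alignment(candidates, 0.16))   # ('expected alignment', 11, 2, 13)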
Unless you think the alignment algorithm should behave differently, I'd like to
not change cutadapt's behavior, including the seemingly incorrect reported
number of errors. I will add some more documentation instead, and maybe a
pointer to this issue report. Perhaps it would also make sense to print the
number of matches in the alignment to the info file since that is the
optimization criterion.
Original comment by marcel.m...@tu-dortmund.de
on 13 Sep 2014 at 10:58
Thank you very much for your careful explanation! I also scanned the figures in
your thesis and understand the behavior of cutadapt much better now. Fig 2.4
gave me a quick idea of how you calculate the best-matching sequence within a
specified error rate! Also, Fig 2.1 in your thesis could be a good addition to
cutadapt's documentation - it helped me orient quickly to cutadapt's trimming
behavior (btw, the README link on your Google project home page appears broken).
I see now the rationale you have for the current alignment/error model, and so
have no suggestions for an alternative model. However, I do ask that you fully
specify the exact algorithm in the "Algorithm" section of the documentation, to
help detail-oriented :) users understand exactly why certain sequences are
trimmed and others are not. I found some description of the algorithm in your
EMBnet paper, but it doesn't seem to fully agree with what you said above.
So perhaps along with Fig 2.1, you could also cite your thesis and consider
including something like this in the algorithm description (excerpted from your
email):
"The algorithm optimizes the number of matches under the constraint that the
actual error rate may not go above the specified maximum error rate. So it
doesn't exclusively optimize the number of errors."
"Cutadapt allows only two types of matches for 3' adapters to model what
happens in reality when read sequencing runs into an adapter: either the
adapter occurs in full somewhere within the read or a prefix of the adapter
matches a suffix of the read (overlap alignment)."
Also, while I am at it, I thought another addition to the documentation
(actually, adding it just to the command-line argument help would suffice) could
clarify the --too-long-output=f1 option. When it is set, file f1 contains both
trimmed and non-trimmable reads (i.e., reads left untrimmed because there is no
adapter match at the permitted error rate) that are too long, and not just the
former as one may assume. As a result, the contents of the --untrimmed-output=f2
file change depending on whether the --too-long-output=f1 and -M options are set.
Original comment by nma...@gmail.com
on 15 Sep 2014 at 5:38
Hi, thanks for your suggestions. I'm working on updating the documentation and
will add a proper section about the algorithm. The description in the EMBnet
paper is for an older version of cutadapt. I switched the algorithm after the
paper appeared.
I will also clarify how the --too-long-output option works and how it interacts
with --untrimmed-output.
I'll leave this issue open until I've done the above - it may take a while.
Original comment by marcel.m...@tu-dortmund.de
on 17 Sep 2014 at 9:14
I haven't used the exact words you suggested, but I think I have covered all
your suggested documentation improvements. The updated documentation is at
https://cutadapt.readthedocs.org/ . Thanks again for your suggestions!
Original comment by marcel.m...@tu-dortmund.de
on 4 Oct 2014 at 5:57
Original issue reported on code.google.com by
nma...@gmail.com
on 12 Sep 2014 at 4:20