improve documentation regarding the trimming algorithm and output files

GoogleCodeExporter commented 9 years ago

Hi, 
  Thanks for the really useful tool! It is working mostly fine, but I see unexpected results for many of my sequences, three of which is posted below.

  Specifically, cutadapt's calculated/reported error is one more than the actual error for seq2 below (but it still gets reported despite calculated error rate > 20%). Also for seq1 and seq3, I feel that there are better matching position as explained below. I would appreciate any clarifications and whether I am understanding the error calculation or alignment model right!

Thanks,
Mani

-------------------------------------------------------------     
$ cat tstcase.fasta
>seq1
GGAATAGGTCTGCGGCTGTCGTATGCCTCCTACTCCTTCAA
>seq2
GGGGGCGGGCGCTCTGGAGAGGCGGGTGTCGTATGCCCTCC
>seq3
GGAGAGCACATCCGGTGTTAGAAGCGCTCGTATGCCTTGTA
$ cutadapt -a TCGTATGCCGTCTTCTGCTTG -e 0.2 --trimmed-only --info-file=tmp.info 
tstcase.fasta
>seq2
GGGGGCGGGCGCTCTGGAGAGGCGGGTG
>seq3
GGAGAGCACATCCGGTGTTAGAAGCGCTCGTATGCC
cutadapt version 1.5
Command line parameters: -a TCGTATGCCGTCTTCTGCTTG -e 0.2 --trimmed-only - 
--info-file=tmp.info
Maximum error rate: 20.00%
   No. of adapters: 1
   Processed reads:            3
   Processed bases:          123 bp (0.0 Mbp)
     Trimmed reads:            2 (66.7%)
     Trimmed bases:           18 bp (0.0 Mbp) (14.63% of total)
   Too short reads:            0 (0.0% of processed reads)
    Too long reads:            0 (0.0% of processed reads)
        Total time:      0.00 s
     Time per read:      0.000 ms

=== Adapter 1 ===

Adapter 'TCGTATGCCGTCTTCTGCTTG', length 21, was trimmed 2 times.

No. of allowed errors:
0-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3; 20-21 bp: 4

Overview of removed sequences
length  count   expect  max.err error counts
5       1       0.0     1       0 1
13      1       0.0     2       0 0 0 1

[narayananm@submit-1 cage.mapping.mystery]$ cat tmp.info
seq2    3       28      41      GGGGGCGGGCGCTCTGGAGAGGCGGGTG    TCGTATGCCCTCC   
        1
seq3    1       36      41      GGAGAGCACATCCGGTGTTAGAAGCGCTCGTATGCC    TTGTA   
        1
-------------------------------------------------------------

seq3 could be better matched a few bases marked --> below: 
GGAGAGCACATCCGGTGTTAGAAGCGC-->TCGTATGCC    TTGTA

seq1 could also be matched using permitted error like so, but cutadapt doesn't 
report it.  
GGAATAGGTCTGCGGCTG-->TCGTATGCCTCCT<--ACTCCTTCAA

Original issue reported on code.google.com by nma...@gmail.com on 12 Sep 2014 at 4:20

GoogleCodeExporter commented 9 years ago

Hi, thanks for looking so detailed into cutadapt's output :-)

I can explain the result of seq1 and seq3. The reported number of errors for 
seq2 may be a bug, I'll look into that further.

The only sensible match that cutadapt could report for seq1 is

GGAATAGGTCTGCGGCTG TCGTATGCCTCCTACTCCTTC AA
(Where the middle is the part of the read that matches the adapter.)
But the alignment between
TCGTATGCCTCCTACTCCTTC (bases in read) and
=========XX==X==X===X
TCGTATGCCGTCTTCTGCTTG (adapter)
has five errors. Since at most 4 errors are allowed for matches between 20 and 
21 bp, it's not reported.
It seems you expected that cutadapt would match a prefix of the adapter 
sequence to the read since you indicated that 'TCGTATGCCTCCT' could be a match. 
Cutadapt allows only two types of matches (for 3' adapters): Either the adapter 
occurs in full somewhere within the read or a prefix of the adapter matches a 
suffix of the read (overlap alignment). This models what happens in reality: 
Either the adapter is added in full to the DNA fragment or it's not. If the 
read is long enough, we see it in full. If the read is not long enough, we only 
see its beginning.

Regarding seq3, the match you indicate would have three errors:
TCGTATGCCTTGTA (end of read)
=========X=X=X
TCGTATGCCGTCTTCTGCTTG (adapter)

The match has a length of 14 bp (counting adapter bases). As you see under the 
"No. of allowed errors:" heading in the status report, only 2 errors are 
allowed for matches of length 14 ("10-14 bp: 2"). Thus, this match is ignored.

Please tell me whether you think cutadapt's alignment algorithm should behave 
differently.

I'll see whether I can figure out what went wrong with seq2.

Original comment by marcel.m...@tu-dortmund.de on 13 Sep 2014 at 9:49

GoogleCodeExporter commented 9 years ago

Ok, the case of seq2 is quite an interesting one ...

I've reproduced the issue with a shorter version of your sequences: the read 
TCGTATGCCCTCC and the adapter TCGTATGCCGTCTTC.

Here is the alignment that you would expect:

TCGTATGCCCTCC   (read)
=========X==X
TCGTATGCCGTCTTC (adapter)

But it turns out that the actual alignment that is found is this one:

TCGTATGCCGTCTTC
=========X==XX=
TCGTATGCCCTC--C

Since it has length 15 and contains three errors, the error rate is actually 
3/15=20%, so it's within the threshold.

To understand why that alignment was chosen, one needs to know that cutadapt 
doesn't really care about the actual number of errors as long as the alignment 
has not more errors than the maximum error rate allows (20% in your case). The 
alignment algorithm doesn't try to find an alignment with few errors, but 
instead it tries to find one with many matches. And in this case, the first 
alignment has 11 matches, and the second one has 12.

This is sometimes a bit surprising, as you have noticed, but is consistent with 
the way I wanted the algorithm to behave.

As an explanation, the problem lies in the maximum error rate: I want users to 
be able to specify that as a parameter because it is quite easy to understand. 
The problem then is that you cannot optimize for 'as few errors as possible' 
since an alignment in which the read and the adapter do not overlap at all is 
always optimal (since it has zero errors). Typically, this is solved by using 
alignment scores, which I didn't want since they aren't as easy to understand. 
Instead, I chose to optimize the number of matches (but still under the 
constraint that the actual error rate may not go above the specified maximum 
error rate). If you are really, really interested (or bored), you could read 
the chapter about cutadapt in my PhD thesis, available at 
http://hdl.handle.net/2003/31824 , where this optimization problem and the 
algorithm are described in more detail. (I'm not offended if you don't read it.)

One other important point to note is that this particularity doesn't influence 
the trimming result: Try to run cutadapt with the maximum error rate reduced to 
0.16. This prevents the second alignment with 3 errors to be found and suddenly 
the reported number of errors in the info file is down to 2.

Unless you think the alignment algorithm should behave differently, I'd like to 
not change cutadapt's behavior, including the seemingly incorrect reported 
number of errors. I will add some more documentation instead, and maybe a 
pointer to this issue report. Perhaps it would also make sense to print the 
number of matches in the alignment to the info file since that is the 
optimization criterion.

Original comment by marcel.m...@tu-dortmund.de on 13 Sep 2014 at 10:58

GoogleCodeExporter commented 9 years ago

Thank you very much for your careful explanation! I also scanned the figures in 
your thesis and understand the behavior of cutadapt much better now. Fig 2.4 
was interesting to get a quick idea of how you calculated the best matching 
sequence within a specified error rate! Also, Fig 2.1 in your thesis could be a 
good addition to cutadapt's documentation - it made me easily orient to 
cutadapt's trimming behavior (btw, the README link in your google project home 
page appears broken). 

I see now the rationale you've for the current alignment/error model, and so 
have no suggestions for an alternate model. However, I do request you to fully 
specify the exact algorithm in the "Algorithm" section of the documentation to 
help detail-oriented :) users understand the exact reason why certain sequences 
are trimmed and others not. I found some description of the algorithm in your 
Embnet paper, but it doesn't seem to fully agree with what you said above.  

So perhaps along with Fig 2.1, you could also cite your thesis and consider 
including something like this in the algorithm description (excerpted from your 
email): 
"The algorithm optimizes the number of matches under the constraint that the 
actual error rate may not go above the specified maximum error rate. So it 
doesn't exclusively optimize the number of errors."
"Cutadapt allows only two types of matches for 3' adapters to model what 
happens in reality when read sequencing runs into an adapter: either the 
adapter occurs in full somewhere within the read or a prefix of the adapter 
matches a suffix of the read (overlap alignment)."

Also, while I am at it, I thought another addition to the documentation 
(actually just to the command-line argument "help" would suffice too) could 
clarify --too-long-output=f1 option. When it is set, file f1 contained both 
trimmed and non-trimmable (i.e., reads untrimmed due to no adapter match at 
permitted error rate) reads that were long, and not just the former as one may 
assume. As a result, the --untrimmed-output=f2 file contents would change 
depending on whether the --too-long-output=f1 and -M options are set or not.

Original comment by nma...@gmail.com on 15 Sep 2014 at 5:38

GoogleCodeExporter commented 9 years ago

Hi, thanks for your suggestions. I'm working on updating the documentation and 
will add a proper section about the algorithm. The description in the EMBnet 
paper is for an older version of cutadapt. I switched the algorithm after the 
paper appeared.

I will also clarify how the --too-long-output option works and how it works 
together with --untrimmed-output.

I'll leave this issue open until I've done the above - it may take a while.

Original comment by marcel.m...@tu-dortmund.de on 17 Sep 2014 at 9:14

Changed title: improve documentation regarding the trimming algorithm and output files
Changed state: Accepted

GoogleCodeExporter commented 9 years ago

I haven't used the exact words you suggested, but I think I have covered all 
your suggested documentation improvements. The updated documentation is at 
https://cutadapt.readthedocs.org/ . Thanks again for your suggestions!

Original comment by marcel.m...@tu-dortmund.de on 4 Oct 2014 at 5:57

Changed state: Done

jgaetel / cutadapt

improve documentation regarding the trimming algorithm and output files #80