marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
502 stars 126 forks source link

Unexpected trimming behavior #784

Closed adrianomartinelli closed 1 month ago

adrianomartinelli commented 1 month ago

Hey! I am playing around with cutadapt to figure out if I can use it for our use case. I ran into the following issue:

My target sequence is AAAAAAAAAA and I ran cutadapt agains the sequences shown below that are of the same length but with a substitution either at the start or the end of the sequence. I would have expected that with an overlap of 9 and 0 error rate both reads in the input.fastq are identified. However, substitutions at the end are treated differently from substitutions at the start?

Maybe as a general background on what I try to achieve. I have reads structured like:

{barcode_1}{const_region_1}{barcode_2}{const_region_2}{barcode_3}

Each barcode_N can come from a set of ~100 different barcodes (i.e. the barcode a position barcode_1 and barcode_3 might be identical. My goal is to obtain a table indicating how often each barcode combinations was observed during the sequencing. My plan was to sequentially "trim" adapter sequences and using the info.tsv file to trace the barcode identifiers. Alternatively, I could maybe use the demultiplexing strategy, but this would lead to thousands of output files.

Any suggestions of how to handle this and insights to the adapter trimming as described above are welcomed.

Best, Adriano

marcelm commented 1 month ago

Hi, here comes a late reply because I was on vacation. Let me know how you solved the problem if you have already.

Option -g allows the adapter to occur 1) anywhere in full within the read and 2) partially in the beginning (at the 5' end). The alignment that is found in your second example (read-snp-end) looks like this:

 AAAAAAAAAA  Adapter
  |||||||||
  AAAAAAAAAT Read

The -g option does not allow partial occurrences at the 3' end, which is why no adapter is found for the first example.

Since your barcode_1 appears to be at the 5' end of the read, I would instead recommend using anchored 5' adapters (that is, add ^ before the sequence, in your case it would probably be -g ^file:barcodes.fasta).

I would suggest to run one round where you trim the first barcode, then another where you trim the first constant region, then trim the second barcode etc.

I agree demultiplexing is probably not a good fit here.

Instead of using an info file, you could use --rename with {adapter_name} to put the name of the barcode into the FASTQ header. This may be easier because the read header will thus have all barcode names in the end and you can throw away the files from the intermediate rounds. If you use info files, you need to process all the the intermediate files.

adrianomartinelli commented 1 month ago

Hei @marcelm ! Thanks for the clarification and the hint with the rename, I think this is a very good approach.

I was now playing around with the anchoring option and observed another unexpected trimming behavior. I have the following input.fastq with an adapter AAA at the start of the construct. The first read has no error and then I create a SNP at position 0,1 and 3 of the adapter. Furthermore, I create indels before, in and after the adapter.

@read-no-error
AAAGTCAAA
+
~~~~~~~~~
@read-SNP-0
TAAGTCAAA
+
~~~~~~~~~
@read-SNP-1
ATAGTCAAA
+
~~~~~~~~~
@read-SNP-2
AATGTCAAA
+
~~~~~~~~~
@read-indel-0
TAAAGTCAAA
+
~~~~~~~~~~
@read-indel-1
ATAAGTCAAA
+
~~~~~~~~~~
@read-indel-2
AATAGTCAAA
+
~~~~~~~~~~
@read-indel-3
AAATGTCAAA
+
~~~~~~~~~~

Then I run cutadapt, allowing for 1 error and the anchored adapter option.

cutadapt -e 1 \
-g ^AAA \
-o out.fastq input.fastq \
--rename '{id}-{comment}-{adapter_name}' \
--info-file=info.tsv

And I obtain the following info.tsv

read-no-error--1    0   0   3       AAA GTCAAA  1       ~~~ ~~~~~~  
read-SNP-0--1       1   0   3       TAA GTCAAA  1       ~~~ ~~~~~~
read-SNP-1--1       1   0   3       ATA GTCAAA  1       ~~~ ~~~~~~
read-SNP-2--1       1   0   2       AA  TGTCAAA 1       ~~  ~~~~~~~
read-indel-0--1     1   0   4       TAAA GTCAAA 1       ~~~~ ~~~~~~
read-indel-1--1     1   0   3       ATA AGTCAAA 1       ~~~ ~~~~~~~
read-indel-2--1     1   0   2       AA TAGTCAAA 1       ~~  ~~~~~~~~
read-indel-3--1     0   0   3       AAA TGTCAAA 1       ~~~ ~~~~~~~

I am surprised to see read-SNP-2--1 and read-indel-2--1 be trimmed like that. Would that not imply an alignment like:

 AATGTCAAA
X||
AAA

With error = 1 and a score of -2 + 1 + 1 = 0. Instead of

AATGTCAAA
||X
AAA

this alignment with error = 1 and score 1 + 1 + -1 = 1.

And for read-indel-2--1

# error = 1
# score = -2 + 1 + 1 = 0
 AATAGTCAAA
X||
AAA

instead of

# error = 1
# score = 1 + 1 - 1 = 1
AATAGTCAAA
||X
AAA

It seems like indels at the start are not punished or counted as errors? Since I am using the anchored option for which the sequence has to occur in full length I did not expect that.

If I run the command against the same files allowing for two errors I get the following trimming

cutadapt -e 2 \
-g ^AAA \
-o out.fastq input.fastq \
--rename '{id}-{comment}-{adapter_name}' \
--info-file=info.tsv
read-no-error--1    0   0   3       AAA GTCAAA  1       ~~~ ~~~~~~  
read-SNP-0--1   1   0   3       TAA GTCAAA  1       ~~~ ~~~~~~  
read-SNP-1--1   2   0   1       A   TAGTCAAA    1       ~   ~~~~~~~~    
read-SNP-2--1   1   0   2       AA  TGTCAAA 1       ~~  ~~~~~~~ 
read-indel-0--1 1   0   4       TAAA    GTCAAA  1       ~~~~    ~~~~~~  
read-indel-1--1 2   0   1       A   TAAGTCAAA   1       ~   ~~~~~~~~~   
read-indel-2--1 1   0   2       AA  TAGTCAAA    1       ~~  ~~~~~~~~    
read-indel-3--1 0   0   3       AAA TGTCAAA 1       ~~~ ~~~~~~~ 

Again, appreciated if you could enlighten me.

Best, Adriano

marcelm commented 1 month ago

Thanks for reporting! In short, it appears that the score matrix is not initialized correctly.

Let me reply to these comments first:

Would that not imply an alignment like:

 AATGTCAAA
X||
AAA

[...] It seems like indels at the start are not punished or counted as errors?

They are penalized and incur one error. There are three equivalent versions of the alignment, all of which delete one of the A. So it would be:

-AATGTCAAA
X||
AAA

or:

A-ATGTCAAA
|X|
AAA

or:

A-ATGTCAAA
|X|
AAA

You are correct that the other alignment with one mismatch instead of one indel would result in a higher score and should be preferred. This turns out to be a bug.

Cutadapt can print the dynamic programming matrix that contains the edit distances when you run it with --debug --debug. I have updated this now so that it also prints the matrix with scores.

For you (first) example, it looks like this (omitting irrelevant log output):

$ echo -e '>r\nAATGTCAAA' | cutadapt --debug --debug -e 1 -g ^AAA -
...
Edit distances:
      A  A  T  G  T  C  A  A  A
   0  1  2  3  4               
A  1  0  1  2  3               
A  2  1  0  1  2               
A  3     1  1  2               
Scores:
      A  A  T  G  T  C  A  A  A
   0  0  0  0  0               
A  0  1  1 -1 -3               
A  0  1  2  0 -2               
A  0     2  1 -1               

The adapter is on the vertical axis, the query on the horizontal axis.

The matrix with edit distances looks fine: There are two entries with edit distance 1 in the last row, which correspond to the two alternatives of either deleting one of the As or letting the last A be a mismatch to the T.

The score matrix, however, is initialized incorrectly: The scores in the left column should not all be zero, but should reflect the deletion score (they should be 0, -2, -4, -6).

I’ll look into this.

marcelm commented 1 month ago

Should now be fixed:

$ echo -e '>r\nAATGTCAAA' | cutadapt --quiet -e 1 -g ^AAA -
>r
GTCAAA
adrianomartinelli commented 4 weeks ago

Great thanks for the quick fix!