marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 130 forks source link

Controlling error rate of substring within regular adapter #684

Open racng opened 1 year ago

racng commented 1 year ago

Is there a way to specify error tolerance for a specific subsequence within an adapter? For example, in my regular 5' adapter, the first 8 nucleotides indicates the sample identity but the following 33 nts are shared across all samples. If I set a specific error tolerance (e.g. -e 6), I am afraid the errors might concentrate within the first 8 nts. I don't find anything I could use in the documentation or github issues that I can utilize, but hope to find out any clever tricks for this purpose. Thanks!

Is there a way to export what characters in the adapter sequence is matched in the output of --info-file ? Something like +++-+.+++ for a adapter with 9 nts, where + is a match, - is a mismatch, . is a deletion? As I am requiring full match using min_overlap equal to length of adapter, if I get access to this field, then I can process the info file to filter out reads that has too many errors within the first 8 nucleotides.

marcelm commented 1 year ago

There is no built-in way to set a variable error tolerance for a single adapter like this. Also, the alignment between the read and the adapter is not computed, so there’s no way to find out where matches, mismatches and deletions are – unless you use --no-indels and then you would just need to compare the adapter and the part of the read to which it was matched character by character.

Somehow, you need to split up the search into two steps: One for the first 8 nt sample id, one for the 33 nt sequence following it. My initial thought was to trim the sample identifier first, but since you use a regular 5' adapter and 8 nt is quite short, you would probably get false positives that you wouldn’t get when including the subsequent 33 nt in the search.

So my suggestion is the following instead.

  1. Search for the 33 nt sequence first, but precede the sequence by N{8}, i.e., use -g N{8}mysequence. Use --action=retain so that the part before the sample id is removed, but not the id itself, and also use --discard-untrimmed so that you can be sure that only sequences with the expected 33 nt sequence are used in the next step.
  2. Search for the ids using anchored 5' adapters (-g ^file:sampleids.fasta) and do the demultiplexing (which I assume you want to).
  3. Finally, remove the 33 nt sequence using again an anchored 5' adapter. You would need to do this for all the demultiplexed output files.

This is a bit more work than a single cutadapt invocation, sorry about that. I agree that using a single, high error tolerance for the combined sequence could be an issue; I’ll have to think about whether this could be supported directly in Cutadapt.