Open racng opened 1 year ago
There is no built-in way to set a variable error tolerance for a single adapter like this. Also, the alignment between the read and the adapter is not computed, so there’s no way to find out where matches, mismatches and deletions are – unless you use --no-indels
and then you would just need to compare the adapter and the part of the read to which it was matched character by character.
Somehow, you need to split up the search into two steps: One for the first 8 nt sample id, one for the 33 nt sequence following it. My initial thought was to trim the sample identifier first, but since you use a regular 5' adapter and 8 nt is quite short, you would probably get false positives that you wouldn’t get when including the subsequent 33 nt in the search.
So my suggestion is the following instead.
N{8}
, i.e., use -g N{8}mysequence
. Use --action=retain
so that the part before the sample id is removed, but not the id itself, and also use --discard-untrimmed
so that you can be sure that only sequences with the expected 33 nt sequence are used in the next step.-g ^file:sampleids.fasta
) and do the demultiplexing (which I assume you want to).This is a bit more work than a single cutadapt
invocation, sorry about that. I agree that using a single, high error tolerance for the combined sequence could be an issue; I’ll have to think about whether this could be supported directly in Cutadapt.
Is there a way to specify error tolerance for a specific subsequence within an adapter? For example, in my regular 5' adapter, the first 8 nucleotides indicates the sample identity but the following 33 nts are shared across all samples. If I set a specific error tolerance (e.g.
-e 6
), I am afraid the errors might concentrate within the first 8 nts. I don't find anything I could use in the documentation or github issues that I can utilize, but hope to find out any clever tricks for this purpose. Thanks!Is there a way to export what characters in the adapter sequence is matched in the output of
--info-file
? Something like+++-+.+++
for a adapter with 9 nts, where+
is a match,-
is a mismatch,.
is a deletion? As I am requiring full match usingmin_overlap
equal to length of adapter, if I get access to this field, then I can process the info file to filter out reads that has too many errors within the first 8 nucleotides.