nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Analyse data set that contains unknown primer set #743

Status: Closed (ilwookkim closed 1 month ago)

ilwookkim commented 1 month ago

Description of feature

Hi all,

I recently started analysing 16S rRNA sequencing data. Our lab used the Zymo Quick-16S Primer Set V3-V4 for the amplicons. However, Zymo doesn't provide the exact primer sequences, only the primer lengths.

When I provide the primer sequence as 'N{16}', cutadapt refuses to use such a long wildcard string. I would therefore like to trim the primer sequences by length using fastp.

Please let me know whether it is possible or not. Thanks a lot.

d4straub commented 1 month ago

I would first ask Zymo for their primer sequences, because being able to filter reads by primer sequence can be a great advantage. Alternatively, there are some options:

Edit: Essentially, I am not in favour of adding another feature to solve this, because not knowing the primer sequence is not good practice.

ilwookkim commented 1 month ago

Thanks for the answer. I would like to try the second option because Zymo refused to provide their primer sequences :( Thanks again. Best,

d4straub commented 1 month ago

As detailed in #744, my suggestion doesn't work. Please do the following:

This gives cutadapt fake primer sequences so that it won't complain. It sets the primer match to at least 10 Gs (which will never match) and allows all reads that did not contain a primer (i.e. all of them) to pass. The unknown primer sequences are still removed, by length, via -u & -U.
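For anyone wondering what the length-based part amounts to: in cutadapt, a positive `-u N` removes the first N bases of each forward read and `-U N` does the same for the reverse read. Below is a minimal Python sketch of that behaviour only, not of the pipeline itself. The function names, the example reads and the primer lengths (16 and 20) are made up for illustration; use the lengths Zymo actually states for your kit.

```python
from typing import Tuple

# A FASTQ record reduced to (header, sequence, quality string).
FastqRecord = Tuple[str, str, str]


def trim_prefix(record: FastqRecord, n: int) -> FastqRecord:
    """Drop the first n bases and their quality scores from one read."""
    header, seq, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    if n < 0:
        raise ValueError("n must be non-negative in this sketch")
    return header, seq[n:], qual[n:]


def trim_pair(r1: FastqRecord, r2: FastqRecord,
              fw_len: int, rv_len: int) -> Tuple[FastqRecord, FastqRecord]:
    """Trim fw_len bases from the forward read and rv_len from the reverse read."""
    return trim_prefix(r1, fw_len), trim_prefix(r2, rv_len)


# Hypothetical read pair: the leading bases stand in for the unknown primers.
fw_seq = "ACGTACGTACGTACGT" + "TTGGCC"   # 16 "primer" bases + 6 bases of insert
rv_seq = "T" * 20 + "GGAACC"             # 20 "primer" bases + 6 bases of insert
read1 = ("@read1/1", fw_seq, "I" * len(fw_seq))
read2 = ("@read1/2", rv_seq, "I" * len(rv_seq))

trimmed1, trimmed2 = trim_pair(read1, read2, fw_len=16, rv_len=20)
print(trimmed1[1])  # TTGGCC
print(trimmed2[1])  # GGAACC
```

Note that this trims every read blindly, so reads without the expected primer are not filtered out, which is the drawback of not knowing the primer sequence.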

Let me know how that goes.

ilwookkim commented 1 month ago

Thanks a lot! It works without any issue.

d4straub commented 1 month ago

Great! Then I'll close this issue; feel free to open another one if you come across another problem. You could also join the nf-core Slack via https://nf-co.re/join to get access to a more chat-like channel for questions like this.