Analyse data set that contains unknown primer set

ilwookkim commented 5 months ago

Description of feature

Hi all,

I recently started analysing 16S rRNA sequencing data. Our lab used the Zymo Quick-16S Primer Set V3-V4 for the amplicons. However, Zymo doesn't provide the exact primer sequences, only the primer lengths.

When I provide the primer sequence as 'N{16}', cutadapt refuses to use such a long wildcard string. Therefore, I would like to trim the primers by length using fastp.

Please let me know whether this is possible. Thanks a lot.

d4straub commented 5 months ago

I would first ask Zymo for their primer sequences, because being able to filter reads by primer sequence can be a great advantage. Alternatively, there are some options:

1. Trim the sequences outside of ampliseq using any tool you want, then run ampliseq with --skip_cutadapt.

2. Use cutadapt to trim reads by length by modifying this part with a config: append -c cutadapt.config to your command, where cutadapt.config contains the following (-u 20 -U 20 removes a 20-nt primer from the forward and reverse reads) [not tested, you might need to resolve the variables!]

process {
    withName: CUTADAPT_BASIC {
        // -u 20 / -U 20: remove the first 20 bases of forward / reverse reads
        ext.args = { [
            "--minimum-length 1 -u 20 -U 20",
            "-O ${params.cutadapt_min_overlap}",
            "-e ${params.cutadapt_max_error_rate}",
            params.pacbio ? "--rc -g ${meta.fw_primer}...${meta.rv_primer_revcomp}" :
                params.iontorrent ? "--rc -g ${meta.fw_primer}...${meta.rv_primer_revcomp}" :
                params.single_end ? "-g ${meta.fw_primer}" :
                "-g ${meta.fw_primer} -G ${meta.rv_primer}",
            params.retain_untrimmed ? '' : "--discard-untrimmed"
        ].join(' ').trim() }
    }
}

3. Skip cutadapt with --skip_cutadapt and use a config (similar to the one above) to make filtntrim trim the reads, see here.

Edit: Essentially, I am not in favour of adding another feature to solve this, because not knowing the primer sequence is not good practice.
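
For the option of trimming reads outside ampliseq, a minimal sketch could look like the following. It assumes uncompressed FASTQ input with exactly four lines per record and a known primer length; `trim_fastq_prefix` is a hypothetical helper name, not part of ampliseq or cutadapt, and the file names in the comment are placeholders.

```shell
# Hypothetical helper: drop the first N characters of every sequence line and
# every quality line of a 4-line-per-record FASTQ stream read from stdin.
# Lines 2 and 4 of each record are the even-numbered lines of the stream.
trim_fastq_prefix() {
    awk -v n="$1" 'NR % 2 == 0 { print substr($0, n + 1); next } { print }'
}

# Example use with placeholder file names (not executed here):
#   zcat sample_R1.fastq.gz | trim_fastq_prefix 16 | gzip > trimmed_R1.fastq.gz
#   zcat sample_R2.fastq.gz | trim_fastq_prefix 16 | gzip > trimmed_R2.fastq.gz
```

The trimmed files would then be passed to ampliseq together with --skip_cutadapt. Unlike primer-based trimming, this removes a fixed number of bases from every read whether or not a primer is actually present.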

ilwookkim commented 5 months ago

Thanks for the answer. I would like to try the second option, because Zymo refused to provide their primer sequences :( Thanks again. Best,

d4straub commented 5 months ago

As detailed in #744, my suggestion doesn't work. Please do the following:

1. Use -c cutadapt.config as above.
2. Append --FW_primer GGGGGGGGGG --RV_primer GGGGGGGGGG --retain_untrimmed --cutadapt_min_overlap 10.
3. Run as usual, but do not use --qiime_ref_taxonomy or --cut_dada_ref_taxonomy, because they require correct primers.

This gives cutadapt fake primer sequences so that it won't complain. It sets the required primer match to at least 10 G's (which will never match) and, with --retain_untrimmed, lets all reads that did not contain a primer (i.e. all of them) pass. The unknown primer sequences are still removed by -u and -U.
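
Put together, the steps above might look like the following dry run, which only prints the command so it can be checked before running. The primer and cutadapt options are the ones given in this thread; the `-profile`, `--input` and `--outdir` values are placeholder assumptions that need to be adapted, and `ampliseq_cmd` is a hypothetical function name.

```shell
# Prints the assembled command instead of running it.
# cutadapt.config is the config file shown earlier in this thread.
ampliseq_cmd() {
    echo nextflow run nf-core/ampliseq \
        -profile docker \
        -c cutadapt.config \
        --input samplesheet.tsv \
        --outdir results \
        --FW_primer GGGGGGGGGG \
        --RV_primer GGGGGGGGGG \
        --retain_untrimmed \
        --cutadapt_min_overlap 10
}

ampliseq_cmd
```

Once the printed command looks right, it can be copied and run directly.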

Let me know how that goes.

ilwookkim commented 5 months ago

Thanks a lot! It works without any issue.

d4straub commented 5 months ago

Great! Then I'll close this issue; feel free to open another one if you come across another problem. You could also join the nf-core Slack via https://nf-co.re/join to get access to a more chat-like way of asking questions like this.

nf-core / ampliseq

Analyse data set that contains unknown primer set #743
