nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License
178 stars 112 forks source link

Implement passthru of cutadapt min overlap and error rate options #426

Closed tjcreedy closed 2 years ago

tjcreedy commented 2 years ago

Description of feature

Hi, I have a couple of projects to analyse where the primers are degenerate and the last 3 or 5 bases have ambiguities that allow them to match the first 3-5 bases of the primer sequence on the amplicon.

For example, this primer TCDGGRTGNCCRAARAAYCA matches twice on this read:

       |------primer------|
       TCAGGATGTCCAAAGAACCAATCGATCG
match1 --|
match2 |------------------|

Cutadapt obviously takes the longer match where it can. However, in reads where there are enough sequencing errors within the primer to exceed the default error rate, and we would desire that these reads be discarded by cutadapt instead of being trimmed, cutadapt instead trims the first 3 bases and outputs the incorrect read. Testing shows that running cutadapt without setting the minimum overlap on these data does create a small number of erroneously trimmed or retained reads (~2-4%) compared with setting minimum overlap to 4.

Thus, I would like to be able to set the minimum overlap for cutadapt to be greater than 3, but there are no ampliseq options that allow this argument to be passed through.

This can currently be hacked by including the min_overlap string along with the primer, i.e.:

--RV_primer "'TCDGGRTGNCCRAARAAYCA;min_overlap=4'"

but it seems better that this should be able to be implemented in the options. Of course, it may also be desirous in some cases to be able to adjust the error rate, to match primer sequences with more than 10% errors, so being able to pass this parameter through to cutadapt would also be useful.

I'm new to Nextflow and unfortunately not yet at the level where I could make a pull request and implement this myself. Hopefully in the future!

Many thanks for an excellent workflow

d4straub commented 2 years ago

Hi there, I am glad that you like ampliseq! There are indeed ways for non-implemented parameters to be forwarded to specific pipeline steps (processes that are defined in modules). Most (i.e. not fixed) settings of modules are written in conf/modules.config, for cutadapt here. You can modify those by using the nextflow parameter -c, i.e. nextflow run nf-core/ampliseq <your params> -c cutadapt.config where cutadapt.config contains:

process {
    withName: CUTADAPT_BASIC {
        ext.args = '--minimum-length 1 -g "TCDGGRTGNCCRAARAAYCA;min_overlap=4" -G "TCDGGRTGNCCRAARAAYCA;min_overlap=4" --discard-untrimmed'
    }
}

Problem here is that the original ext.args is overwritten, i.e. primer sequences, single_end, etc. are not taken into consideration. The user needs to know what (s)he is doing. Also, careful with -resume when using -c, it ignores changes of ext. settings in configs and just loads cached results if available. The concept of configs is also (partially) explained in https://nf-co.re/ampliseq/2.3.1/usage#custom-configuration. Hope that helps.

where the primers are degenerate and the last 3 or 5 bases have ambiguities that allow them to match the first 3-5 bases of the primer sequence on the amplicon.

That could actually lead to too many sequences removed in the DADA2 chimera removal, please take care here.

This can currently be hacked by including the min_overlap string along with the primer, i.e.: --RV_primer "'TCDGGRTGNCCRAARAAYCA;min_overlap=4'"

This only works with some settings of the pipeline, with others it should fail (i.e. using --illumina_pe_its or --iontorrent) because primer sequences are reverse-complemented for some steps.

I agree, that min_overlap setting might be a good idea to add as a parameter. I'll keep the issue open to address this at a later time.

edit: would you suggest the global cutadapt parameter -O & -e are sufficient or do you think separate parameters for primers are required? I'd think global might be enough?

tjcreedy commented 2 years ago

Hi,

Thanks so much for your response. I'd read about configs but hadn't connected the idea of using them for process-level overwrites of input arguments, I'm already using them for executors so this is great and is sufficiently straightforward to work in our wider pipeline. I suspected there would be issues with including other strings in the primer sequence, and since Illumina PE ITS is one of our data types this is great to know. This would also be a good way to implement things like non-internal adapters, which I've used before when adapters/tags/indices haven't been correctly trimmed by sequencing facilities.

That could actually lead to too many sequences removed in the DADA2 chimera removal, please take care here.

I appreciate the note of caution, it's not something I'd considered - I assume that as long as primers have been removed successfully (or untrimmed discarded) then this shouldn't be an issue?

edit: would you suggest the global cutadapt parameter -O & -e are sufficient or do you think separate parameters for primers are required? I'd think global might be enough?

I agree that global parameters are sufficient. I can't imagine a metabarcoding situation where one might want to allow a lower overlap for one primer than another, if the data is poor enough to need this then it probably shouldn't be being used anyway.

Many thanks!

d4straub commented 2 years ago

I have added -O and -e as --cutadapt_min_overlap and --cutadapt_max_error_rate to the dev branch. This will be in the next release.

Let me know if you see other improvements that seem to be of broader interest!

tjcreedy commented 2 years ago

Thank you! I may well have more suggestions in the future, although I'm hoping to learn enough Nextflow that I can try to implement them myself rather than shouting from the sidelines.