nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License
188 stars 119 forks source link

Dada2 MergePairs : consensus between merging & concatenating reads #803

Open weber8thomas opened 6 days ago

weber8thomas commented 6 days ago

Authors

Context & description

This pull request introduces an enhancement to the DADA2::mergePairs() process within the pipeline, enabling conditional merge or concatenation of sequences based on the overlap between forward and reverse reads. Previously, the pipeline allowed only to use either one of the two methods (merging or concatenating) of paired-end reads, via the --concatenate_reads parameter. This enhancement introduces the "consensus" method, allowing the pipeline to dynamically determine the appropriate method—merging or concatenation—thereby improving sequence assembly accuracy and downstream analysis outcomes.

The core enhancement revolves around incorporating conditional logic to assess the overlap between paired-end reads and decide whether to merge or concatenate them. This decision is based on a specified overlap threshold, ensuring that only reads with adequate overlap are merged, while others are concatenated with a defined spacer.

Enhancement Highlights:

Parameters changed or introduced

By default, when setting mergepairs_strategy to merge or concatenate, fixed generic values will still be used regarding the paired-reads alignment (see below):

match = 1, mismatch = -64, gap = -64, minOverlap = 12, maxMismatch = 0

However, by setting mergepairs_strategy to consensus, we offer now the possibility to tweak those values in order to adjust the alignment done between the reads of the same pair. This can be done my modifying the following newly introduced parameters:

This approach is particularly beneficial for datasets containing both prokaryotic and eukaryotic sequences, where lengths vary, leading to differing overlap extents.

PR checklist

weber8thomas commented 5 days ago

Thanks, that looks much cleaner!

(1) When I run

nextflow run weber8thomas/ampliseq -r consensus-pr -profile test,singularity --outdir results_consensus-pr -resume --skip_qiime

the warning

WARN: Access to undefined parameter `concatenate_reads` -- Initialise it to a default value eg. `params.concatenate_reads = some_value

(2) when appending --asv_concatenate_reads I get the warning

ERROR ~ Validation of pipeline parameters failed!

 -- Check '.nextflow.log' file for details
The following invalid input values have been detected:

* --asv_concatenate_reads (true): Expected any of [[false, true, consensus]]

(3) appending --concatenate_reads "consensus" results into

ERROR ~ Unknown config attribute `params.match` -- check config file: /home/daniel/.nextflow/assets/weber8thomas/ampliseq/nextflow.config

I think all problems should be solved by my comments. Edit: thats not true, all occurrences of concatenate_reads have to be replaced by asv_concatenate_reads to solve (1)

Thanks for the feedback @d4straub ! I implemented your suggestions and the execution worked on my side with and without --asv_concatenate_reads consensus

EDIT : just noticed I have an issue with how match & mismatch values should be defined, here is the original code from @nhenry50 below:

        ext.args2 = [
            'minOverlap = 12, maxMismatch = 0, propagateCol = character(0), gap = -64, homo_gap = NULL, endsfree = TRUE, vec = FALSE',
            params.concatenate_reads == "consensus" ? "returnRejects = TRUE, match = 5, mismatch = -6" :
                params.concatenate_reads == "concatenate" ? "justConcatenate = TRUE, returnRejects = FALSE, match = 1, mismatch = -64" :
                "justConcatenate = FALSE, returnRejects = FALSE, match = 1, mismatch = -64"
        ].join(',').replaceAll('(,)*$', "")

How could I define default values to match = 1, mismatch = -64 if consensus is not used, and match = 5, mismatch = -6 if used ?

d4straub commented 5 days ago

How could I define default values to match = 1, mismatch = -64 if consensus is not used, and match = 5, mismatch = -6 if used ?

That might be an issue. I can currently only think of using two params (introducing one more), which will complicate things, or remove the params and fix the values in the config. That way the values are still modifiable using a config, but less accessible for the standard user because working with configs requires more knowledge. Therefore, if those values do not need to be changed all the time and the values you propose here are usually fine, I'd say remove the params and put the values in the config (e.g. mismatch = -64 instead of mismatch = ${params.asv_mismatch} and remove --asv_mismatch altogether).

weber8thomas commented 5 days ago

How could I define default values to match = 1, mismatch = -64 if consensus is not used, and match = 5, mismatch = -6 if used ?

That might be an issue. I can currently only think of using two params (introducing one more), which will complicate things, or remove the params and fix the values in the config. That way the values are still modifiable using a config, but less accessible for the standard user because working with configs requires more knowledge. Therefore, if those values do not need to be changed all the time and the values you propose here are usually fine, I'd say remove the params and put the values in the config (e.g. mismatch = -64 instead of mismatch = ${params.asv_mismatch}).

Actually, match = 1, mismatch = -64 should be fixed when consensus is not used but we would like to give the possibility to adjust values when consensus is enabled. Would the last commit be okay in that regard? The user would only be able to play with parameters like asv_match asv_mismatch ... when --asv_concatenate_reads consensus is set.

d4straub commented 5 days ago

Would the last commit be okay in that regard?

Yes, but the documentation should be adjusted accordingly. And the parameter names are also not really fitting any more.

weber8thomas commented 5 days ago

Would the last commit be okay in that regard?

Yes, but the documentation should be adjusted accordingly. And the parameter names are also not really fitting any more.

Will work on this! Which prefix would you suggest regarding the parameter names : asvconsensus, asv_mergepairsconsensus, mergepairsconsensus, denoisingconsensus ?

d4straub commented 5 days ago

mergepairs_consensus_ sounds good I think. Thats precise. Could you also rename the asv_ prefix? mergepairs_ seems better suited, since it only affects PE reads!? And asv_concatenate_reads seems now that it has 3 choices also imperfect. Maybe just mergepairs, or mergepairs_type or such? Not sure, I'm sure you'll find a intuitive name.

weber8thomas commented 5 days ago

mergepairs_consensus_ sounds good I think. Thats precise. Could you also rename the asv_ prefix? mergepairs_ seems better suited, since it only affects PE reads!? And asv_concatenate_reads seems now that it has 3 choices also imperfect. Maybe just mergepairs, or mergepairs_type or such? Not sure, I'm sure you'll find a intuitive name.

Thanks for the feedback! :)

I updated all the different parameters introduced with prefix mergepairs_consensus_ and changed asv_concatenate_reads to mergepairs_strategy, I think it capture properly the idea of that one.