save_merged discards the merged reads

santiagorevale commented 2 months ago

Description of the bug

Hi there!

I'm not sure if this is the intended usage, although I don't think so. When passing to the pipeline the option --save_merged, this both triggers FASTP to merge the paired-end reads and to save (and return) the merged FastQ file.

However, afterwards the pipeline doesn't make any use of these reads ending up being completely discarded.

The ideal behaviour if reads are merged should be to run UNICYCLER with all three options -1 reads.filtered.R1.fastq -2 reads.filtered.R2.fastq -s reads.merged.fastq so it makes use of the merged reads as single end reads as well as the remaining paired-end reads.

Alternatively, if the goal of the option is to discard the reads that could be merged, then I might not want to save them, just to discard them, in which case I would have to pass the -m option on my own using task.ext.args, which is fine, but we might want to clarify this on the documentation? Because people might think that if there's an option to save the merged FastQ file, then the pipeline might be merging the reads by default.

Command used and terminal output

No response

Relevant files

No response

System information

Version of nf-core/bacass: both master and dev

Daniel-VM commented 2 months ago

Hi santiagorevale,

I'm not sure if this is the intended usage, although I don't think so. When passing to the pipeline the option --save_merged, this both triggers FASTP to merge the paired-end reads and to save (and return) the merged FastQ file.

So far the --save_merged option is intended to save merged reads only (those are not being used for downstream analysis). Indeed, this param is not working as expected. I am PRing a fix.

The ideal behaviour if reads are merged should be to run UNICYCLER with all three options -1 reads.filtered.R1.fastq -2 reads.filtered.R2.fastq -s reads.merged.fastq so it makes use of the merged reads as single end reads as well as the remaining paired-end reads.

According to the Unicycler's documentation the -s param is intended to use unpaired reads... I am not sure how this can affect the pipeline, whether if it improves assemblies (passing merged reads as unpaired reads) or not...

Alternatively, if the goal of the option is to discard the reads that could be merged, then I might not want to save them, just to discard them, in which case I would have to pass the -m option on my own using task.ext.args, which is fine, but we might want to clarify this on the documentation? Because people might think that if there's an option to save the merged FastQ file, then the pipeline might be merging the reads by default.

If params.save_merged == true, the -m option will be enabled and *.merged.fastq.gz files will be saved. However, if it is false, the merging step in FASTP won’t be enabled, and no *.merged.fastq.gz files will be saved (PR in progress). This is the expected behavior, isn't it?. We can clarify it in the doc of course. So far the --save_merged params says that its a boolean value that will enable saving or discarding merged reads... What would you suggest to add here to improve clarification? (feel free to make a PR ;) )

santiagorevale commented 2 months ago

Hi @Daniel-VM , here's what I think:

if the merged reads are not going to be passed to the assembly process and they will end up being discarded, then I don't see the point on having the reads merged; if that's the case, I would prefer just to have the adapters trimmed and keep the read pairs;
if we decide merging the reads, even to filter them out, would be a good thing, then I believe we should be splitting the --save_merged behaviour into "performing" and "saving". What I mean is: I could decide to perform read merging in order to discard those reads but I don't care about the resulting merged file (e.g. --filter_merged_reads true), or I could decide to both perform read merging and saving the resulting file ((e.g. --filter_merged_reads true --save_merged true).
if we decide to leave the code as is, I would say to update the documentation of the parameter to something like
- "This option enables read merging and saves all merged reads to a file ending in *.merged.fastq.gz; these reads will be discarded and will NOT be passed to the assembly process."

According to the Unicycler's documentation the -s param is intended to use unpaired reads... I am not sure how this can affect the pipeline, whether if it improves assemblies (passing merged reads as unpaired reads) or not...

I don't know if it'll improve it or not. I just know that if you provide the 3 options altogether it doesn't complain. I posted a question at Unicycler's Github repo, issue 341. Let's see what they say about it.

Daniel-VM commented 2 months ago

🙌🏾 Good point.

It makes sense to allow users to decide whether to save or save and use merged.fq.gz files in downstream steps. Although it doesn't seem to significantly improve assembly results (as noted in unicycler#341), an ext.arg could be added to UNICYCLER when --filter_merged_reads true.

But thinking twice, this feature should be extended to all assembly tools for consistency (I am not sure this is possible). Otherwise, the parameter for using merged reads during genome assembly would need to be specific to Unicycler (that is adding a tag such as --unicycler_use_merged_reads).

if we decide to leave the code as is, I would say to update the documentation of the parameter to something like
* _"This option enables read merging and saves all merged reads to a file ending in `*.merged.fastq.gz`; these reads will be discarded and will NOT be passed to the assembly process."_

Sure, I think this description fit there even if we decide to leave it as is.

sralchemab commented 2 months ago

if we decide to leave the code as is, I would say to update the documentation of the parameter to something like

"This option enables read merging and saves all merged reads to a file ending in *.merged.fastq.gz; these reads will be discarded and will NOT be passed to the assembly process."

Sure, I think this description fit there even if we decide to leave it as is.

Yes, that's what I said, that if we were to leave the code as is, we should update the description of the argument to this.

Based on Ryan's reply from Unicycler, I believe that the best way to proceed would actually be to remove this option completely from the pipeline. I don't think it's adding any value the way it is.

Also, if we decided to split the arguments into --save_merged and --filter_merged_reads, we would be having to update the nf-core fastp module as well, because its behaviour is to merge and save the reads if the argument --save_merged is provided, and we wouldn't be able to overwrite this behaviour using ext.args, unless we avoid using the --save_merged option and handle the ext.args on our own whether somebody passed true or false to bacass.

Daniel-VM commented 2 months ago

Thanks or your clarification.

I agree with your suggestion to remove the --save_merged argument. Moreover, removing this option seems like the simplest and cleaning solution.

Daniel-VM commented 2 months ago

Closing this. Please reopen it if necessary

nf-core / bacass