I understand the confusion. Since you supplied that single FASTQ to the pipeline, I guess you do have it somewhere, but I can see that it would be more convenient to have all reads together after run merging.
Yes, I agree it is not ideal.
Unfortunately this is an 'intrinsic' thing with the 'reactive' design of Nextflow. It's not trivial to work out which is the 'final' FASTQ file that we could publish to a final directory, particularly when you are mixing different data inputs (in this case single-run vs multi-run).
We've had the same problem in eager too for a very long time.
The possible options I can see are:
(3) might be possible but will need a lot of thought going into it, and time to test all possible scenarios.
EDIT: I HAVE AN IDEA @Midnighter please validate:
what about having a `.collectFile(storeDir: "${params.outdir}/analysis-ready-reads/")` somewhere in the channel going into profiling?
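To illustrate the idea, here is a minimal sketch, assuming a hypothetical channel name `ch_reads_for_profiling` that emits `[ meta, reads ]` tuples on their way into the profiling subworkflow (the channel name and output path are illustrative, not taxprofiler's actual ones):

```nextflow
// Sketch only: ch_reads_for_profiling stands in for whatever channel
// feeds the profiling subworkflow, emitting [ meta, reads ] tuples.
ch_reads_for_profiling
    .map { meta, reads -> reads }   // drop the metadata, keep just the file(s)
    .flatten()                      // emit one file per item
    .collectFile(storeDir: "${params.outdir}/analysis-ready-reads")
```

One caveat with this approach: when the incoming items are files, `collectFile` groups them by file name, so two samples whose FASTQs happen to share a name would be silently concatenated into a single output file.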
I need to look at the context to comment on your idea.
I was wondering: why not send all FASTQs to the merging process and publish everything, even if they are single runs?
A cat process that cats a single file into a new file is also redundant: it's just a copy, so to me that's a waste of computing resources, time and disk space
It could basically be a no-op for a single input file but would allow publishing all its output files in one directory easily. I'll think about your comment above tomorrow, though.
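A rough sketch of that no-op idea (the process name, parameter names, and output path here are illustrative, not taxprofiler's actual ones): every sample passes through one merging process, which only `cat`s when there is more than one run and otherwise just links the file, so a single `publishDir` covers everything:

```nextflow
process MERGE_RUNS {
    tag "${meta.id}"
    // every sample passes through here, so one publishDir covers all of them
    publishDir "${params.outdir}/analysis-ready-reads", mode: 'copy'

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}.merged.fastq.gz"), emit: reads

    script:
    // reads is staged as a List only when a sample has multiple runs
    if( reads instanceof List && reads.size() > 1 )
        """
        cat ${reads} > ${meta.id}.merged.fastq.gz
        """
    else
        """
        ln -s ${reads} ${meta.id}.merged.fastq.gz
        """
}
```

The symlink branch avoids re-writing the data in the work directory for single-run samples; the actual copy then only happens once, at publish time.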
OK, I am not so familiar with the Nextflow design. I would likely end up with some sort of spaghetti of if/else statements, but if there is a smarter way then that's the way to go.
Just stating that you have to activate both --save_hostremoval_unmapped and --save_runmerged_reads wouldn't really solve the "A cat process that cats a single file into a new file is also redundant: it's just a copy, so to me that's a waste of computing resources, time and disk space" issue, because you would now duplicate all the data for the samples with multiple sequencing runs.
Sorry, to be clear: none of the above are actual solutions, they only partly address the problem, even if the conclusion is 'we can't do it'.
Ultimately, working out what the final step that is run is isn't so hard (even if it takes a lot of conditionals); the problem is trying to track samples with exceptions, where the file to be saved needs to come from an upstream step. In a way, this is a different type of metadata than just detecting which processes have been turned on or off.
So, for example, the conditional to save the output from host removal when run merging is disabled entirely:

!params.perform_run_merging && params.save_final_reads

and the per-sample version for host removal, which also covers run-merged setups where a sample has only a single run:

( !params.perform_run_merging || params.perform_run_merging && !${meta.multirun} ) && params.perform_<shortread/longread>_hostremoval && params.save_final_reads

should work, I think. But the question now is how do we extract/collect the ${meta.multirun} info. See the sketch below for one possible approach.
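One way this could work (a sketch only, not tested against taxprofiler's actual channel layout; `ch_input` is a hypothetical name for the channel coming out of samplesheet parsing): count how many rows share a sample id and stamp the result into each meta map:

```nextflow
// Hypothetical: ch_input emits [ meta, reads ] straight from the samplesheet.
ch_input
    .map { meta, reads -> [ meta.id, [ meta, reads ] ] }
    .groupTuple()                              // gather all runs of a sample
    .flatMap { id, rows ->
        // rows.size() > 1 means this sample id appeared in several runs
        rows.collect { meta, reads ->
            [ meta + [ multirun: rows.size() > 1 ], reads ]
        }
    }
    .set { ch_input_with_multirun }
```

The flag could then drive the publish conditional inside the host-removal process definition, along these lines (assuming the process takes `tuple val(meta), path(reads)` as input, so `meta` is in scope):

```nextflow
publishDir "${params.outdir}/analysis-ready-reads",
    mode: 'copy',
    enabled: params.save_final_reads,
    // returning null from saveAs skips publishing for run-merged samples
    saveAs: { filename ->
        ( !params.perform_run_merging || !meta.multirun ) ? filename : null
    }
```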
If I can get this to work, maybe we could go one step further and at the same time directly generate a samplesheet of the processed reads that can go directly into other pipelines, e.g. mag (idea inspired by @rotifergirl).
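That could piggyback on the same mechanism; a sketch (channel name, file name, and column headers are made up here, and the real downstream pipeline's samplesheet format would need checking against, e.g., mag's docs):

```nextflow
// Hypothetical: ch_final_reads emits [ meta, reads ] for every sample's
// final, analysis-ready FASTQ(s).
ch_final_reads
    .map { meta, reads ->
        def files = reads instanceof List ? reads.join(';') : reads.toString()
        "${meta.id},${files}\n"                // one CSV row per sample
    }
    .collectFile(
        name: 'downstream_samplesheet.csv',
        storeDir: "${params.outdir}",
        seed: 'sample,fastq\n',                // header line, written once
        sort: true                             // deterministic row order
    )
```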
Should also x-ref: https://github.com/nf-core/eager/issues/945
Description of the bug
Maybe there is a misunderstanding about what particular parameters of the pipeline mean, but the way the parameter --save_runmerged_reads currently behaves when run merging is activated is not what I expected it to do. I am not sure whether to file this under 'bug' or 'enhancement', so apologies for calling it a bug.

I have the following use case: I have ten samples with multiple runs that I would like to have merged prior to taxonomic profiling, but one sample with only a single run that doesn't require merging. I would like to have the FastQ files after pre-processing and, if necessary, run merging for all these samples.
At the moment, enabling --save_runmerged_reads will give me the FastQ files for all ten samples that required run merging, but not for the single-run sample that doesn't need it. I haven't enabled any other parameters for keeping the files, e.g. after host removal (--save_hostremoval_unmapped). Therefore, I am currently lacking the FastQ for this single sample.

I personally would have expected that the FastQ files of samples that don't need run merging are still exported into the results directory. If this isn't happening on purpose, then I would suggest adding a parameter --save_reads_for_taxprofiling, or something along those lines, that allows getting the final FastQ files per sample. I would like to avoid having to export the reads both after host removal and after run merging, because it would duplicate all the data for the samples that were run merged.

Command used and terminal output
Relevant files
No response
System information
No response