nf-core / taxprofiler

Highly parallelised multi-taxonomic profiling of shotgun short- and long-read metagenomic data
https://nf-co.re/taxprofiler
MIT License

Save run-merged reads for single run samples #262

Closed · alexhbnr closed this issue 1 year ago

alexhbnr commented 1 year ago

Description of the bug

Maybe I am misunderstanding what particular parameters of the pipeline mean, but the way the parameter --save_runmerged_reads currently behaves when run merging is activated is not what I expected. I am not sure whether to file this as a bug or an enhancement, so apologies for calling it a bug.

I have the following use case: ten samples with multiple runs each that I would like merged prior to taxonomic profiling, plus a single sample with only one run that doesn't require merging. I would like to have the FastQ files after pre-processing and, if necessary, run merging for all of these samples.

At the moment, enabling --save_runmerged_reads gives me the FastQ files for the ten samples that required run merging, but not for the single sample that doesn't need it. I haven't enabled any other parameters for keeping files, e.g. after host removal (--save_hostremoval_unmapped). Therefore, I am currently missing the FastQ file for this single sample.

I personally would have expected the FastQ files of samples that don't need run merging to still be exported into the results directory. If this isn't happening on purpose, then I would suggest adding a parameter --save_reads_for_taxprofiling, or something along those lines, that allows getting the final FastQ files per sample. I would like to avoid having to export the reads both after host removal and after run merging, because that would duplicate all the data for the samples that were run-merged.

Command used and terminal output

nextflow run nf-core/taxprofiler -r dev -profile eva --input samplelist.csv --databases databaselist.csv --perform_shortread_qc --shortread_qc_tool adapterremoval --shortread_qc_mergepairs --shortread_qc_includeunmerged --shortread_qc_minlength 35 --perform_shortread_hostremoval --perform_runmerging --save_runmerged_reads --run_kraken2

Relevant files

No response

System information

No response

Midnighter commented 1 year ago

I understand the confusion. Since you supplied that single FASTQ to the pipeline, I guess you do have it somewhere, but I can see that it would be more convenient to have all reads together after run merging.

jfy133 commented 1 year ago

Yes, I agree it is not ideal.

Unfortunately this is an 'intrinsic' consequence of the 'reactive' design of Nextflow. It's not trivial to work out which is the 'final' FASTQ file that we could publish to a final directory, particularly when you are mixing different data inputs (in this case single-run vs multi-run).

We've had the same problem in eager too for a very long time.

The possible options I can see are:

  1. Do nothing to the code, just document this better.
  2. Dump all preprocessed files into a single directory.
    • This will still result in file redundancy, as you will still need to turn on saving of both possible 'final' files for run-merged and non-run-merged samples, but at least they are all in one directory.
  3. Come up with some very complex logic to work out which file to publish, and when, into a final results directory.
    • I keep trying different scenarios in my head but still haven't come up with a satisfactory solution that doesn't just result in horrible spaghetti conditional code.
    • e.g. publish A in the final directory, but only if X is on, Y and Z are turned off, and also only when A goes through run merging, else publish J. But then only publish B if Y is turned on and Z is off (etc. etc.)
  4. Add an additional dummy process that doesn't actually do anything, but allows us to 'save' the file in the directory.

(3) might be possible but will need a lot of thought going into it, and time to test all possible scenarios.

jfy133 commented 1 year ago

EDIT: I HAVE AN IDEA @Midnighter please validate:

What about having a .collectFile(storeDir: "${params.outdir}/analysis-ready-reads/") somewhere in the channel going into profiling?
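
i.e. something like this, where ch_reads_for_profiling is a stand-in for whatever channel actually feeds the profilers:

// Sketch only: copies every FASTQ heading into profiling (run-merged or
// not) into a single results directory, without an extra process.
ch_reads_for_profiling
    .map { meta, reads -> reads }   // keep just the files
    .flatten()                      // handles single files and lists alike
    .collectFile(storeDir: "${params.outdir}/analysis-ready-reads")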

Midnighter commented 1 year ago

I need to look at context to comment on your idea.

I was wondering: why not send all FASTQs to the merging process and publish everything, even if they are single runs?

jfy133 commented 1 year ago

A cat process that cats a single file into a new file is also redundant: it's just a copy, so to me that's a waste of computing resources, time and disk space

Midnighter commented 1 year ago

It could basically be a no-op for a single input file, but would allow publishing all of its output files in one directory easily. I'll think about your comment above tomorrow, though.
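
As a rough sketch of that no-op idea (single-end reads assumed for brevity; the process and file names are made up, this is not an actual taxprofiler module):

// Merges when a sample has several runs; for a single run it only
// symlinks, so the task itself is a no-op and only the publish copies.
process MERGE_OR_PASSTHROUGH {
    publishDir "${params.outdir}/analysis-ready-reads", mode: 'copy'

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}.merged.fastq.gz")

    script:
    def files = reads instanceof List ? reads : [reads]
    if( files.size() > 1 )
        """
        cat ${files.join(' ')} > ${meta.id}.merged.fastq.gz
        """
    else
        """
        ln -s ${files[0]} ${meta.id}.merged.fastq.gz
        """
}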

alexhbnr commented 1 year ago

OK, I am not so familiar with the Nextflow design. I would likely end up with some sort of spaghetti of if/else statements, but if there is a smarter way then that's the way to go.

Just stating that you have to activate both --save_hostremoval_unmapped and --save_runmerged_reads wouldn't really solve the

A cat process that cats a single file into a new file is also redundant: it's just a copy, so to me that's a waste of computing resources, time and disk space

issue because you would now duplicate all the data for the samples with multiple sequencing runs.

jfy133 commented 1 year ago

Sorry, to be clear: none of the above are actual solutions; they only partly address the problem, even if the outcome is 'we can't do it'.

Ultimately the issue is this: working out which step runs last isn't so hard (even if it takes a lot of conditionals); the problem is tracking the exceptional samples, where the file to be saved needs to come from an upstream step. In a way, this is a different type of metadata than just detecting which processes have been turned on or off.

jfy133 commented 1 year ago

So, for example, the conditional to save the output from run merging would be

params.perform_runmerging && params.save_final_reads

and for host removal

( !params.perform_runmerging || ( params.perform_runmerging && !meta.multirun ) ) && params.perform_<shortread/longread>_hostremoval && params.save_final_reads

I think that should work.
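
The run-merging conditional only touches params, so it could live in conf/modules.config as-is; a hypothetical fragment (the process selector and path are illustrative):

// Publish run-merged reads only when merging is on and saving is requested.
process {
    withName: 'CAT_FASTQ' {
        publishDir = [
            path: { "${params.outdir}/analysis-ready-reads" },
            mode: params.publish_dir_mode,
            enabled: params.perform_runmerging && params.save_final_reads
        ]
    }
}

The host-removal conditional is the awkward one, since meta is not in scope inside the config file's publishDir options.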

But the question now is how do we extract/collect the meta.multirun info?
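
One hypothetical way to derive it would be to group the raw input rows on meta.id before any processing and flag samples that appear more than once (channel names are made up):

// Count rows per sample id and stamp a multirun flag into each meta map.
ch_input
    .map { meta, reads -> [ meta.id, [ meta, reads ] ] }
    .groupTuple()
    .flatMap { id, entries ->
        entries.collect { entry ->
            def (meta, reads) = entry
            [ meta + [ multirun: entries.size() > 1 ], reads ]
        }
    }
    .set { ch_reads_flagged }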

jfy133 commented 1 year ago

If I can get this to work, maybe we could go one step better and at the same time directly generate a samplesheet of the processed reads that can be fed straight into other pipelines, e.g. mag (idea inspired by @rotifergirl).
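
e.g. a rough collectFile() sketch (columns only loosely modelled on mag's input format; note the paths written here would be Nextflow work-dir paths, so a real version would need to point at the published copies instead):

// Assemble a CSV samplesheet from whatever channel holds the final reads.
ch_reads_for_profiling
    .map { meta, reads ->
        def r = reads instanceof List ? reads : [reads]
        "${meta.id},${r[0]},${r.size() > 1 ? r[1] : ''}\n"
    }
    .collectFile(
        name: 'downstream_samplesheet.csv',
        storeDir: "${params.outdir}",
        seed: 'sample,fastq_1,fastq_2\n',
        sort: true
    )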

jfy133 commented 1 year ago

Should also x-ref: https://github.com/nf-core/eager/issues/945

jfy133 commented 1 year ago

Suggestion from Sarek: