nf-core / demultiplex

Demultiplexing pipeline for sequencing data
https://nf-co.re/demultiplex
MIT License

MultiQC not resuming #233

Open apeltzer opened 3 months ago

apeltzer commented 3 months ago

It is unclear whether this is intended or not; this needs to be verified.

apeltzer commented 2 months ago

Ask @fmalmeida what he had to do to make this work :)

edmundmiller commented 2 months ago

I thought there was a "don't cache" setting somewhere and that this was intended, but there isn't one. It happens on every nf-core pipeline...

@ewels Any thoughts on where this is coming from?

Might be better to move this to tools.

apeltzer commented 2 months ago

Thought the same initially, but it's not been set here. Not a major problem here anyway (and negligible runtime too, considering how much $$$ goes into demuxing an entire flowcell ;-)).

grst commented 2 months ago

I wouldn't say the runtime is negligible... on a recent large flow cell, MultiQC ran for ~1h (not sure how much of that time was spent staging in files, though).

I also never got why one would intentionally not resume multiqc...

fmalmeida commented 2 months ago

Hey hey hey. One thing that stops the MultiQC module from caching is the cache = false that is sometimes added, as @edmundmiller mentioned. The main cause, though, is that a lot of run-specific metadata is added to the MultiQC summary map, which makes this input map of metadata different for every run, so the task is never cached. See here:

https://github.com/nf-core/demultiplex/blob/master/lib/NfcoreTemplate.groovy#L72-L95
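
The mechanism described above can be illustrated with a toy model in Python. This is only a sketch of input-hash caching in general, not Nextflow's actual hashing code. The function `task_key`, the field names `runName` and `start`, and the run names are hypothetical stand-ins chosen for illustration.

```python
import hashlib
import json


def task_key(inputs: dict) -> str:
    """Return a deterministic hash of a task's inputs (toy model of a cache key)."""
    # sort_keys makes the serialization independent of dict insertion order
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two runs over identical report files; only the run-specific summary differs.
# All values below are made up for illustration.
run_a = {
    "reports": ["fastqc.zip"],
    "summary": {"runName": "happy_euler", "start": "2024-05-01T10:00:00"},
}
run_b = {
    "reports": ["fastqc.zip"],
    "summary": {"runName": "sad_turing", "start": "2024-05-02T09:30:00"},
}

same_data = run_a["reports"] == run_b["reports"]  # the real inputs are unchanged
cache_hit = task_key(run_a) == task_key(run_b)    # but the keys no longer match
```

In this model the report files are identical, yet the key differs because the summary metadata is part of the hashed input, which matches the behaviour reported in this thread.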

apeltzer commented 2 months ago

This means it's not easy to adapt this without changing what is ingested into workflow_summary_mqc.yaml and methods_description_mqc.yaml, as some variables in these two YAML files contain timestamps and are therefore updated on every resume. To be explicit: let's close this ticket, set cache = false in conf/modules.config for MultiQC (so that users get what they expect) and leave it as is. If we decide to take this on at some point, we can still do so in a later or patch release. Thanks for your points @fmalmeida :)

nschcolnicov commented 2 months ago

I assessed this in the current dev branch (commit id: 892b9d8cc5beade252777428bd6df440dd874468). The main conflicting channel is ch_multiqc_files, which contains two files that differ with each execution: workflow_summary_mqc.yaml and methods_description_mqc.yaml.

These files change with each execution because they contain data such as the execution timestamp and runName, among others. For MultiQC to resume, we would need to:

  1. Change the collect operator for ch_multiqc_files to add "sort: true", so that the files are always collected in the same order.
  2. Update the content of the workflow_summary_mqc.yaml file to remove runName, or develop a rule so that it reuses the runName of the previous execution if every other process was run from cache.
  3. Update the methods_description_mqc.yaml file so that it doesn't contain runName, timestamp, or any other value that changes between executions, or use a similar rule as for workflow_summary_mqc.yaml.
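
The effect of the steps above can be sketched with a toy Python model. It is not the pipeline's code: the helper `stable_key`, the `VOLATILE_KEYS` set, and the file and run names are all hypothetical. It only shows that sorting the collected files and dropping run-specific values gives the same key on a resumed run, while a real change in the inputs still changes it.

```python
import hashlib
import json

# Hypothetical set of metadata keys that change on every execution
VOLATILE_KEYS = {"runName", "start", "timestamp"}


def stable_key(files: list[str], summary: dict) -> str:
    """Hash inputs after sorting the file list and dropping volatile metadata."""
    normalized = {
        # sorting removes any dependence on the order files were collected in
        "files": sorted(files),
        # keep only metadata that is stable between executions
        "summary": {k: v for k, v in summary.items() if k not in VOLATILE_KEYS},
    }
    payload = json.dumps(normalized, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Same files in a different order, with a different (made-up) run name
first = stable_key(
    ["b_mqc.yaml", "a_mqc.yaml"],
    {"runName": "happy_euler", "pipeline": "demultiplex"},
)
resumed = stable_key(
    ["a_mqc.yaml", "b_mqc.yaml"],
    {"runName": "sad_turing", "pipeline": "demultiplex"},
)
# A genuine change in stable metadata must still invalidate the key
changed = stable_key(
    ["a_mqc.yaml", "b_mqc.yaml"],
    {"runName": "sad_turing", "pipeline": "other"},
)
```

Here `first` and `resumed` are equal, while `changed` differs. The trade-off is that any value removed from the hashed input can no longer trigger a re-run, which is why the choice of what counts as volatile matters.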

grst commented 2 months ago

Thanks for the analysis... If this is to be changed, then it should happen at the pipeline template level in nf-core/tools.

nschcolnicov commented 2 months ago

Added it: https://github.com/nf-core/demultiplex/pull/239

apeltzer commented 2 months ago

I will file an issue there and cross-reference this ticket, so we can take it up once the wider community has reached a decision... :) See this one: https://github.com/nf-core/tools/issues/3110