nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Request to enable dynamic `cache` directive #5022

Open robsyme opened 1 month ago

robsyme commented 1 month ago

New feature

It would be helpful to be able to set the cache directive via a closure.

Usage scenario

It would sometimes be helpful to force a re-run of a specific task (in cases where the outputs are corrupted, for example). For users who don't have access to the run workdir, it would be helpful to be able to set the following configuration:

process {
  withName: MyProcess {
    cache = { task.tag != "mytag" }
  }
}

At the moment, this closure is not evaluated. It is simply compared directly to the available options, and we get the warning:

WARN: Unknown cache mode: Script_a35edbdc3426f6a0$_runScript_closure2$_closure5@30922f8d
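
For illustration, here is a minimal standalone Groovy sketch of the difference between matching the raw value and evaluating the closure first. The class `CacheModeSketch`, its methods, and the `KNOWN_MODES` list are hypothetical names made up for this example and are not Nextflow's actual internals; the sketch only assumes that the closure should see a `task` object when it is evaluated.

```groovy
// Hypothetical sketch: none of these names come from the Nextflow codebase.
class CacheModeSketch {
    // Values accepted by the `cache` directive, written as strings for matching.
    static final List<String> KNOWN_MODES = ['true', 'false', 'deep', 'lenient']

    // Mimics the reported behaviour: the value is stringified and matched as-is,
    // so a closure shows up as something like "Script...$_closure...@30922f8d".
    static String resolveStatic(Object value) {
        String mode = value.toString()
        return (mode in KNOWN_MODES) ? mode : "Unknown cache mode: ${mode}"
    }

    // Requested behaviour: evaluate the closure against the task context first,
    // then match whatever it returns.
    static String resolveDynamic(Object value, Map context) {
        if (value instanceof Closure) {
            Closure copy = (Closure) value.clone()
            copy.resolveStrategy = Closure.DELEGATE_FIRST
            copy.delegate = context   // lets the closure resolve `task`
            value = copy.call()
        }
        return resolveStatic(value)
    }
}

def setting = { task.tag != "mytag" }

// Without evaluation, the closure itself is treated as the mode name.
assert CacheModeSketch.resolveStatic(setting).startsWith('Unknown cache mode')

// With evaluation, only the task tagged "mytag" gets caching disabled.
assert CacheModeSketch.resolveDynamic(setting, [task: [tag: 'mytag']]) == 'false'
assert CacheModeSketch.resolveDynamic(setting, [task: [tag: 'other']]) == 'true'
```

The point of the sketch is only that the closure needs a per-task context before it can return a usable value, which is why a process-level lookup ends up seeing the closure object itself.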

Suggest implementation

Something in ProcessConfig, I suppose :D

pditommaso commented 1 month ago

If it's not supported, there's likely a reason...

bentsherman commented 1 month ago

It should be possible since the task inputs are resolved before the hash is computed: https://github.com/nextflow-io/nextflow/blob/e2e608140cdde1da39df4c911f56286015538228/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy#L2240-L2241

greenberga commented 1 month ago

Unless anyone has picked this up, I will give it a shot!

bentsherman commented 1 month ago

Feel free. I think you just need to get the hash mode from the TaskConfig instead of the ProcessConfig

pditommaso commented 2 weeks ago

What's the use case for this?

robsyme commented 2 weeks ago

To force a specific task to be recomputed.

We had a case where, even though the task exited with exit status 0, the output files were incomplete/corrupted. The user didn't have easy access to run `aws s3 rm s3://bucket/path/to/longtaskhashgoeshere/.exitcode`, so it would have been convenient to set `cache = false` for a specific task based on the `meta.id`.

pditommaso commented 2 weeks ago

Too smart! But then, it would be nicer to have a run option for it, e.g. -invalidate-tasks <names>

robsyme commented 2 weeks ago

I always forget that we can just add new features :D

How would you address the task to be retried? By task hash?

pditommaso commented 2 weeks ago

I was thinking just process name(s)

robsyme commented 2 weeks ago

Process-level cache invalidation is already possible:

process {
  withName: Example {
    cache = false
  }
}

The problem we're trying to solve here is task-level cache invalidation, so you'd need a way to address a specific task. My feeling is that you'd need to use either task-level variables (meta.id, for example) or the task hash.
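
As a sketch of what addressing a task by its variables could look like, the fragment below applies the closure form requested in this issue to `meta.id`. This is the proposed behaviour, not something confirmed to work at the time of this thread. It assumes the process declares an input named `meta` with an `id` field, and `sample_42` is a made-up identifier.

```groovy
process {
  withName: Example {
    // Proposed usage only: the closure would be evaluated once per task.
    // Assumes an input named `meta`; 'sample_42' is a hypothetical sample id.
    // Returning false would disable caching for that one task only.
    cache = { meta.id != 'sample_42' }
  }
}
```

Every other task of `Example` would keep its cached result, and only the task whose `meta.id` matches would be recomputed.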