nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Automatically delete files marked as temp as soon as not needed anymore #452

Open andreas-wilm opened 6 years ago

andreas-wilm commented 6 years ago

To reduce the footprint of larger workflows, it would be very useful if temporary files (which are marked as such) could be automatically deleted once they are no longer needed. Yes, this breaks reruns, but for easily recomputed files or very large ones (footprint), it makes sense. Using scratch (see #230) is not always possible/wanted (e.g. for very large files and small scratch). It's also not always possible for a user to delete those files (except at the very end of the workflow), because multiple downstream processes running at different times might require them. This feature is, for example, implemented in Snakemake, but maybe it's easily done there because the DAG is computed in advance?

Note, this is different from issue #165, where the goal was to remove non-declared files. That issue nevertheless contains a useful discussion of the topic.

Andreas

pditommaso commented 6 years ago

I'm adding this for reference. I agree that intermediate file handling needs to be improved, but it will require some internal refactoring. Need to investigate. cc @joshua-d-campbell

pditommaso commented 6 years ago

I've brainstormed a bit more about this issue, and it should actually be possible to remove intermediate output files without compromising the resume feature.

First problem, the runtime-generated DAG: though the execution graph is only generated at runtime, it's generally fully resolved immediately after the workflow execution starts. Therefore it would be enough to defer output deletion until the execution DAG is fully resolved, which is just after the run invocation and before termination.

The second problem is how to identify tasks eligible for output removal. This could be done by intercepting a task's (successful) completion event: infer the upstream tasks in the DAG (easy), and if ALL dependent tasks have completed successfully, then clean up the task work directory (note that each task can have more than one downstream task). Finally, a task whose outputs have been removed must be marked with a special flag, e.g. cached=true in the trace record.
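For illustration, a minimal Groovy-style sketch of that completion-event logic. This is not actual Nextflow internals; the names (dag, upstreamOf, consumersOf, trace) are hypothetical:

```groovy
// Hypothetical completion handler -- a sketch of the logic above.
void onTaskComplete(TaskRun task) {
    // each upstream producer whose consumers have ALL completed
    // successfully becomes eligible for cleanup
    for( producer in dag.upstreamOf(task) ) {
        def consumers = dag.consumersOf(producer)      // may be more than one
        if( consumers.every { it.completedSuccessfully() } ) {
            producer.workDir.deleteDir()               // remove the task work dir
            trace.markCleaned(producer)                // e.g. flag the trace record
        }
    }
}
```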

Third, the resume process needs to be re-implemented to take this logic into consideration. Currently, when the -resume flag is specified, the pipeline is just re-executed from the beginning, skipping the processes for which the output files already exist; all (dataflow) output channels are created by binding the existing output files to those channels.

With the new approach this is no longer possible because the files are deleted, so execution has to be skipped up to the first successfully executed task for which the (above) cached flag is not true. This means the output files of the last executed task can be picked up, re-injected into the dataflow network, and the run restarted from there.

This may require introducing a new resume command (#544). It could also be used to implement a kind of dry-run feature, as suggested in #844. Finally, this could also solve #828.

lucacozzuto commented 5 years ago

My two cents: if you can use a flag for indexing the processes (i.e. the sample name), you can define a terminal process that, once completed, triggers deletion of the folders connected to that ID. I'm imagining a situation like this:

[sampleID][PROCESS 1] = COMPLETED
[sampleID][PROCESS 2] = COMPLETED
[sampleID][PROCESS 3] = COMPLETED
[sampleID][PROCESS 4 / TERMINAL] = COMPLETED

remove the folders of PROCESS 1 / 2 / 3 / 4 for that [sampleID]

In case you need to resume the pipeline, these samples will be re-run if the data are still in the input folder.
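A rough Nextflow sketch of this idea, assuming each per-sample process also emits its task work directory alongside the sample ID (all names here are invented):

```nextflow
// Hypothetical terminal process: receives the sample ID plus the collected
// work directories of the upstream per-sample tasks, and deletes them once
// everything for that sample has completed.
process CLEANUP_SAMPLE {
    input:
    tuple val(sampleID), val(workDirs)

    script:
    """
    # remove the work directories of the completed per-sample processes
    for d in ${workDirs.join(' ')}; do
        rm -rf "\$d"
    done
    """
}
```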

lucacozzuto commented 5 years ago

Quick comment: with this feature it will be possible to keep an instance of nextflow running (with watchPath) without having storage problems.

PeteClapham commented 5 years ago

So at a high level, I think I'm missing something. If the state data remains in files, removing old items is a good thing to do, but will this increase filesystem IO contention and locking as we increase the scale of analysis?

pditommaso commented 5 years ago

Since each Nextflow task has its own work directory, and those directories would be deleted when the data is not needed (read: accessed) any more, I don't see why there should be IO contention on those files. Am I missing something?

lucacozzuto commented 4 years ago

I was thinking that a directive allowing the removal of input files when a process is finished would reduce the amount of space needed by a workflow. It would also allow removing the whole folders containing the input files, so that we reduce the number of folders too.

Of course, this will not work if these files are needed by other processes.

Maybe with the new DSL2, where you have to make the graph explicit, this can be achieved. If the cleaning conflicts with a workflow/process, an error can be triggered.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

fmorency commented 3 years ago

I like the ideas in this thread. Automatic removal of "intermediate process files" would be great.

olavurmortensen commented 3 years ago

This feature would be a game changer. As an example, one of our pipelines processes ~10 GB of data and produces ~100 GB of temporary data; as a result, the bottleneck is not CPU or memory but disk space. This severely limits the throughput of our lab and results in poor utilization of processing power.

jvivian-atreca commented 3 years ago

I'm running into this with a pipeline that has similar characteristics to what @olavurmortensen is describing: the temporary files produced by one tool are very large, so while this workflow's output is maybe a couple hundred gigs, it will need something like 7,000+ GB of disk space during execution.

That said, is there any reason that temporary file cleanup isn't the purview of the process's script? There are several ways to delete anything that doesn't match a specific pattern in bash, thereby removing all temporary files except the known inputs/outputs.
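For example, here is a hedged sketch of that in-script approach (the tool name and temp-dir layout are invented for illustration). Note that it only helps with scratch files a task creates for itself, not with upstream outputs that later tasks consume:

```nextflow
process ALIGN {
    input:
    path reads

    output:
    path 'sample.bam'

    script:
    """
    hypothetical_aligner --in ${reads} --out sample.bam --tmpdir tmp/
    rm -rf tmp/    # drop the tool's scratch files before the task ends,
                   # so only the declared output remains in the work dir
    """
}
```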

ewels commented 3 years ago

I agree that this feature would be extremely powerful 👍🏻

Maybe also worth noting that it would be good to keep the ability to not auto-clean the intermediate files as well. For example, a common(ish) use case for us in @nf_core is to resume a pipeline with different parameters, e.g. use a different aligner but still use the cached steps for the initial preprocessing. But the majority of the time the auto-cleaning would be what most users want, I think 🧹🧽

fredericlemoine commented 3 years ago

I am in the same situation, running a workflow with thousands of samples that would need dozens of TB. Lots of BAM files are needed by only 2 steps and then not anymore, and could be deleted. This feature would be very valuable 👍🏻

ewels commented 3 years ago

Just a note that this issue came up again on the nf-core Slack, and @spficklin has even implemented it at the workflow level in https://github.com/SystemsGenetics/GEMmaker

So there is definitely a need for this.

spficklin commented 3 years ago

To add to @ewels' comment: for our GEMmaker workflow (https://github.com/SystemsGenetics/GEMmaker), we needed a workflow that could process 26 thousand RNA-Seq samples from NCBI SRA, and we just didn't have the storage to deal with all of the intermediate files (and we had a dedicated 600TB storage system). I think this is really a critical issue for Nextflow because it prevents workflows from massively scaling. Once you hit your storage limit you're done, and if you're using the cloud you incur unnecessary storage costs. The only workaround is to run the workflow separately for separate groups of samples so as not to overrun storage limits, and that's just very cumbersome with massive numbers of samples.

The solution was really two-fold: in order not to overrun storage we had to first batch and then clean. We found that Nextflow tended to run "horizontally" (for lack of a better word) rather than "vertically". In other words, it tended to run all samples through step 1 before it would move on to step 2. So, even if we did clean intermediate files a few steps later, we would still overrun storage because we had intermediate files for 26K samples from earlier steps.

To batch samples, we implemented a folder system: initially there is a metadata file for each sample in a stage folder. The workflow moves a subset (equal to the number of cores specified) into a processing folder, and only works on samples with a metadata file in the processing folder. Once a sample is fully complete, its file moves from processing to a done folder and a new sample file is added to the processing folder, which the workflow sees and starts processing.

To clean intermediate files, we had to trick Nextflow into thinking the files were still there by wiping the intermediate files and replacing them with "sparse" files. Essentially we replace each file with an empty version, but because the file is "sparse" it still reports the same size (without actually consuming space on the file system). We have a little BASH script to do that (https://github.com/SystemsGenetics/GEMmaker/blob/master/bin/clean_work_files.sh).
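The core of the trick, sketched in bash (a simplification of the linked script, assuming GNU coreutils; the real script handles more cases):

```bash
file="$1"
size=$(stat --printf='%s' "$file")    # remember the original size in bytes
mtime=$(stat --printf='%y' "$file")   # remember the original modification time
truncate -s 0 "$file"                 # drop the contents, freeing the disk blocks
truncate -s "$size" "$file"           # grow back to the same apparent size (sparse)
touch -d "$mtime" "$file"             # restore mtime so the cache still matches
```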

I'm not necessarily advocating that Nextflow follow this approach; it adds complexity to the workflow, with a bunch of channels used just for signaling. But I describe it in case others want to borrow the idea until there is an official fix. I also wanted to point out the need to batch before cleaning, because cleaning alone doesn't necessarily solve the storage problem as the workflow runs.

ewels commented 3 years ago

Reminder for my future self so that I don't lose this again: https://github.com/nextflow-io/nextflow/issues/649 - there is an existing option cleanup = true that can be used at the base of a config and will wipe all intermediate work folders if the workflow exits successfully. This is (currently) undocumented.

Note that it's not the same as the feature requested in this issue, as it runs once the pipeline is complete and not as it goes along. But it may still be relevant for some people ending up here from Google.
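For reference, a minimal sketch of what that looks like:

```groovy
// nextflow.config -- `cleanup` is a top-level setting, not inside a scope
cleanup = true
```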

abhi18av commented 3 years ago

Thanks for sharing that @ewels, this is pretty handy.

In case you have tried this option, does it affect the resume functionality?

ewels commented 3 years ago

In case you have tried this option, does it affect the resume functionality?

I haven't tried it but @jfy133 did. And yes, it wipes all intermediate files, so resume is definitely dead 😅

lucacozzuto commented 3 years ago

Thanks for sharing this!

ewels commented 2 years ago

Note that whilst related, #2135 and the cleanup = true config are different from what was requested here. This issue was originally about deleting intermediate files during pipeline execution, after the downstream tasks requiring those files are complete.

pditommaso commented 2 years ago

Indeed

lucacozzuto commented 2 years ago

I was thinking about this some time ago. I think one of the problems will always be that Nextflow runs the processes in a horizontal way: if we have 10 steps, it is likely we will run the first one in parallel for everything and only reach the last process, when cleaning is allowed, at the very end. So I would link this problem to the possibility of running a pipeline vertically by batches (i.e. you read XXX samples and move on to the other steps, and only when you complete the workflow for an input file do you trigger a new execution).

spficklin commented 2 years ago

@lucacozzuto yes! We've hit on that as a problem in our workflow and had to employ a hacky solution just like you suggested to get around it (see my post above). I want to add my agreement with your comment.

lucacozzuto commented 2 years ago

A possible trick could be a new directive for making batches: you indicate that process X can consume N items from the input channel and then pause. When the other processes are finished, there is a cleanup and a new start. We could use something similar to storeDir to avoid recalculating something useful each time. The only problem would be a process that needs to be triggered when all the batches are processed...

pditommaso commented 2 years ago

lucacozzuto commented 2 years ago

Ehehehe, I'll bring you more coffee because of that

pinin4fjords commented 2 years ago

Just for my info, is there any prospect of this being addressed in the near future?

The inability to instantly delete intermediates (as e.g. Snakemake can do) is hitting us hard right now due to some stricter quotas, and our workflows are complex enough without doing https://github.com/nextflow-io/nextflow/issues/452#issuecomment-819733868.

lescai commented 2 years ago

Same here, I must admit. The overhead for some nf-core pipelines is 10X the raw data, which is a lot for our HPC environment.

ChiBia commented 2 years ago

Same problem here: I currently need to run nf-core/rnaseq on 700 samples but can't efficiently deal with the temporary files' space requirements. I'll implement the batch & clean approach suggested by @spficklin (thank you so much for sharing the scripts!)

spficklin commented 2 years ago

Hi @ChiBia, we have this infrastructure built into GEMmaker (https://gemmaker.readthedocs.io/en/latest/). Our workflow does not have as many options as the nf-core/rnaseq workflow, but it may save you the time of implementing your own RNA-seq workflow.

bentsherman commented 2 years ago

Linking my issue #2527 on task fusion because it is the only feasible solution to this problem that I've come across. Until then the only workarounds that I know of are to either (1) use GEMmaker's solution of stage/process/done directories + clean-work-files.sh or (2) manually combine processes into a single "end-to-end" process.

I hope to work on this feature in the near to mid-term, as it is crucial to running Nextflow pipelines at scale. In the meantime, I'm happy to offer guidance to people on implementing either of these workarounds in their own pipelines, just email me directly.

emosyne commented 2 years ago

cleanup = true

Hi, where do you put cleanup = true? In the config file? What branch? Thanks

jfy133 commented 2 years ago

cleanup = true

Hi, where do you put cleanup = true? In the config file? What branch? Thanks

https://www.nextflow.io/docs/latest/config.html?highlight=cleanup#miscellaneous

emosyne commented 2 years ago

thanks a lot.

bobamess commented 2 years ago

I usually put cleanup = true near the top of the config file, after defining things like taskName, workspace, and workDir, and before any named blocks. However, in my experience it only deletes the files within the subdirectories of workDir and not the subdirectories themselves, which would be nice. Though the last time I checked, it was with an older version of Nextflow.
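A sketch of the placement described above (values are placeholders):

```groovy
// top-level assignments first...
workDir = '/path/to/work'
cleanup = true

// ...named blocks after
process {
    executor = 'slurm'
}
```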

jgarces02 commented 1 year ago

@bentsherman's solution seems very attractive (clean_work_files.sh). Is there any possibility of including it in the next versions? (Or how can I tweak sarek to include it?)

bentsherman commented 1 year ago

@jgarces02 for now you have to wire the dependencies yourself; see the GEMmaker pipeline script for an example. I'm currently trying to automate this behavior by specifying e.g. temporary: true on a path process output. I'm nearly at the point of understanding the codebase well enough to actually know how to do it. 😅

spficklin commented 1 year ago

Wonderful, @bentsherman!

hw538 commented 1 year ago

Any exciting news about this feature request? :)

bentsherman commented 1 year ago

This feature has been on the backburner this year due to other pressing efforts, but we're finally beginning to make some headway. I'm currently working on a PR (#3463) that will allow Nextflow to track the full task graph, which will comprise the "first half" of this feature (but it's also useful for other things like provenance).

The second half will be to use the task graph to figure out when an output file can be deleted, something like:

  1. process outputs can be marked as temporary: path(bam_file, temporary: true)
  2. given a temporary output file F, delete F when all consumers of F are complete
  3. on a resumed run, mark F as cached if all consumers of F are cached

Still kinda fuzzy about point (3), but I think there are a number of possible ways to do it.
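A sketch of what point (1) might look like in a pipeline script (proposed syntax, not yet released at the time of writing; the process and file names are invented for illustration):

```nextflow
process ALIGN {
    output:
    // proposed option: mark the file as deletable once all
    // downstream consumers have completed
    path 'sample.bam', temporary: true

    script:
    """
    hypothetical_aligner --out sample.bam
    """
}
```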

spvensko commented 1 year ago

I am currently working on a blog post, hopefully published either later this week or early next week, that goes over examples of implementing GEMmaker's clean_work_files.sh strategy. The blog post will cover syntactical considerations of the implementation and a few pitfalls I encountered. I realize issue #3463 and associated future work will hopefully make this issue obsolete, but I think it's worth having a tutorial to help those who want to implement a solution in the meantime.

We've implemented this in our rather large neoantigen workflow (LENS), and it appears it will save us tons of storage.

bentsherman commented 1 year ago

@spvensko that's great, I agree it would be good to have a general example for people to reference in the meantime. I'd like to have such an example for the Nextflow patterns website (or wherever that content ends up in the website revamp), but I never got around to writing it myself. Looking forward to your blog post.

mribeirodantas commented 1 year ago

Please share it when it's done, @spvensko 😄

spvensko commented 1 year ago

Blog post is available now: https://pirl.unc.edu/blog/tricking-nextflows-caching-system-to-drastically-reduce-storage-usage

I'm going to be on PTO for the rest of the year, so hopefully there aren't any major issues with it. 😅

bentsherman commented 1 year ago

Folks, it's happening: #3818

Basically a minimal implementation of GEMmaker's "clean work files" approach directly in Nextflow. There are several caveats and limitations to consider, but even this piece should be enough to make production pipelines much more storage-efficient. Testing and feedback are appreciated! Feel free to message me on Slack if you don't want to clog up this issue.

stevekm commented 7 months ago

@bentsherman just wanted to follow up: is this feature 100% complete? I wasn't sure, since this issue is still marked as Open. Thanks.

bentsherman commented 7 months ago

The automatic cleanup works, but the resumability still has some issues. I had to focus on other things for a while, but I have picked up this effort again and hope to finish the resumability work in the next few months. See #3849 for updates.

If there are lots of people who don't care about the resumability piece, I could push to have the basic cleanup merged ASAP and complete the resumability in a separate effort. That would mean that for now, if you enable automatic cleanup and e.g. your pipeline fails half-way through due to some bug, you might not be able to resume because some task outputs will have been deleted.

cc @pditommaso @marcodelapierre for their thoughts

ewels commented 7 months ago

I would be in favour of getting the automatic cleanup feature in ASAP, with or without resumability 👍🏻

For quite a few people this can make the difference between being able to run a pipeline at all or not, at which point being able to resume it is purely a nicety.

We should definitely aim to have the full cake, but getting in the basic cleanup quickly would be very nice.

lescai commented 7 months ago

I agree with @ewels. When you analyse large datasets you might not have a choice; ideally you'd be in production, with any source of failure at least not being pipeline-related.

lucacozzuto commented 7 months ago

I also agree. Resuming a pipeline is important, but in some contexts you cannot even run the pipeline for lack of space.