nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io

Files in working directory get overwritten with links to themselves when `-resume`(ing) a pipeline run. #2177

Closed: rob-p closed this issue 3 years ago

rob-p commented 3 years ago

Bug report

We've been testing out nextflow for putting together a pipeline that runs several different scRNA-seq processing tools. The ecosystem has been incredible, and the reporting and introspection capabilities have been amazing. Also, nf-tower has been a game changer! With that out of the way, I'm experiencing the following bug, which is pretty critical, especially as we are doing pipeline runs that take many hours and execute many tools.

The behavior we're observing is that when resuming a run with the -resume flag, the appropriate processes are flagged as cached, but the actual underlying files for those cached processes seem to be gone. The file names show up in the published directory, and they link to files in the work directory (which is fine). However, the files in the work directory link back to themselves, recursively, and so the actual file content seems to be blown away.

Expected behavior and actual behavior

Expected behavior: the original files that constitute the output of the cached run remain intact in the work directory.
Actual behavior: the original files are overwritten with links to themselves (see the screenshot below):

[Screenshot: output files in the work directory replaced by symlinks pointing back to themselves]

Steps to reproduce the problem

Use the -resume flag to resume a workflow where previous steps have succeeded and are cached but other steps remain to be executed. It seems that only cached steps are affected by this problem.

Program output

There is no unexpected output for the process. The log correctly shows these processes as cached, but the output files are affected.

Environment


pditommaso commented 3 years ago

I suspect there's something wrong in your pipeline script. Is the code available, or even better, do you have a test case that replicates the problem?

rob-p commented 3 years ago

Hi @pditommaso,

Thank you for the super-quick response. I've added you as a collaborator with view permissions on the repository that contains the pipeline. Once the analyses are done (and the code cleaned up) this will be a public repository, but it's private for the time being. It is absolutely possible that there is something wrong in our pipeline script. I and everyone in my lab are relative newbies to nextflow and DSL2, and this is one of the first non-trivial pipelines we've written. I am sure there are many places where we are not following best practices (and making outright mistakes in feature usage). Any suggestions you can provide would be greatly appreciated, but especially any suggestions on how to avoid this "recursive link overwriting" issue.

Thanks! Rob

pditommaso commented 3 years ago

What's the name of the task affected by the above problem?

rob-p commented 3 years ago

The problem appears in multiple cached tasks, but right now all of them seem to be in the ss_process workflow that invokes the star_solo_quant process. Interestingly, for some specific tasks (combinations of running the process on a given dataset), only certain files appear to be overwritten by the links, while for other combinations, all of the files in that working directory are overwritten.

pditommaso commented 3 years ago

Does this happen also when running in stub mode?

rob-p commented 3 years ago

I've not had the opportunity to make use of -resume in stub mode, as the pipeline ran to completion without issue. However, when the pipeline finishes in a single run, all of the relevant files seem to exist properly in the work directory, and are linked properly from the publishDir. So, I believe this issue only arises when using -resume to restart a workflow after a task fails (or after the workflow is otherwise interrupted).
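For context, a stub block just provides placeholder outputs when running with -stub, so the wiring can be checked without running the real tools; a minimal sketch (process, file, and command names here are placeholders, not our actual ones):

    process demo_quant {
        // illustrative publish settings only
        publishDir "results", mode: 'rellink'

        output:
        path "counts.tsv"

        script:
        """
        real_quant_tool > counts.tsv     # hypothetical command
        """

        stub:
        """
        touch counts.tsv                 # placeholder output used with -stub
        """
    }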

pditommaso commented 3 years ago

It could be related to the use of mode: 'rellink' in the publishDir; try removing it. Can't say much more without the exact command to replicate the problem.
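For reference, the change I mean is only the mode argument of publishDir: 'rellink' publishes a relative symlink, 'symlink' (the default) an absolute one, and 'copy' a real copy. A minimal sketch, with illustrative names:

    process demo_publish {
        // switching from mode: 'rellink' to the default 'symlink'
        publishDir "results/demo", mode: 'symlink'

        output:
        path "out.txt"

        script:
        """
        echo hello > out.txt
        """
    }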

rob-p commented 3 years ago

Thanks for the suggestion. So this would fall back to symlink, right? I'd made use of rellink because, in the past, symlink had caused a Too many levels of symbolic links error when attempting to publish to the publishDir during development of the pipeline. Now it potentially makes sense that these two might be related ... 🤔 .

pditommaso commented 3 years ago

in the past symlink had caused a Too many levels of symbolic links

This should not happen either; tasks should (generally) create concrete files, not symlinks. Are tasks in your pipeline writing outside the directory assigned by NF?

rob-p commented 3 years ago

They read from external paths (e.g. indices), but only write to the output directories specified in the output tag.
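To illustrate the pattern (index, reads, and tool names here are placeholders): the external index is staged in as a path input, and the task writes only inside its own work directory.

    process demo_align {
        input:
        path index        // staged from an external location, read-only
        path reads

        output:
        path "quant_out"  // created inside the task's work directory

        script:
        """
        some_quant_tool --index ${index} --reads ${reads} --out quant_out
        """
    }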

rob-p commented 3 years ago

Hi @pditommaso,

So after digging deeper and refactoring the output of some of the tasks, the issue (both with rellinks and with symlinks) seems to be related to files appearing more than once in the output of a process. Does this make sense as a potential cause? There were processes where files could be matched by multiple globs, and so would appear more than once in the aggregate output of the process. If I am careful to make sure that doesn't happen, I don't seem to observe the issue.

Is listing the same file more than once in the output a known problem for the linking mechanisms? I realize it's not ideal anyway (it resulted from lazy globbing to make sure everything we want appears in the publishDir), but I was unable to find this listed as an issue in the documentation, so I want to make sure I didn't just overlook it.
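To illustrate the kind of overlap I mean (file names are placeholders, not our actual outputs): one glob already covers a file that is also listed explicitly.

    output:
    path "results/*"             // already matches results/summary.txt ...
    path "results/summary.txt"   // ... which is then emitted a second time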

Thanks! Rob

pditommaso commented 3 years ago

Having more than one output for the same file is allowed, so I'm not understanding the cause of the problem. Really keen to help, but I'd need a one-liner command to replicate the issue.

drpatelh commented 3 years ago

As discussed on nf-core Slack in #help yesterday, I believe the issue was arising as a result of outputting a directory and one or more of its nested files as separate entries, like below:

output:
    path "quast"
    path "quast/report.tsv"

Is that right @rob-p ? What mode are you using when setting publishDir?

We had a chat about this on nf-core Slack in #viralrecon (June 2020), @pditommaso, because I had a similar issue, but in my case only quast/report.tsv was being published, as opposed to the entire directory contents (with mode: copy). You suggested it may have something to do with the shared parent path messing things up.
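For completeness, the full pattern would look roughly like this sketch (only the quast paths come from the case above; everything else is illustrative):

    process demo_quast {
        publishDir "results", mode: 'copy'

        output:
        path "quast"              // the whole directory ...
        path "quast/report.tsv"   // ... plus one of its nested files, declared again

        script:
        """
        mkdir quast
        touch quast/report.tsv
        """
    }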

pditommaso commented 3 years ago

it may have something to do with the shared parent path messing things up.

It sounds possible; we need a test case to debug the problem.

rob-p commented 3 years ago

Thanks @drpatelh and @pditommaso. Yes, @drpatelh, I believe your characterization is precisely the case that was causing the issue. The frustrating thing is that the issue itself was rather stochastic (some files became recursive links, while others didn't). When I get a chance, I'll see if I can make / help make an MWE.

rob-p commented 3 years ago

Hi @pditommaso,

Ok, this reproduces the behavior for me (https://gist.github.com/rob-p/11e1d0d9b76f28816a9fc89224835d68). I run it as

$ ./nextflow run main.nf -stub

I used stubs just to check whether it also happens in stub mode. When I then go to examine the output, I get:

[Screenshot: the first output file is published correctly; the second fails with a "Too many levels of symbolic links" error]

So, as you can see, the first file is created fine, but for the second I get the Too many levels of symbolic links error. I think this reproduces the minimal core issue at the center of the behavior I was seeing in the larger pipeline.

The execution log suggests no particular issues.
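For anyone who can't open the gist, it boils down to something like the sketch below (not the literal gist content, but the same idea): a single process that declares both a directory and a file nested inside it as outputs and publishes them with rellink, run with ./nextflow run main.nf -stub.

    // main.nf -- an illustrative sketch, not the linked gist
    nextflow.enable.dsl = 2

    process make_nested {
        publishDir "output", mode: 'rellink'

        output:
        path "1_foo"
        path "1_foo/bar.txt"

        script:
        """
        mkdir 1_foo
        echo hello > 1_foo/bar.txt
        """

        stub:
        """
        mkdir 1_foo
        touch 1_foo/bar.txt
        """
    }

    workflow {
        make_nested()
    }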

pditommaso commented 3 years ago

Gotcha! Working on it.

pditommaso commented 3 years ago

Quite a tricky thing. The problem arises when declaring as output a directory AND a file nested inside it. What happens is that NF first creates a symlink for the directory in the declared publishDir, e.g.

/work/xx/yy/1_foo -[SYMLINK]-> /output/1_foo

then it tries to create the link for the nested file, e.g.

/work/xx/yy/1_foo/bar.txt -[SYMLINK]-> /output/1_foo/bar.txt

BUT the path /output/1_foo/bar.txt already exists because of the previous link, so a FileAlreadyExistsException is thrown. NF then deletes the existing path, but in doing so it deletes the original file ... and all the mess starts.

I'm investigating a proper solution; as a short-term workaround, avoid the nested output declaration.
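In practice the workaround means removing the overlap from the output declaration, e.g. (following the same illustrative names):

    output:
    // before (triggers the problem): both the directory and a nested file
    //     path "1_foo"
    //     path "1_foo/bar.txt"
    // after (workaround): declare only the directory ...
    path "1_foo"
    // ... or only the specific nested files, but not both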

rob-p commented 3 years ago

Thanks @pditommaso --- great; I'm glad the MWE reproduces the issue for you, and that you're already on the trail of the underlying cause. I figured it would be something subtle / tricky like this. I've already started adopting your proposed workaround. At least in the cases where the issue arose so far, the nested inclusion in the output rules was just a result of my being lazy in declaring the output. I imagine there may be legitimate cases where a pattern like this arises, but I've not encountered one yet. I'll keep tracking the issue and am happy to discuss proposed solutions (though I'm unfamiliar with the underlying nextflow code, so any thoughts or feedback I have would be rather generic / abstract).

pditommaso commented 3 years ago

I think I've found a good solution: computing a prefix tree and creating a link only for the paths that do not overlap, e.g. if the outputs include both a directory and a file nested inside it, only the first (the enclosing directory) should be taken. But tricky situations can still arise if the pattern of the outputs changes across restarts, or if the target path already contains a symlink.
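As a rough sketch of the idea (plain Groovy, not the actual code in the patch): sort the output paths and keep a path only if no already-kept path is the same as, or a parent of, it.

    // illustrative only; not Nextflow's implementation
    def dedupeNested(List<String> outputs) {
        def kept = []
        for( p in outputs.sort() ) {
            // drop the path if an already-kept path covers it
            def covered = kept.any { k -> p == k || p.startsWith(k + '/') }
            if( !covered )
                kept << p
        }
        return kept
    }

    assert dedupeNested(['1_foo', '1_foo/bar.txt']) == ['1_foo']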

pditommaso commented 3 years ago

Uploaded a patch 190aa4eaf.

pditommaso commented 3 years ago

This patch has been merged into master and will be included in the next release. Thanks a lot for taking the time to report it; it was quite a tricky issue.

Closing this ticket. Feel free to comment or reopen if needed.

rob-p commented 3 years ago

Thanks @pditommaso for the super-quick response time and for tracking down and resolving the issue. While I am not very familiar with the internals of nextflow, I did take the time to read over the patch, and it certainly seems like a relatively clean and robust solution to the problem!