Closed: rob-p closed this issue 3 years ago
I suspect there's something wrong in your pipeline script. Is the code available, or even better, do you have a test case that replicates the problem?
Hi @pditommaso,
Thank you for the super-quick response. I've added you as a collaborator with view permissions on the repository that contains the pipeline. Once the analyses are done (and the code cleaned up) this will be a public repository, but it's private for the time being. It is absolutely possible that there is something wrong in our pipeline script. Everyone in my lab (myself included) is a relative newbie to Nextflow and DSL2, so this is one of the first non-trivial pipelines we've written. I am sure there are many places where we are not following best practices (and are making outright mistakes in feature usage). Any suggestions you can provide would be greatly appreciated, especially any suggestions on how to avoid this "recursive link overwriting" issue.
Thanks! Rob
What's the name of the task affected by the above problem?
The problem appears in multiple cached tasks, but right now all of them seem to come from the ss_process workflow, which invokes the star_solo_quant process. Interestingly, for some specific tasks (combinations of running the process on a given dataset), only certain files appear to be overwritten by the links, while for other combinations, all of the files in that working directory are overwritten.
Does this happen also when running in stub mode?
I've not had the opportunity to use -resume in stub mode, as the pipeline ran to completion without issue. However, when the pipeline finishes in a single run, all of the relevant files seem to exist properly in the work directory and are linked properly from the publishDir. So, I believe this issue only arises when using -resume to restart a workflow after a task fails (or after the workflow is otherwise interrupted).
It could be related to the use of mode: 'rellink' in the publishDir; try removing it. I can't say much more without the exact command to replicate the problem.
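For reference, the publishDir modes being discussed look like the following (a minimal sketch; the output path params.outdir is a placeholder, not taken from the actual pipeline):

// current setting: relative symlinks from the publish dir into the work dir
publishDir params.outdir, mode: 'rellink'

// suggested for debugging: the default mode, absolute symlinks
publishDir params.outdir, mode: 'symlink'

// or, at the cost of disk space, publish concrete copies instead of links
publishDir params.outdir, mode: 'copy'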
Thanks for the suggestion. So this would fall back to symlink, right? I'd made use of rellink because in the past symlink had caused a Too many levels of symbolic links error when attempting to publish to the publishDir during the development of the pipeline. Now it potentially makes sense that these two might be related ... 🤔
in the past symlink had caused a Too many levels of symbolic links
This should not happen either; tasks should (generally) create concrete files, not symlinks. Are tasks in your pipeline writing outside the directory assigned by NF?
They read from external paths (e.g. indices), but only write to the output directories specified in the output tag.
Hi @pditommaso,
So after digging deeper and re-factoring the output of some of the tasks, the issue (both with rellinks and with symlinks) seems to be related to files appearing more than once in the output of a process. Does this make sense as a potential cause? There were processes where files were potentially matched by multiple globs, and so would appear more than once in the aggregate output of the process. If I am careful to make sure that doesn't happen, I don't seem to observe the issue.
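For illustration, a hypothetical output block in which the same files are matched by more than one glob (the names are made up, not from the actual pipeline):

output:
path "quant"          // the whole directory
path "quant/**"       // every nested file, matched again
path "quant/*.json"   // some of them a third time

Collapsing such declarations so that each file is emitted only once avoids the duplicate listing.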
Is the duplicate listing of a file in the output a known problem for the linking mechanisms? I realize it's not ideal to have it anyway (it resulted from lazy globbing to make sure everything we want appears in the publishDir), but I was unable to find this listed as an issue in the documentation, so I want to make sure I didn't just overlook it.
Thanks! Rob
Having more than one output for the same file is allowed, so I'm not understanding the cause of the problem. I'm really keen to help, but I'd need a one-liner command to replicate the issue.
As discussed on nf-core Slack in #help yesterday, I believe the issue was arising as a result of declaring a directory and one or more of its nested files as separate output entries, like below:
output:
path "quast"
path "quast/report.tsv"
Is that right @rob-p? What mode are you using when setting publishDir?
We had a chat about this on nf-core Slack in #viralrecon (June 2020) @pditommaso, because I had a similar issue, but in my case only the quast/report.tsv was being published as opposed to the entire directory contents (with mode: copy). You suggested it may have something to do with the shared parent path messing things up.
it may have something to do with the shared parent path messing things up.
It sounds possible; we need a test case to debug the problem.
Thanks @drpatelh and @pditommaso. Yes, @drpatelh, I believe your characterization was precisely the case that was causing the issue. The frustrating thing is that the issue itself was rather stochastic (some files were recursive links, while others weren't). When I get a chance, I'll see if I can make / help make a MWE.
Hi @pditommaso,
Ok, this will repro the behavior for me (https://gist.github.com/rob-p/11e1d0d9b76f28816a9fc89224835d68). I run this as
$ ./nextflow run main.nf -stub
I used stubs just to check that it does also happen in stub mode. When I then go to examine the output, the first file is created fine, but on the second I get the Too many levels of symbolic links error. So, I think this reproduces the minimal core issue at the center of the behavior I was seeing in the larger pipeline.
The execution log suggests no particular issues.
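For readers without access to the gist, a minimal sketch of this kind of reproducer (hypothetical, assembled from the discussion below; not the exact gist contents):

// main.nf -- declares a directory AND a file nested inside it as outputs
nextflow.enable.dsl = 2

process make_nested {
    publishDir 'output', mode: 'symlink'

    input:
    val idx

    output:
    path "${idx}_foo"            // the directory ...
    path "${idx}_foo/bar.txt"    // ... and a file nested inside it

    script:
    """
    mkdir -p ${idx}_foo
    echo hello > ${idx}_foo/bar.txt
    """

    stub:
    """
    mkdir -p ${idx}_foo
    touch ${idx}_foo/bar.txt
    """
}

workflow {
    Channel.of(1, 2) | make_nested
}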
Gotcha! working on it
Quite a tricky thing: the problem arises when declaring as output a directory AND a nested file. What happens is that NF first creates a symlink to the directory in the declared publishDir, e.g.
/work/xx/yy/1_foo -[SYMLINK]-> /output/1_foo
then it tries to create the link for the nested file, e.g.
/work/xx/yy/1_foo/bar.txt -[SYMLINK]-> /output/1_foo/bar.txt
BUT the path /output/1_foo/bar.txt already exists given the previous link, so a FileAlreadyExistsException is thrown; NF then deletes it, but in doing so it deletes the original file ... and all the mess starts.
I'm investigating a proper solution; as a short-term workaround, avoid the nested output declaration.
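In practice the workaround just means declaring either the directory or the nested file, not both. A sketch using the quast example from earlier (if the nested file is also needed downstream, it can be selected from the directory output instead):

output:
// avoid listing both the directory and a file nested inside it;
// publishing the directory already publishes everything underneath it
path "quast"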
Thanks @pditommaso --- great; I'm glad the MWE reproduces the issue for you, and that you're already on the trail of the underlying issue. I figured it would be something subtle / tricky like this. I've also already started adopting your proposed work-around. At least in the cases that issues arose so far, the nested inclusion in the output rules was just a result of my being lazy in declaring the output. I imagine there may be legitimate cases when a pattern like this arises, but I've not encountered one yet. I'll keep tracking the issue and am happy to discuss proposed solutions (though I'm unfamiliar with the underlying nextflow code, so any thoughts or feedback I have would be rather generic / abstract).
I think I've found a good solution: computing a prefix tree and creating a link only for the paths that do not overlap, e.g. if the declared outputs overlap (a directory and a file nested inside it), only the first should be taken. But still, tricky situations can arise if the pattern of the outputs changes across restarts or the target path already contains a symlink.
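A rough illustration of the idea in plain Groovy (hypothetical, not the code from the actual patch): drop any publish target whose path is already covered by another declared output.

// outputs as declared by a process (example values only)
def outputs = ['1_foo', '1_foo/bar.txt', 'other.txt']

// keep only paths that are not nested under another declared output
def toPublish = outputs.findAll { p ->
    !outputs.any { q -> q != p && p.startsWith(q + '/') }
}

assert toPublish == ['1_foo', 'other.txt']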
Uploaded a patch 190aa4eaf.
This patch has been merged on master and it will be included in the next release. Thanks a lot for spending time to report it, it was quite a tricky issue.
Closing this ticket. Feel free to comment or reopen if needed.
Thanks @pditommaso for the super-quick response time and for tracking down and resolving the issue. While I am not very familiar with the internals of nextflow, I did take the time to read over the patch, and it certainly seems like a relatively clean and robust solution to the problem!
Bug report
We've been testing out nextflow for putting together a pipeline for running some different scRNA-seq processing tools. The ecosystem has been incredible, and the reporting and introspection capabilities have been amazing. Also, nf-tower has been a game changer! With that out of the way, I'm experiencing the following bug, which is pretty critical, especially as we are doing pipeline runs that take many hours and execute many tools.
The behavior we're observing is that when "resuming" a run with the -resume flag, the appropriate processes are flagged as cached; however, the actual underlying files for the cached processes seem to be gone. The file names show up in the published directory, but they link to files in the work directory (which is fine). However, the files in the work directory link back to themselves ... recursively ... and so the actual file content seems to be blown away.

Expected behavior and actual behavior
Expected behavior: The original files that constitute the output of the cached run remain intact in the work directory.
Actual behavior: The original files are overwritten with links to themselves (see the screenshot below).
Steps to reproduce the problem
Use the -resume flag to resume a workflow where previous steps have succeeded and are cached but other steps remain to be executed. It seems that only cached steps are affected by this problem.
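For example, assuming the pipeline entry script is main.nf:

nextflow run main.nf -resume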
Program output
There is no unexpected output for the process. The log correctly shows these processes as cached, but the output files are affected.
Environment
nextflow version 21.04.1.5556
This is free software; you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.