Is anyone able to help with this? I am still unable to run nextflow using SGE and I can't figure out why it won't complete.
Try the following: add set -x at the beginning of .command.run, then submit the job with qsub and paste the console output here.
I have been playing around with the output of this, and also with the output when running the command locally, and from what I can see the qsubbed version gets stuck waiting for nxf_launch to complete, while locally this happens almost instantly. I then started looking into singularity exec with my container image and found that when I run it locally it doesn't seem to work: for example, if I run singularity exec image ls, the ls command is never executed at all.
Now, when I run it on the server it is a different story. Singularity begins executing, and converts the image file to a sandbox for running the script, outputting:
INFO: Convert SIF file to sandbox...
It then executes the ls command. After this it moves onto the next step:
INFO: Cleaning up image...
At this point, no matter how long I leave it, it just keeps on running this step. Now I had a look at the processes running with ps -ef. During the first conversion step I see the set of processes running to make singularity go:
-bash (PID: 13728, PPID: 13727)
singularity exec nanocompore_pipeline.img ls (PID: 13817, PPID: 13728)
/usr/bin/unsquashfs -user-xattrs -f -d /tmp/rootfs-900540057 /tmp/archive-009741604 (PID: 13833, PPID: 13817)
After the ls command has executed and it is stuck in the "Cleaning up image" step:
-bash (PID: 13728, PPID: 13727)
[starter] <defunct>
This defunct process won't go away until I kill the parent, so it seems the hang is being caused at this stage. Running singularity in debug mode, I get a "Child exited with exit status 0" message after the "Cleaning up image..." message, but then nothing else, so I am going to try to figure out what is going on there.
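For reference, the debug run mentioned above can be reproduced with Singularity's global --debug flag (assuming a reasonably recent Singularity release):

# Highest-verbosity logging; prints child lifecycle messages such as the
# "Child exited with exit status 0" line described above
singularity --debug exec nanocompore_pipeline.img ls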
This is tricky to debug. If you still have the task files, replace the content of .command.sh with
echo "=== shell: $SHELL"
echo "=== bash : $(bash --version)"
Then submit the job again using this command:
qsub -v NXF_DEBUG=1 .command.run
Then copy here the files .command.log, .command.out, .command.err and the console stdout.
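As an aside, the "+ NXF_DEBUG=0" and "+ [[ 0 > 1 ]]" lines in the traces further down suggest the launcher guards its own tracing roughly like this (a reconstruction from those traces, not the verbatim script):

# Default NXF_DEBUG to 0 and enable shell tracing only above level 1
NXF_DEBUG=${NXF_DEBUG:=0}
[[ $NXF_DEBUG > 1 ]] && set -x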
I do indeed have the task files, so I have just tried it and attached the files created: command.err.txt, command.log.txt, command.out.txt
As for the singularity issue that I thought was the cause: I don't actually think it is. I was attempting to debug it before Christmas, but by the time I picked it up again after Christmas there were no longer any problems and singularity could complete its job fine. However, the issue I'm having when running nextflow persists.
I do believe it may be related to the wait stage in nxf_main, as I mentioned in my original post, because I see this with ps xf.
If the tee commands have become zombies, then I guess the wait $tee1 $tee2 won't ever complete? This would then prevent nxf_unstage from running and the job from completing.
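For context, here is a stripped-down sketch of the pattern .command.run uses at this point, pieced together from the set -x traces later in this thread (variable names match the traces; the surrounding details are illustrative):

# Two named pipes receive the task's stdout and stderr
ctmp=$(mktemp -d /dev/shm/nxf.XXXXXXXXXX)
cout=$ctmp/.command.out; mkfifo $cout
cerr=$ctmp/.command.err; mkfifo $cerr

# A background tee on each pipe copies the stream into the work directory
tee .command.out < $cout &
tee1=$!
tee .command.err < $cerr >&2 &
tee2=$!

# The task writes into the pipes; the launcher waits for it to finish
/bin/bash -uex .command.sh > $cout 2> $cerr &
pid=$!
wait $pid

# ...and then waits for the tees; this is the wait that never returns
# in the hanging runs, where the tee processes are left as zombies
wait $tee1 $tee2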
I don't think the problem is the tee zombies; instead, it is likely related to the nxf_trace process in .command.run.
We need to isolate the issue with a minimal test case. Please try to replicate the problem using this simple pipeline, which just runs a 10-second sleep. Try the following commands using the local executor:
nextflow run https://github.com/pditommaso/nf-sleep
nextflow run https://github.com/pditommaso/nf-sleep -with-trace
nextflow run https://github.com/pditommaso/nf-sleep -with-singularity
nextflow run https://github.com/pditommaso/nf-sleep -with-singularity -with-trace
Which ones run and which fail?
I should start by explaining that we have two distinct sets of servers at work. In my OP I mentioned that when I run the pipeline directly on the server, it works with no problems. The server I was referring to is one we are only supposed to run small jobs on, so although the pipeline works there, I can't really use it. If I run any of the tasks above on this server, they complete successfully, as expected.
On the other set of servers we can also run jobs directly, as well as run via qsub. Running using the local executor here I have the same problem as when I run using the sge executor. Regardless of which test case I run, they all just keep running.
Ok, this means we can focus on the local executor only and it's not a matter of the SGE cluster. Can you confirm that the problem arises with all four of the above command lines?
I just checked again and the problem indeed arises with all four of the commands. I wondered whether it was something to do with the number of CPUs being requested, since in my original config file I was requesting three, but for good measure I changed it to request only one and the problem persisted.
This requires some patience to troubleshoot. Please copy and paste the output of these commands run on the offending computer:
$SHELL --version
and
date +%s%3N
Then run
NXF_DEBUG=2 nextflow run https://github.com/pditommaso/nf-sleep --timeout 5
When it hangs after 5 seconds, kill it and tar the working directory of the sole task executed and attach it here. Thanks.
The output of $SHELL --version:
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
The output of date +%s%3N:
1610155138388
And here is the working directory that I ran the nextflow task in:
Thank you so much for all of your assistance so far!
It looks like it hangs when waiting for the tee to terminate.
I'm starting to think it's some kind of bug in the Bash version you are using, or something related. A couple of things:
1) Could you also please verify the exact version of Bash on the system which is working correctly? Use the command
$SHELL --version
2) Could you edit the .command.run that hangs, replacing the line at the bottom
wait $tee1 $tee2
with
tail --pid=$tee1 -f /dev/null
tail --pid=$tee2 -f /dev/null
Then execute it again, using the command
bash -x .command.run
If it hangs again, put the process in the background (CTRL+Z) and run ps to check the current process status (please copy here both the .command.run stdout and the ps stdout).
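For what it's worth, this tail-based replacement relies on GNU coreutils tail: given --pid together with -f, it exits as soon as the watched process dies, and unlike the wait builtin it does not require the process to be a child of the current shell. A quick standalone check:

# tail -f on /dev/null produces no output; --pid makes it return once
# the watched process (a background sleep here) has exited
sleep 3 &
tail --pid=$! -f /dev/null
echo "sleep has exited"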
The version of bash seems to be identical on the server where it works fine:
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
I changed the run file and it did hang again:
bash -x .command.run
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ [[ -n '' ]]
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
+ true
+ set +e
++ set +u
++ nxf_mktemp /dev/shm
+ local ctmp=/dev/shm/nxf.QF8fpSIPte
+ local cout=/dev/shm/nxf.QF8fpSIPte/.command.out
+ mkfifo /dev/shm/nxf.QF8fpSIPte/.command.out
+ local cerr=/dev/shm/nxf.QF8fpSIPte/.command.err
+ mkfifo /dev/shm/nxf.QF8fpSIPte/.command.err
+ tee1=42184
+ tee .command.out
+ tee2=42185
+ tee .command.err
+ pid=42186
+ wait 42186
Hello (timeout 5)
+ nxf_launch
+ /bin/bash -uex /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.sh
+ echo 'Hello (timeout 5)'
+ sleep 5
+ exit 0
+ tail --pid=42184 -f /dev/null
Here is the output from ps:
PID TTY TIME CMD
42070 pts/0 00:00:00 bash
42176 pts/0 00:02:03 bash
42184 pts/0 00:00:00 tee <defunct>
42185 pts/0 00:00:00 tee <defunct>
42189 pts/0 00:00:00 tail
42190 pts/0 00:00:00 ps
Anyhow, you are right: the problem is the tee that remains in zombie status. I think I'm going to remove that wait. However, to better identify the problem, could you please add, just before the tail that was added, the following:
kill -CHLD $$
sleep 0.1
Run it again with bash -x .command.run and paste the stdout here. Thanks a lot.
Here is the stdout:
bash -x .command.run
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ [[ -n '' ]]
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
+ true
+ set +e
++ set +u
++ nxf_mktemp /dev/shm
+ local ctmp=/dev/shm/nxf.yFKflwgW99
+ local cout=/dev/shm/nxf.yFKflwgW99/.command.out
+ mkfifo /dev/shm/nxf.yFKflwgW99/.command.out
+ local cerr=/dev/shm/nxf.yFKflwgW99/.command.err
+ mkfifo /dev/shm/nxf.yFKflwgW99/.command.err
+ tee1=47176
+ tee .command.out
+ tee2=47177
+ tee .command.err
+ pid=47178
+ wait 47178
+ nxf_launch
+ /bin/bash -uex /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.sh
+ echo 'Hello (timeout 5)'
+ sleep 5
Hello (timeout 5)
+ exit 0
+ kill -CHLD 47168
+ sleep 0.1
At this point it just hangs as usual.
Weird that the last statement is not the tail, as in the previous run.
In any case, I've attached below a possible patch. Please unzip it and copy the script into the /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/ directory. Then run it with bash -x command-patched.run and copy the stdout as before. If it hangs again, please also add the output of ps f.
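Judging from the trace below, the patch replaces the bare wait with a bounded, timeout-wrapped tail; roughly (reconstructed from the "+ nxf_await 36027 1" trace lines, not from the patch itself):

# Wait at most $max seconds for $pid to disappear instead of
# blocking indefinitely on the wait builtin
nxf_await() {
    local pid=$1
    local max=$2
    timeout $max tail --pid=$pid -f /dev/null
}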
So this is stdout for the patch:
bash -x command-patched.run
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ [[ -n '' ]]
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
+ true
+ set +e
++ set +u
++ nxf_mktemp /dev/shm
+ local ctmp=/dev/shm/nxf.441xmy43g5
+ local cout=/dev/shm/nxf.441xmy43g5/.command.out
+ mkfifo /dev/shm/nxf.441xmy43g5/.command.out
+ local cerr=/dev/shm/nxf.441xmy43g5/.command.err
+ mkfifo /dev/shm/nxf.441xmy43g5/.command.err
+ tee1=36027
+ tee .command.out
+ tee2=36028
+ tee .command.err
+ pid=36029
+ wait 36029
Hello (timeout 5)
+ nxf_launch
+ /bin/bash -uex /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.sh
+ echo 'Hello (timeout 5)'
+ sleep 5
+ exit 0
+ nxf_await 36027 1
+ local pid=36027
+ local max=1
+ timeout 1 tail --pid=36027 -f /dev/null
It hung, so here is the output of ps f:
PID TTY STAT TIME COMMAND
35923 pts/4 Ss 0:00 -bash
36019 pts/4 T 0:57 \_ bash -x command-patched.run
36032 pts/4 Z 0:00 | \_ [timeout] <defunct>
36036 pts/4 R+ 0:00 \_ ps f
This is surprising: now it's the timeout that remains in zombie status and hangs the script. I really don't understand what's happening here. Is there a sysadmin who can assist you?
I discussed it with them a long time back and they were trying to figure out the cause of it not working. At that time they thought it might have been a problem with
local ctmp=$(set +u; nxf_mktemp /dev/shm 2>/dev/null || nxf_mktemp $TMPDIR)
because /dev/shm may be too small, but considering .command.out and .command.err are both written to correctly in the following lines, I figured that wasn't the case.
I'll see if they have any more insight now, though!
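For the record, the /dev/shm size theory is easy to check directly, since /dev/shm is a tmpfs mount:

# Shows the size and current usage of the shared-memory filesystem
df -h /dev/shm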
I've uploaded two other scripts to try to understand what is hanging the execution:
command-test1.run removes the wait on the tee commands
command-test2.run uses /tmp instead of /dev/shm for creating the temp pipes
If you could run both with the usual command, i.e. bash -x <script>, and report the stdout and the ps f output for each of them, it would help.
Okay, so here is the stdout for command-test1.run:
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ [[ -n '' ]]
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
+ true
+ set +e
++ set +u
++ nxf_mktemp /dev/shm
+ local ctmp=/dev/shm/nxf.05Ds3UJ3qK
+ local cout=/dev/shm/nxf.05Ds3UJ3qK/.command.out
+ mkfifo /dev/shm/nxf.05Ds3UJ3qK/.command.out
+ local cerr=/dev/shm/nxf.05Ds3UJ3qK/.command.err
+ mkfifo /dev/shm/nxf.05Ds3UJ3qK/.command.err
+ tee1=44345
+ tee .command.out
+ tee2=44346
+ tee .command.err
+ pid=44347
+ wait 44347
+ nxf_launch
+ /bin/bash -uex /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.sh
+ echo 'Hello (timeout 5)'
+ sleep 5
Hello (timeout 5)
+ exit 0
+ nxf_unstage
+ true
+ [[ 0 != 0 ]]
And here is the ps f output:
PID TTY STAT TIME COMMAND
44243 pts/7 Ss 0:00 -bash
44337 pts/7 T 0:57 \_ bash -x command-test1.run
44345 pts/7 Z 0:00 | \_ [tee] <defunct>
44346 pts/7 Z 0:00 | \_ [tee] <defunct>
44350 pts/7 R+ 0:00 \_ ps f
Then for command-test2.run:
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ [[ -n '' ]]
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
+ true
+ set +e
++ set +u
++ nxf_mktemp /tmp
+ local ctmp=/tmp/nxf.iYPteQwn2D
+ local cout=/tmp/nxf.iYPteQwn2D/.command.out
+ mkfifo /tmp/nxf.iYPteQwn2D/.command.out
+ local cerr=/tmp/nxf.iYPteQwn2D/.command.err
+ mkfifo /tmp/nxf.iYPteQwn2D/.command.err
+ tee1=44379
+ tee .command.out
+ tee2=44380
+ tee .command.err
+ pid=44381
+ wait 44381
Hello (timeout 5)
+ nxf_launch
+ /bin/bash -uex /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.sh
+ echo 'Hello (timeout 5)'
+ sleep 5
+ exit 0
+ nxf_await 44379 1
+ local pid=44379
+ local max=1
+ timeout 1 tail --pid=44379 -f /dev/null
And again the ps f output:
PID TTY STAT TIME COMMAND
44243 pts/7 Ss 0:00 -bash
44371 pts/7 T 1:12 \_ bash -x command-test2.run
44379 pts/7 Z 0:00 | \_ [tee] <defunct>
44380 pts/7 Z 0:00 | \_ [tee] <defunct>
44384 pts/7 Z 0:00 | \_ [timeout] <defunct>
44386 pts/7 R+ 0:00 \_ ps f
I tried running command-test1.run with the temp pipes directed to /tmp just in case, but the result was the same as with /dev/shm.
Therefore both of them still hang, right?
Sadly yes.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, I came here from the Internet and I think I have a similar issue. My processes hang indefinitely, but only some of them, and only shovill, which uses tee. Was any solution found?
nextflow version 20.07.1.5412
I have a nextflow pipeline that works when run on my work servers directly, but I am trying to get it working with qsub as that is the preferred way of doing things here. I have been having problems where the initial processes described in the pipeline complete, with all the expected files being generated, but then it doesn't progress to subsequent processes. In order to try and figure out what is going on I made a simple pipeline:
process test {
    """
    echo "Hello world" > out.test
    """
}
my nextflow.config file has the following to say about sge:
process {
    executor = 'sge'
    queue = 'bigmem.q'
    clusterOptions = '-S /bin/bash'
}
When I run this, I instantly get the output file being generated, but no matter how long I leave it, the process will never end. When I run it directly on my server it all finishes as expected. So the reason why the later steps don't start in my real pipeline is that it can't tell that the initial steps have finished.
I started testing this by using qsub directly with the .command.run file. I found that the problem seemed to be around the wait $pid || nxf_main_ret=$? step within the nxf_main function, so steps like nxf_unstage never occur. As the nxf_launch process never seems to "complete", no .exitcode file is generated by on_exit. I have attached the folder generated by nextflow in case that helps.
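For readers unfamiliar with the launcher, here is a rough sketch of the on_exit handler referred to here, based on the "trap on_exit EXIT" lines visible in the traces above (illustrative, not the verbatim script; task_dir stands for the task work directory):

# on_exit fires when nxf_main returns and records the task's exit status
# in .exitcode, the file Nextflow polls to detect task completion; if
# nxf_main never returns, the trap never fires and no .exitcode appears
on_exit() {
    local exit_status=${nxf_main_ret:=$?}
    printf $exit_status > $task_dir/.exitcode
    exit $exit_status
}
trap on_exit EXIT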
nextflow_test.tar.gz
Many thanks in advance for any help you can give!