Is anyone able to help with this? I am still unable to run nextflow using SGE and I can't figure out why it won't complete.
Try the following: add set -x at the beginning of .command.run, then submit the job with qsub and paste the console output here.
I have been playing around with the output of this, and also with the output when running the command locally, and from what I can see the qsubbed version gets stuck waiting for nxf_launch to complete, while locally this happens almost instantly. I then started looking into singularity exec with my container image and found that when I run it locally it doesn't seem to work: for example, if I run singularity exec image ls, the ls command is never executed at all.
Now, when I run it on the server it is a different story. Singularity begins executing, and converts the image file to a sandbox for running the script, outputting:
INFO: Convert SIF file to sandbox...
It then executes the ls command. After this it moves onto the next step:
INFO: Cleaning up image...
At this point, no matter how long I leave it, it just keeps on running this step. Now I had a look at the processes running with ps -ef. During the first conversion step I see the set of processes running to make singularity go:
-bash (PID: 13728, PPID: 13727)
singularity exec nanocompore_pipeline.img ls (PID: 13817, PPID: 13728)
/usr/bin/unsquashfs -user-xattrs -f -d /tmp/rootfs-900540057 /tmp/archive-009741604 (PID: 13833, PPID: 13817)
After the ls command has executed and it is stuck in the "Cleaning up image" step:
-bash (PID: 13728, PPID: 13727)
[starter] <defunct>
This defunct process won't go away until I kill the parent, so it seems the hang is being caused at this stage. Running singularity in debug mode, I get a "Child exited with exit status 0" message after the "Cleaning up image..." message, but then nothing else, so I am going to try to figure out what is going on there.
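For reference, the debug run mentioned above can be reproduced with Singularity's global --debug flag (assuming a reasonably recent Singularity release):

# Highest-verbosity logging; prints child lifecycle messages such as the
# "Child exited with exit status 0" line described above
singularity --debug exec nanocompore_pipeline.img ls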
This is tricky to debug. If you still have the task files, replace the content of .command.sh with
echo "=== shell: $SHELL"
echo "=== bash : $(bash --version)"
Then submit the job again using this command:
qsub -v NXF_DEBUG=1 .command.run
Then copy here the files .command.log, .command.out, .command.err and the console stdout.
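As an aside, the "+ NXF_DEBUG=0" and "+ [[ 0 > 1 ]]" lines in the traces further down suggest the launcher guards its own tracing roughly like this (a reconstruction from those traces, not the verbatim script):

# Default NXF_DEBUG to 0 and enable shell tracing only above level 1
NXF_DEBUG=${NXF_DEBUG:=0}
[[ $NXF_DEBUG > 1 ]] && set -x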
I do indeed have the task files, so I have just tried it and attached the files created: command.err.txt, command.log.txt, command.out.txt
As for the singularity issue that I thought was the cause: I don't actually think it is. I was attempting to debug it before Christmas, but by the time I picked it up again after Christmas there were no longer any problems and singularity could complete its job fine. However, the issue I'm having when running nextflow persists.
I do believe it may be related to the wait stage in nxf_main, as I mentioned in my original post, because I see this with ps xf.
If the tee commands have become zombies, then I guess the wait $tee1 $tee2 won't ever complete? This would then prevent nxf_unstage from running and the job from completing.
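For context, here is a stripped-down sketch of the pattern .command.run uses at this point, pieced together from the set -x traces later in this thread (variable names match the traces; the surrounding details are illustrative):

# Two named pipes receive the task's stdout and stderr
ctmp=$(mktemp -d /dev/shm/nxf.XXXXXXXXXX)
cout=$ctmp/.command.out; mkfifo $cout
cerr=$ctmp/.command.err; mkfifo $cerr

# A background tee on each pipe copies the stream into the work directory
tee .command.out < $cout &
tee1=$!
tee .command.err < $cerr >&2 &
tee2=$!

# The task writes into the pipes; the launcher waits for it to finish
/bin/bash -uex .command.sh > $cout 2> $cerr &
pid=$!
wait $pid

# ...and then waits for the tees; this is the wait that never returns
# in the hanging runs, where the tee processes are left as zombies
wait $tee1 $tee2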
I don't think the problem is the tee zombies; instead, it is likely related to the nxf_trace process in .command.run.
We need to isolate the issue with a minimal test case. Please try to replicate the problem using this simple pipeline, which just runs a 10-second sleep. Try the following commands using the local executor:
nextflow run https://github.com/pditommaso/nf-sleep
nextflow run https://github.com/pditommaso/nf-sleep -with-trace
nextflow run https://github.com/pditommaso/nf-sleep -with-singularity
nextflow run https://github.com/pditommaso/nf-sleep -with-singularity -with-trace
Which ones run and which fail?
I should start by explaining that we have two distinct sets of servers at work. In my OP I mentioned that when I run the pipeline directly on the server, it works with no problems. The server I was referring to is one we are only supposed to run small jobs on, so although the pipeline works there, I can't really use it. If I run any of the tasks above on this server, they complete successfully, as expected.
On the other set of servers we can also run jobs directly, as well as run via qsub. Running using the local executor here I have the same problem as when I run using the sge executor. Regardless of which test case I run, they all just keep running.
Ok, this means we can focus on the local executor only and it's not a matter of the SGE cluster. Can you confirm that the problem arises with all four of the above command lines?
I just checked again and the problem indeed arises with all four of the commands. I wondered whether it was something to do with the number of CPUs being requested, since in my original config file I was requesting three, but for good measure I changed it to request only one and the problem persisted.
This requires some patience to troubleshoot. Please copy and paste the output of these commands run on the offending computer:
$SHELL --version
and
date +%s%3N
Then run
NXF_DEBUG=2 nextflow run https://github.com/pditommaso/nf-sleep --timeout 5
When it hangs after 5 seconds, kill it and tar the working directory of the sole task executed and attach it here. Thanks.
The output of $SHELL --version:
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
The output of date +%s%3N:
1610155138388
And here is the working directory that I ran the nextflow task in:
Thank you so much for all of your assistance so far!
It looks like it hangs when waiting for the tee to terminate.
I'm starting to think it's some kind of bug in the Bash version you are using, or something related. A couple of things:
1) Could you also please verify the exact version of Bash on the system which is working correctly? Use the command
$SHELL --version
2) Could you edit the .command.run that hangs, replacing the line at the bottom
wait $tee1 $tee2
with
tail --pid=$tee1 -f /dev/null
tail --pid=$tee2 -f /dev/null
Then execute it again, using the command
bash -x .command.run
If it hangs again, put the process in the background (CTRL+Z) and run ps to check the current process status (please copy here both the .command.run stdout and the ps stdout).
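For what it's worth, this tail-based replacement relies on GNU coreutils tail: given --pid together with -f, it exits as soon as the watched process dies, and unlike the wait builtin it does not require the process to be a child of the current shell. A quick standalone check:

# tail -f on /dev/null produces no output; --pid makes it return once
# the watched process (a background sleep here) has exited
sleep 3 &
tail --pid=$! -f /dev/null
echo "sleep has exited"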
The version of bash seems to be identical on the server where it works fine:
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
I changed the run file and it did hang again:
bash -x .command.run
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ [[ -n '' ]]
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
+ true
+ set +e
++ set +u
++ nxf_mktemp /dev/shm
+ local ctmp=/dev/shm/nxf.QF8fpSIPte
+ local cout=/dev/shm/nxf.QF8fpSIPte/.command.out
+ mkfifo /dev/shm/nxf.QF8fpSIPte/.command.out
+ local cerr=/dev/shm/nxf.QF8fpSIPte/.command.err
+ mkfifo /dev/shm/nxf.QF8fpSIPte/.command.err
+ tee1=42184
+ tee .command.out
+ tee2=42185
+ tee .command.err
+ pid=42186
+ wait 42186
Hello (timeout 5)
+ nxf_launch
+ /bin/bash -uex /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.sh
+ echo 'Hello (timeout 5)'
+ sleep 5
+ exit 0
+ tail --pid=42184 -f /dev/null
Here is the output from ps:
PID TTY TIME CMD
42070 pts/0 00:00:00 bash
42176 pts/0 00:02:03 bash
42184 pts/0 00:00:00 tee <defunct>
42185 pts/0 00:00:00 tee <defunct>
42189 pts/0 00:00:00 tail
42190 pts/0 00:00:00 ps
Anyhow, you are right: the problem is the tee that remains in zombie status. I think I'm going to remove that wait. However, to better identify the problem, could you please add, just before the tail that was added, the following:
kill -CHLD $$
sleep 0.1
Run it again with bash -x .command.run and paste the stdout here. Thanks a lot.
Here is the stdout:
bash -x .command.run
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ [[ -n '' ]]
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
+ true
+ set +e
++ set +u
++ nxf_mktemp /dev/shm
+ local ctmp=/dev/shm/nxf.yFKflwgW99
+ local cout=/dev/shm/nxf.yFKflwgW99/.command.out
+ mkfifo /dev/shm/nxf.yFKflwgW99/.command.out
+ local cerr=/dev/shm/nxf.yFKflwgW99/.command.err
+ mkfifo /dev/shm/nxf.yFKflwgW99/.command.err
+ tee1=47176
+ tee .command.out
+ tee2=47177
+ tee .command.err
+ pid=47178
+ wait 47178
+ nxf_launch
+ /bin/bash -uex /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.sh
+ echo 'Hello (timeout 5)'
+ sleep 5
Hello (timeout 5)
+ exit 0
+ kill -CHLD 47168
+ sleep 0.1
At this point it just hangs as usual.
Weird that the last statement is not the tail, as in the previous run.
In any case, I've attached below a possible patch. Please unzip it and copy the script into the /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/ directory. Then run it with bash -x command-patched.run and copy the stdout as before. If it hangs again, please also add the output of ps f.
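Judging from the trace below, the patch replaces the bare wait with a bounded, timeout-wrapped tail; roughly (reconstructed from the "+ nxf_await 36027 1" trace lines, not from the patch itself):

# Wait at most $max seconds for $pid to disappear instead of
# blocking indefinitely on the wait builtin
nxf_await() {
    local pid=$1
    local max=$2
    timeout $max tail --pid=$pid -f /dev/null
}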
So this is stdout for the patch:
bash -x command-patched.run
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ [[ -n '' ]]
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
+ true
+ set +e
++ set +u
++ nxf_mktemp /dev/shm
+ local ctmp=/dev/shm/nxf.441xmy43g5
+ local cout=/dev/shm/nxf.441xmy43g5/.command.out
+ mkfifo /dev/shm/nxf.441xmy43g5/.command.out
+ local cerr=/dev/shm/nxf.441xmy43g5/.command.err
+ mkfifo /dev/shm/nxf.441xmy43g5/.command.err
+ tee1=36027
+ tee .command.out
+ tee2=36028
+ tee .command.err
+ pid=36029
+ wait 36029
Hello (timeout 5)
+ nxf_launch
+ /bin/bash -uex /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.sh
+ echo 'Hello (timeout 5)'
+ sleep 5
+ exit 0
+ nxf_await 36027 1
+ local pid=36027
+ local max=1
+ timeout 1 tail --pid=36027 -f /dev/null
It hung, so here is the output of ps f:
PID TTY STAT TIME COMMAND
35923 pts/4 Ss 0:00 -bash
36019 pts/4 T 0:57 \_ bash -x command-patched.run
36032 pts/4 Z 0:00 | \_ [timeout] <defunct>
36036 pts/4 R+ 0:00 \_ ps f
This is surprising: now it's the timeout that remains in zombie status and hangs the script. I really don't understand what's happening here. Is there a sysadmin who can assist you?
I discussed it with them a long time back and they were trying to figure out the cause of it not working. At that time they thought it might have been a problem with
local ctmp=$(set +u; nxf_mktemp /dev/shm 2>/dev/null || nxf_mktemp $TMPDIR)
because /dev/shm may be too small, but considering .command.out and .command.err are both written to correctly in the following lines, I figured that wasn't the case.
I'll see if they have any more insight now, though!
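For the record, the /dev/shm size theory is easy to check directly, since /dev/shm is a tmpfs mount:

# Shows the size and current usage of the shared-memory filesystem
df -h /dev/shm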
I've uploaded two other scripts to try to understand what is hanging the execution:
command-test1.run removes the wait on the tee commands
command-test2.run uses /tmp instead of /dev/shm for creating the temp pipes
If you could run both with the usual command, i.e. bash -x <script>, and report the stdout and the ps f output for each of them, it would help.
Okay, so here is the stdout for command-test1.run:
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ [[ -n '' ]]
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
+ true
+ set +e
++ set +u
++ nxf_mktemp /dev/shm
+ local ctmp=/dev/shm/nxf.05Ds3UJ3qK
+ local cout=/dev/shm/nxf.05Ds3UJ3qK/.command.out
+ mkfifo /dev/shm/nxf.05Ds3UJ3qK/.command.out
+ local cerr=/dev/shm/nxf.05Ds3UJ3qK/.command.err
+ mkfifo /dev/shm/nxf.05Ds3UJ3qK/.command.err
+ tee1=44345
+ tee .command.out
+ tee2=44346
+ tee .command.err
+ pid=44347
+ wait 44347
+ nxf_launch
+ /bin/bash -uex /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.sh
+ echo 'Hello (timeout 5)'
+ sleep 5
Hello (timeout 5)
+ exit 0
+ nxf_unstage
+ true
+ [[ 0 != 0 ]]
And here is the ps f output:
PID TTY STAT TIME COMMAND
44243 pts/7 Ss 0:00 -bash
44337 pts/7 T 0:57 \_ bash -x command-test1.run
44345 pts/7 Z 0:00 | \_ [tee] <defunct>
44346 pts/7 Z 0:00 | \_ [tee] <defunct>
44350 pts/7 R+ 0:00 \_ ps f
Then for command-test2.run:
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ [[ -n '' ]]
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
+ true
+ set +e
++ set +u
++ nxf_mktemp /tmp
+ local ctmp=/tmp/nxf.iYPteQwn2D
+ local cout=/tmp/nxf.iYPteQwn2D/.command.out
+ mkfifo /tmp/nxf.iYPteQwn2D/.command.out
+ local cerr=/tmp/nxf.iYPteQwn2D/.command.err
+ mkfifo /tmp/nxf.iYPteQwn2D/.command.err
+ tee1=44379
+ tee .command.out
+ tee2=44380
+ tee .command.err
+ pid=44381
+ wait 44381
Hello (timeout 5)
+ nxf_launch
+ /bin/bash -uex /home/matthew/nf-test/work/f0/c7451410f9a021562a2a0af31f2596/.command.sh
+ echo 'Hello (timeout 5)'
+ sleep 5
+ exit 0
+ nxf_await 44379 1
+ local pid=44379
+ local max=1
+ timeout 1 tail --pid=44379 -f /dev/null
And again the ps f output:
PID TTY STAT TIME COMMAND
44243 pts/7 Ss 0:00 -bash
44371 pts/7 T 1:12 \_ bash -x command-test2.run
44379 pts/7 Z 0:00 | \_ [tee] <defunct>
44380 pts/7 Z 0:00 | \_ [tee] <defunct>
44384 pts/7 Z 0:00 | \_ [timeout] <defunct>
44386 pts/7 R+ 0:00 \_ ps f
I tried running command-test1.run with the temp pipes directed to /tmp just in case, but the result was the same as with /dev/shm.
Therefore both of them still hang, right?
Sadly yes.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, I came here from the Internet and I think I have a similar issue. My processes hang indefinitely, but only some of them, and only shovill, which uses tee. Was any solution found?
nextflow version 20.07.1.5412
I have a nextflow pipeline that works when run on my work servers directly, but I am trying to get it working with qsub as that is the preferred way of doing things here. I have been having problems where the initial processes described in the pipeline complete, with all the expected files being generated, but then it doesn't progress to subsequent processes. In order to try and figure out what is going on I made a simple pipeline:
process test {
    """
    echo "Hello world" > out.test
    """
}
my nextflow.config file has the following to say about sge:
process {
    executor = 'sge'
    queue = 'bigmem.q'
    clusterOptions = '-S /bin/bash'
}
When I run this, I instantly get the output file being generated, but no matter how long I leave it, the process will never end. When I run it directly on my server it all finishes as expected. So the reason why the later steps don't start in my real pipeline is that it can't tell that the initial steps have finished.
I started testing this by using qsub directly with the .command.run file. I found that the problem seemed to be around the wait $pid || nxf_main_ret=$? step within the nxf_main function, so steps like nxf_unstage never occur. As the nxf_launch process never seems to "complete", no .exitcode file is generated by on_exit. I have attached the folder generated by nextflow in case that helps.
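For readers unfamiliar with the launcher, here is a rough sketch of the on_exit handler referred to here, based on the "trap on_exit EXIT" lines visible in the traces above (illustrative, not the verbatim script; task_dir stands for the task work directory):

# on_exit fires when nxf_main returns and records the task's exit status
# in .exitcode, the file Nextflow polls to detect task completion; if
# nxf_main never returns, the trap never fires and no .exitcode appears
on_exit() {
    local exit_status=${nxf_main_ret:=$?}
    printf $exit_status > $task_dir/.exitcode
    exit $exit_status
}
trap on_exit EXIT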
nextflow_test.tar.gz
Many thanks in advance for any help you can give!