Closed: TomHarrop closed this issue 2 years ago.
@TomHarrop Thanks for the thorough bug report. I haven't observed this behavior before, and I agree it is puzzling. As you demonstrated, the call to sacct clearly returns PENDING, so status-sacct.sh should be functioning.
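For reference, that hand-run check is essentially the following; the job id is illustrative, and the sacct flags mirror the ones used later in this thread:

$ sacct -j 1234567 --format State --noheader | head -n 1 | awk '{print $1}'
PENDING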
Some questions to help generate ideas for further troubleshooting:

- Did the rule porechop run successfully in the past and only start failing recently? Or is this a new rule?

Some tests to try:

- Comment out the cluster-status field in config.yaml and re-submit. Snakemake alone should be able to handle jobs with status PENDING. The purpose of the cluster status script is to catch rarer statuses like timeouts.
- Simplify the rule to something trivial (e.g. touch {output}).

Thanks for the quick reply.
I commented out cluster-status: status-sacct.sh and now the problem is gone. I think I can live with that. I do get timeouts on this cluster, but now I will know to expect snakemake to do its "missing output" thing.
Just in case it's still useful:
Snakemake is version 6.15.5, freshly installed in a venv via pip install. I assume the reason for the older version of snakemake is that the cluster (RHEL 7.9) has python 3.6.8.
shell version:
$ /usr/bin/env bash --version
GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
I'm running the porechop rule on >5000 different files. It failed for around 10 of them. They completed successfully on re-runs, so now I'm wondering if they actually failed, or if snakemake thought it failed and deleted the output.
I commented out cluster-status: status-sacct.sh and now the problem is gone
Glad it's working now!
I'm running the porechop rule on >5000 different files. It failed for around 10 of them. They completed successfully on re-runs, so now I'm wondering if they actually failed, or if snakemake thought it failed and deleted the output.
I'm still confused about what caused the problem. I don't see how status-sacct.sh could be the issue. If you have time to continue troubleshooting, I'd be curious to know:

- Could you re-enable status-sacct.sh and try again? Would it work now that this anomalous situation has resolved itself?
- What happens with a job that sits in PENDING due to large resource requests?

If possible, I'd like to make status-sacct.sh more robust, but it looks to be working fine. The problem seems to be more in the communication between Snakemake itself and status-sacct.sh. In other words, why does Snakemake think the job has failed when the script returned PENDING?
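For context on that communication: Snakemake invokes the cluster-status script with the Slurm job id as its only argument and reads a single word (success, running, or failed) from stdout. So the exchange being debugged here is essentially this (job id illustrative):

$ ./status-sacct.sh 1234567
running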
Thanks, I'll re-enable status-sacct.sh and try again. The queue on the cluster is long right now, so it will be a few days before the workflow finishes and I can submit it again.

On my current run (without status-sacct.sh) the same jobs sat in the queue for hours before they ran, which suggests Snakemake handles the PENDING status OK.

I haven't tried the large resource request yet (because of the long queue), but I will, to see if PENDING is the issue. The porechop jobs were only requesting 1 core for 10 minutes, but they were still waiting in the queue because it's so busy.
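For that test, I assume I can hold a job in PENDING by over-requesting resources, something like this (flags and values illustrative):

$ sbatch --cpus-per-task=64 --time=2-00:00:00 --wrap 'sleep 60'

The request just needs to be big enough that the busy cluster can't grant it quickly.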
BTW porechop wasn't the only rule that I saw this with, so I don't think it's necessarily a problem with that rule.
Closing this. @TomHarrop, please feel free to follow up with more details if you learn more.
Thanks @jdblischak. I tried to reproduce it but it didn't happen again. It doesn't help that it takes a couple of days to get my jobs through the queue, but I'm putting it down to cluster weirdness for now.
Hi,
Thank you for designing this profile! It works beautifully almost all the time. Recently, I encountered a situation similar to what Tom described here, so I added echo ${output} >> $HOME/smk.log to status-sacct.sh to capture what the script actually gets. It turned out that in my environment sacct sometimes returns an empty string (the below is part of smk.log):
Mon Aug 8 14:16:16 EDT 2022

Mon Aug 8 14:16:28 EDT 2022
PENDING
When this happens, snakemake submits another job (the 14:16:16 and 14:16:28 entries above are two separate jobs; note the empty status after the first timestamp). This is reproducible in my environment: every time snakemake treated my job as failed while it was still in the queue (visible with squeue), there was an empty entry in the log. But I couldn't figure out how to trigger it; I suspect it is something that only transiently appears while slurm allocates things.
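For reference, the tweak amounts to a couple of lines right after the script's sacct call; roughly (the sacct line here mirrors the one quoted later in this thread, and the date line is what produces the timestamps in the excerpt):

output=`sacct -j "$jobid" --format State --noheader | head -n 1 | awk '{print $1}'`
# log a timestamp plus whatever sacct returned (possibly an empty string)
date >> "$HOME/smk.log"
echo ${output} >> "$HOME/smk.log"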
Assuming it's transient, it should be benign to wait until the next status check. The modification below is sufficient to prevent snakemake from queuing another job, at least for now. I will follow up if this modification causes any unexpected side effects.
if [[ $output =~ ^(COMPLETED).* ]]
then
    echo success
### Wait for next check if the status is empty ###
elif [[ $output == "" ]]
then
    echo running
######################################
elif [[ $output =~ ^(RUNNING|PENDING|COMPLETING|CONFIGURING|SUSPENDED).* ]]
then
    echo running
else
    echo failed
fi
@chenyenchung Thanks for following up and proposing a workaround!
Assuming it's transient, it should be benign to wait until the next status check. The modification below is sufficient to prevent snakemake from queuing another job, at least for now. I will follow up if this modification causes any unexpected side effects.
Please report back after you've used this to submit your pipelines at least a few times.
Note that the official Slurm profile purposefully runs sacct for multiple attempts. To make this script more robust, we could do something like the following:
function get_status(){
    sacct -j "$1" --format State --noheader | head -n 1 | awk '{print $1}'
}

for i in {1..3}
do
    output=`get_status "$jobid"`
    if [[ ! -z $output ]]
    then
        break
    fi
done

if [[ -z $output ]]
then
    echo sacct failed to return the status for jobid "$jobid" >&2
    echo Maybe you need to use scontrol instead? >&2
    exit 1
fi
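Putting that retry loop together with the status mapping from the earlier fragment, a full script could look like the following sketch; the sleep between attempts is an extra assumption on my part, not necessarily what the published script does:

#!/usr/bin/env bash
# Sketch: retry sacct a few times before classifying the job.

jobid="$1"  # Snakemake passes the Slurm job id as the first argument

function get_status(){
    sacct -j "$1" --format State --noheader | head -n 1 | awk '{print $1}'
}

for i in {1..3}
do
    output=`get_status "$jobid"`
    if [[ ! -z $output ]]
    then
        break
    fi
    sleep 5  # assumption: give Slurm a moment before the next attempt
done

if [[ -z $output ]]
then
    echo sacct failed to return the status for jobid "$jobid" >&2
    echo Maybe you need to use scontrol instead? >&2
    exit 1
fi

if [[ $output =~ ^(COMPLETED).* ]]
then
    echo success
elif [[ $output =~ ^(RUNNING|PENDING|COMPLETING|CONFIGURING|SUSPENDED).* ]]
then
    echo running
else
    echo failed
fi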
@chenyenchung I created a new status script, status-sacct-robust.sh. Could you please try it out on your setup to test whether it is more robust?
@jdblischak Thank you for the timely update! I like that your approach avoids the status check getting stuck in the void if sacct always returns a null status. I am switching to status-sacct-robust.sh now and will come back later to report how it works out.
Regarding my naive workaround that just treats a null status as running: I happen to be running some manually parallelized jobs these couple of days (~1k jobs consisting of 3 different types of job: an R script that runs Stan, a simple shell script that uses no tools other than awk, and one that uses samtools and bcftools). Among the 1k jobs spanning the past 24 hrs, there were 3 occasions where status-sacct.sh received a null string, and they were pretty close in time (within an hour) on the same job (the Stan script). However, it doesn't seem to be intrinsic to the script, as the same script ran without issue later, so it still seems to me like the scheduler doing weird random things.

None of the above-mentioned jobs got stuck with the workaround, but since there were only 3 occasions of a null status, this is not really representative. Anyway, I'll keep a log and see if the current version encounters anything.
Update (2022-08-15): The robust version of the script has been working perfectly smoothly for me so far. Thanks again!
Just posting in case anyone else looks at this issue for help. I was still having the problem, very occasionally, even with the status-sacct-robust.sh script.

Reducing the values of jobs, max-jobs-per-second, and max-status-checks-per-second fixed it for me. I think it's just caused by my (busy) cluster not being able to keep up with sacct checks.
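For anyone tuning the same thing, these settings live in the profile's config.yaml and correspond to Snakemake's command-line flags; an illustrative invocation (the values here are placeholders, not recommendations):

$ snakemake --jobs 100 \
    --max-jobs-per-second 1 \
    --max-status-checks-per-second 1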
For the record, even with the lowest value of max-status-checks-per-second I still get the issue on a busy cluster. I'm now skipping the check as suggested by @chenyenchung. It doesn't seem to cause any problems.
if [[ -z $output ]]
then
    echo sacct failed to return the status for jobid "$jobid" >&2
    echo Ignoring this check >&2
    echo running
    exit 0
fi
I've been using this a bit, e.g. in a recent workflow with 918 jobs I saw 32 failed checks but it didn't stop Snakemake finishing.
I've been using this a bit, e.g. in a recent workflow with 918 jobs I saw 32 failed checks but it didn't stop Snakemake finishing.
As long as the jobs continue to complete, I agree that this approach is a good idea. To summarize for myself and others: update the status script to execute echo running in the rare case that sacct fails to return the status.
Hi,
Thanks for this wrapper, it's been very useful for me.
I'm having a slight issue with jobs that fail. When a job fails and I try to re-run the whole workflow, snakemake re-submits the job but then seems to think it failed immediately, even though it's sitting in the SLURM queue.
When I run the sacct command from status-sacct.sh manually, I get PENDING, but Snakemake says the job failed. It's really just sitting in the SLURM queue. There is no log output yet, so I can't troubleshoot the job itself.
I've tried removing the .snakemake directory, as well as the logs directory. I'm a bit stumped, so I'd appreciate any troubleshooting suggestions.
Thanks again for this useful wrapper!