jdblischak / smk-simple-slurm

A simple Snakemake profile for Slurm without --cluster-config
Creative Commons Zero v1.0 Universal

Snakemake thinks the job failed while it waits in the queue #9

Closed: TomHarrop closed this issue 2 years ago

TomHarrop commented 2 years ago

Hi,

Thanks for this wrapper, it's been very useful for me.

I'm having a slight issue with jobs that fail. When a job fails and I try to re-run the whole workflow, snakemake re-submits the job, but then it seems to think the job failed immediately, even though it's still sitting in the SLURM queue.

When I run the sacct command from status-sacct.sh manually, I get PENDING, e.g.

$ sacct -j 35886319 --format State --noheader | head -n 1 | awk '{print $1}'
PENDING

but Snakemake says

Error executing rule porechop on cluster (jobid: 280, external: 35886319, jobscript: /path/to/.snakemake/tmp.rse0_q4j/snakejob.porechop.280.sh). For error details see the cluster log and the log files of the involved rule(s).
Job failed, going on with independent jobs.

It's really just sitting in the SLURM queue:

          35886319  physical smk-pore  tharrop PD   0:00      1 (Priority)

There is no log output yet so I can't troubleshoot the job itself.

I've tried removing the .snakemake directory, as well as the logs directory.

I'm a bit stumped so I'd appreciate any troubleshooting suggestions.

Thanks again for this useful wrapper!

jdblischak commented 2 years ago

@TomHarrop Thanks for the thorough bug report. I haven't observed this behavior before, and I agree it is puzzling. As you demonstrated, the call to sacct clearly returns PENDING, so status-sacct.sh should be functioning.

Some questions to help generate ideas for further troubleshooting:

  1. What version of snakemake are you using? Did you recently upgrade?
  2. Did the rule porechop run successfully in the past, and just start failing recently? Or is this a new rule?
  3. Did anything about your setup change recently? Did you switch HPC clusters? Switch your shell? etc.

Some tests to try:

  1. Comment out the cluster-status field in config.yaml and re-submit. Snakemake alone should be able to handle jobs with status PENDING. The purpose of the cluster status script is to catch rarer statuses like timeouts.
  2. Create a fake rule that requests a lot of resources, and thus will have status PENDING, but that you know will run quickly and successfully (e.g. touch {output}). See the sketch below for a way to try this by hand, outside of Snakemake.
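
For the second test you don't even need Snakemake in the loop: submit a throwaway job by hand that is guaranteed to sit in the queue, then point the status script at it directly. A rough sketch, assuming status-sacct.sh is in your working directory, takes the job ID as its first argument, and that the memory request below is larger than any node on your cluster:

# submit a trivial job with an absurd memory request so it stays PENDING
jobid=$(sbatch --parsable --mem=10000G --wrap "touch test-output.txt")

# ask the status script what it thinks; it should print "running" while the job is PENDING
bash status-sacct.sh "$jobid"

# clean up once you are done
scancel "$jobid"
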
TomHarrop commented 2 years ago

Thanks for the quick reply.

I commented out cluster-status: status-sacct.sh and now the problem is gone. I think I can live with that. I do get timeouts on this cluster but now I will know to expect snakemake to do its "missing output" thing.

Just in case it's still useful:

Snakemake is version 6.15.5, freshly installed in a venv via pip install. I assume the reason for the older version of snakemake is that the cluster (RHEL 7.9) has python 3.6.8.

shell version:

$ /usr/bin/env bash --version
GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)

I'm running the porechop rule on >5000 different files. It failed for around 10 of them. They completed successfully on re-runs, so now I'm wondering if they actually failed, or if snakemake thought they failed and deleted the outputs.

jdblischak commented 2 years ago

I commented out cluster-status: status-sacct.sh and now the problem is gone

Glad it's working now!

I'm running the porechop rule on >5000 different files. It failed for around 10 of them. They completed successfully on re-runs, so now I'm wondering if they actually failed, or if snakemake thought they failed and deleted the outputs.

I'm still confused what caused the problem. I don't see how status-sacct.sh could be the issue. If you have time to continue troubleshooting, I'd be curious to know:

  1. What happens if you start using status-sacct.sh again? Would it work again now that this anomalous situation has been resolved?
  2. Did you try my idea of submitting a test rule that is assigned PENDING due to large resource requests?

If possible, I'd like to make status-sacct.sh more robust, but it looks to be working fine. The problem seems to be more in the communication between Snakemake itself and status-sacct.sh. In other words, why does Snakemake think the job has failed when the script returned PENDING?

TomHarrop commented 2 years ago

Thanks, I'll re-enable status-sacct.sh and try again. The queue on the cluster is long right now, so it will be a few days before the workflow finishes and I can submit it again.

On my current run (without status-sacct.sh) the same jobs sat in the queue for hours before they ran, which suggests Snakemake handles the PENDING status OK.

I haven't tried the large resource request yet (because of the long queue) but I will, to see if PENDING is the issue. Porechop jobs were only requesting 1 core for 10 minutes but they were still waiting in the queue because it's so busy.

BTW porechop wasn't the only rule that I saw this with, so I don't think it's necessarily a problem with that rule.

jdblischak commented 2 years ago

Closing this. @TomHarrop please feel free to follow up with more details if you learn anything more.

TomHarrop commented 2 years ago

Thanks @jdblischak. I tried to reproduce it but it didn't happen again. It doesn't help that it takes a couple of days to get my jobs through the queue, but I'm putting it down to cluster weirdness for now.

chenyenchung commented 1 year ago

Hi,

Thank you for designing this profile! It works beautifully almost all the time. Recently, I encountered a situation similar to the one Tom described here, so I added echo ${output} >> $HOME/smk.log to status-sacct.sh to capture what the script actually gets. It turned out that in my environment sacct sometimes returns an empty string (the excerpt below is from part of smk.log):

Mon Aug  8 14:16:16 EDT 2022

Mon Aug  8 14:16:28 EDT 2022
PENDING

When this happens, snakemake submits another job (the 14:16:16 and 14:16:28 entries come from two separate jobs). This is reproducible in my environment: every time snakemake treated a job as failed while it was still in the queue (visible with squeue), there was an empty entry in the log, but I couldn't figure out how to trigger it. I suspect it is something that only appears transiently while slurm allocates things.
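
For reference, this is roughly how I added the logging (a sketch; the log path is just where I chose to write it, and I assume the script stores the sacct result in $output, as the snippets in this thread suggest):

jobid="$1"
output=`sacct -j "$jobid" --format State --noheader | head -n 1 | awk '{print $1}'`

# record the time and whatever sacct returned, so empty results show up in the log
date >> "$HOME/smk.log"
echo "${output}" >> "$HOME/smk.log"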

Assuming it's transient, and thus benign to wait until the next status check, the modification below is sufficient to prevent snakemake from queuing another job, at least for now. I will follow up if this modification causes any unexpected side effects.

if [[ $output =~ ^(COMPLETED).* ]]
then
  echo success
### Wait for next check if the status is empty ###
elif [[ $output == "" ]]
then
  echo running
######################################
elif [[ $output =~ ^(RUNNING|PENDING|COMPLETING|CONFIGURING|SUSPENDED).* ]]
then
  echo running
else
  echo failed
fi
jdblischak commented 1 year ago

@chenyenchung Thanks for following up and proposing a workaround!

Assuming it's transient, and thus benign to wait until the next status check, the modification below is sufficient to prevent snakemake from queuing another job, at least for now. I will follow up if this modification causes any unexpected side effects.

Please report back after you've used this to submit your pipelines at least a few times.

Note that the official Slurm profile purposefully runs sacct for multiple attempts. To make this script more robust, we could do something like the following:

function get_status(){
  sacct -j "$1" --format State --noheader | head -n 1 | awk '{print $1}'
}

for i in {1..3}
do
  output=`get_status "$jobid"`
  if [[ ! -z $output ]]
  then
    break
  fi
done

if [[ -z $output ]]
then
  echo sacct failed to return the status for jobid "$jobid" >&2
  echo Maybe you need to use scontrol instead? >&2
  exit 1
fi
jdblischak commented 1 year ago

@chenyenchung I created a new status script status-sacct-robust.sh. Could you please try it out on your setup to test if it is more robust?

chenyenchung commented 1 year ago

@jdblischak Thank you for the timely update! I like that your approach keeps the status check from getting stuck in the void if sacct always returns a null status. I am switching to status-sacct-robust.sh now and will report back later on how it works out.

Regarding my naive workaround that just treats a null status as running: I happen to have been running some manually parallelized jobs over the past couple of days (~1k jobs consisting of 3 different types -- an R script that runs Stan, a simple shell script that uses no tools other than awk, and one that uses samtools and bcftools). Among those ~1k jobs spanning the past 24 hrs, there were 3 occasions where status-sacct.sh received a null string, and they were close together in time (within an hour) and on the same job type (the Stan script). However, it doesn't seem to be intrinsic to that script, as the same script ran without issue later, so it still looks to me like the scheduler doing weird random things.

None of the above-mentioned jobs got stuck with the workaround, but since there were only 3 occasions of a null status, this is not really representative. Anyway, I'll keep a log and see if the current version encounters anything.

Update (2022-08-15): The robust version of the script has been working perfectly smoothly for me so far. Thanks again!

TomHarrop commented 1 year ago

Just posting in case anyone else looks at this issue for help. I was still having the problem, very occasionally, even with the status-sacct-robust.sh script.

Reducing the values of jobs, max-jobs-per-second, and max-status-checks-per-second fixed it for me. I think it's just caused by my (busy) cluster not being able to keep up with the sacct checks.
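
For anyone who wants to try the same thing: these are ordinary Snakemake options, so they can go in the profile's config.yaml or on the command line. A sketch of the kind of invocation I mean (the profile directory name and the particular values are only examples, not the exact ones I used):

snakemake \
    --profile simple/ \
    --jobs 50 \
    --max-jobs-per-second 1 \
    --max-status-checks-per-second 1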

TomHarrop commented 1 year ago

For the record, even with the lowest max-status-checks-per-second I still get the issue on a busy cluster. I'm now skipping the check as suggested by @chenyenchung, and it doesn't seem to cause any problems.

if [[ -z $output ]]
then
  echo sacct failed to return the status for jobid "$jobid" >&2
  echo Ignoring this check >&2
  echo running
  exit 0
fi

I've been using this a bit, e.g. in a recent workflow with 918 jobs I saw 32 failed checks, but it didn't stop Snakemake from finishing.

jdblischak commented 1 year ago

I've been using this a bit, e.g. in a recent workflow with 918 jobs I saw 32 failed checks, but it didn't stop Snakemake from finishing.

As long as the jobs continue to complete, I agree that this approach is a good idea. To summarize for myself and others: update the status script to execute echo running in the rare case that sacct fails to return the status.
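
Putting the pieces together, the end result would be a status script along these lines (just a sketch combining the retry loop from earlier with Tom's echo running fallback; it is not necessarily identical to status-sacct-robust.sh):

#!/usr/bin/env bash
# Sketch: check a Slurm job's status for Snakemake, retrying sacct and
# treating a persistently empty answer as "still running".
jobid="$1"

function get_status(){
  sacct -j "$1" --format State --noheader | head -n 1 | awk '{print $1}'
}

# sacct occasionally returns an empty string, so try a few times
for i in {1..3}
do
  output=$(get_status "$jobid")
  if [[ ! -z $output ]]
  then
    break
  fi
done

# if sacct never answered, assume the job is still running instead of failing it
if [[ -z $output ]]
then
  echo "sacct returned no status for jobid $jobid; ignoring this check" >&2
  echo running
  exit 0
fi

if [[ $output =~ ^(COMPLETED).* ]]
then
  echo success
elif [[ $output =~ ^(RUNNING|PENDING|COMPLETING|CONFIGURING|SUSPENDED).* ]]
then
  echo running
else
  echo failed
fi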