pcchen / scopion

Scopion cluster

sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying #6

Open ShaoFuLiu opened 2 years ago

ShaoFuLiu commented 2 years ago

I got an error when I tried to submit jobs. The following screenshot shows the command I typed and the output I got: [screenshot]

Then I found one probable cause of the error: [screenshot]

I checked the current number of running and pending jobs: together they total 9987, which is close to the default maximum of 10000 (MaxJobCount in slurm.conf). [screenshot]
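A minimal Python sketch of that check, assuming the standard Slurm client tools (squeue, scontrol) are on the PATH. It counts running and pending jobs, listing each job array element separately (as I read the slurm.conf documentation, every array task counts toward MaxJobCount), and compares the total with MaxJobCount from `scontrol show config`. Completed jobs still held in slurmctld's memory (see MinJobAge) should also count toward the limit, so treat this number as a lower bound. The function names are hypothetical.

```python
import re
import shutil
import subprocess

# Assumption: fall back to Slurm's default MaxJobCount if the live value can't be read.
DEFAULT_MAX_JOB_COUNT = 10000


def count_jobs(squeue_output: str) -> int:
    """Count non-empty lines of `squeue -h -r -o %i` output (one job ID per line)."""
    return sum(1 for line in squeue_output.splitlines() if line.strip())


def parse_max_job_count(scontrol_output: str) -> int:
    """Extract MaxJobCount from `scontrol show config` output."""
    match = re.search(r"^\s*MaxJobCount\s*=\s*(\d+)", scontrol_output, re.MULTILINE)
    return int(match.group(1)) if match else DEFAULT_MAX_JOB_COUNT


def headroom(n_jobs: int, max_jobs: int) -> int:
    """How many more jobs fit before reaching MaxJobCount (never negative)."""
    return max(max_jobs - n_jobs, 0)


def main() -> None:
    if shutil.which("squeue") is None or shutil.which("scontrol") is None:
        print("squeue/scontrol not found; run this where the Slurm client tools are installed")
        return
    # -h: no header, -r: one line per array element,
    # -t: only these job states, -o %i: print only the job ID.
    squeue = subprocess.run(
        ["squeue", "-h", "-r", "-t", "PENDING,RUNNING", "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    config = subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True, text=True, check=True,
    ).stdout
    n_jobs = count_jobs(squeue)
    max_jobs = parse_max_job_count(config)
    print(f"{n_jobs} running+pending jobs, MaxJobCount={max_jobs}, "
          f"headroom={headroom(n_jobs, max_jobs)}")


if __name__ == "__main__":
    main()
```

With the numbers from this issue, headroom(9987, 10000) is 13, so only a handful of submissions would succeed before sbatch starts retrying.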

So I think we should either set a higher maximum job count in slurm.conf or cancel some jobs. [screenshot]
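For the slurm.conf route, a minimal Python sketch of the edit itself: it sets MaxJobCount in the text of a slurm.conf file, replacing any existing uncommented setting or appending one if none exists. The function name is hypothetical and the new value is up to the administrators; a much larger limit means slurmctld may keep more job records in memory.

```python
import re

# Matches an uncommented "MaxJobCount=..." line. Parameter names in slurm.conf
# are case-insensitive, and [ \t]* (not \s*) keeps the match on a single line.
_MAX_JOB_RE = re.compile(r"^[ \t]*MaxJobCount[ \t]*=.*$", re.MULTILINE | re.IGNORECASE)


def set_max_job_count(conf_text: str, value: int) -> str:
    """Return slurm.conf text with MaxJobCount set to `value`.

    Existing uncommented MaxJobCount lines are replaced; if there are none,
    a new line is appended. Commented-out lines like '#MaxJobCount=10000'
    are left untouched.
    """
    new_line = f"MaxJobCount={value}"
    if _MAX_JOB_RE.search(conf_text):
        return _MAX_JOB_RE.sub(new_line, conf_text)
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + new_line + "\n"
```

For example, set_max_job_count(text, 50000) returns the updated file contents. Where slurm.conf lives depends on the installation, and whether the new limit takes effect with `scontrol reconfigure` or needs a slurmctld restart should be checked in the slurm.conf man page for the installed Slurm version.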

ShaoFuLiu commented 2 years ago

Here is the website I found: https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration

aronton commented 2 years ago

I just used scancel on some of the pending jobs, and the number of pending jobs dropped from 10000 to 8822. But it doesn't seem to have worked.
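For reference, scancel can restrict cancellation to one user's jobs in a given state with its --user and --state filters (and to a job name with --name). A minimal Python wrapper, with hypothetical function names, that builds and runs such a command for the current user's pending jobs. Cancelling pending jobs only frees slots temporarily if some other job keeps submitting new ones.

```python
import getpass
import shutil
import subprocess
from typing import List, Optional


def build_scancel_cmd(user: str, state: str = "PENDING",
                      name: Optional[str] = None) -> List[str]:
    """Build an scancel command that only targets `user`'s jobs in `state`,
    optionally narrowed to jobs with the given job name."""
    cmd = ["scancel", f"--user={user}", f"--state={state}"]
    if name is not None:
        cmd.append(f"--name={name}")
    return cmd


def cancel_pending(name: Optional[str] = None) -> None:
    """Cancel the current user's pending jobs (optionally only those named `name`)."""
    cmd = build_scancel_cmd(getpass.getuser(), "PENDING", name)
    if shutil.which("scancel") is None:
        print("scancel not found; would run:", " ".join(cmd))
        return
    subprocess.run(cmd, check=True)
```

Calling cancel_pending() cancels all of your own pending jobs, while running jobs and other users' jobs are left alone.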

ShaoFuLiu commented 2 years ago

We solved the problem. The cause was that aronton had submitted a job that kept submitting other jobs, so the total number of jobs stayed above the maximum of 10000 even after some were cancelled. After aronton cancelled that job, submission works again!
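A minimal sketch of one way to keep such a self-submitting job bounded, assuming the job script calls a small Python helper before it exits. The helper stops the chain after a fixed number of generations and refuses to submit while the user already has many jobs queued. The environment variable CHAIN_GEN and both limits are hypothetical values chosen for illustration. The sbatch options --parsable and --export=ALL,VAR=value are standard.

```python
import getpass
import os
import shutil
import subprocess
from typing import List

# Hypothetical names and limits, for illustration only.
GEN_VAR = "CHAIN_GEN"    # environment variable carrying the generation number
MAX_GENERATIONS = 20     # hard stop for the chain
MAX_OWN_QUEUED = 500     # don't resubmit while this many of our jobs are queued


def should_resubmit(generation: int, own_queued: int,
                    max_generations: int = MAX_GENERATIONS,
                    max_own_queued: int = MAX_OWN_QUEUED) -> bool:
    """Allow another submission only while both limits are respected."""
    return generation < max_generations and own_queued < max_own_queued


def next_sbatch_cmd(script: str, generation: int) -> List[str]:
    """Build an sbatch call that hands the next generation number to the child job."""
    # --parsable prints only the job ID; --export=ALL,VAR=value keeps the
    # current environment and sets VAR for the submitted job.
    return ["sbatch", "--parsable", f"--export=ALL,{GEN_VAR}={generation + 1}", script]


def own_queued_jobs() -> int:
    """Count this user's pending and running jobs (0 if squeue is unavailable)."""
    if shutil.which("squeue") is None:
        return 0
    out = subprocess.run(
        ["squeue", "-h", "-r", "-u", getpass.getuser(), "-t", "PENDING,RUNNING", "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())


def maybe_resubmit(script: str) -> None:
    """Submit the next link of the chain only if the limits allow it."""
    generation = int(os.environ.get(GEN_VAR, "0"))
    queued = own_queued_jobs()
    if not should_resubmit(generation, queued):
        print(f"stopping chain at generation {generation} ({queued} jobs queued)")
        return
    cmd = next_sbatch_cmd(script, generation)
    if shutil.which("sbatch") is None:
        print("sbatch not found; would run:", " ".join(cmd))
        return
    job_id = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
    print(f"submitted generation {generation + 1} as job {job_id}")
```

An alternative that avoids submitting from inside running jobs is to create the whole chain up front with `sbatch --dependency=afterok:<jobid>`, or as a throttled job array such as `--array=0-99%10`. Either way, the total number of jobs is fixed at submission time.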