meetU-MasterStudents / 2019---2020-partage

For exchanging material and doc

Job cancelled due to time limit on IFB cluster #22

Open florianecoulmance opened 4 years ago

florianecoulmance commented 4 years ago

Hello everyone,

I got my PSI-BLAST job cancelled due to the time limit, with the following error:

slurmstepd: error: JOB 3162130 ON cpu-node-7 CANCELLED AT 2019-11-08T15:32:30 DUE TO TIME LIMIT

Because I am running PSI-BLAST against UniRef50, it will take more than 10 days to run.

What is the time limit?

Can I change it with this, or do I have to specify it in order for it to work:

#SBATCH --time=20-24:00:00 # days-hh:mm:ss

Have a good weekend, Floriane

emorice commented 4 years ago

Hi, did you try to submit it to the long queue/partition with -p long? If I read this correctly:

[emorice@clust-slurm-client ~]$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
fast*          up 1-00:00:00      3    mix cpu-node-[9,13-14]
fast*          up 1-00:00:00     54   idle cpu-node-[6-8,10-12,15-62]
long           up 30-00:00:0      2    mix cpu-node-[13-14]
long           up 30-00:00:0     19   idle cpu-node-[10-12,15-30]
bigmem         up 60-00:00:0      1    mix cpu-node-69
training       up 30-00:00:0      5   idle cpu-node-[1-5]
maintenance    up 30-00:00:0     13  drain cpu-node-[70-74,76-83]
maintenance    up 30-00:00:0      1   idle cpu-node-75

The default partition is fast, with a default limit of 1 day, while long has a limit of 30 days (I am not familiar with Slurm and have not tested this yet; this is just what I understand).

Also, I believe the purpose of --time is to force a shorter time limit than the queue/partition default (i.e. one wants to run a job of unknown length but have it killed if it does not finish in, say, one hour); it does not allow a longer one.
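For illustration, here is a minimal sketch of a submission script using the long partition, with an explicit --time kept below the partition limit. The resource values, the module line, the database name and the input/output file names are assumptions to adapt, not a tested IFB recipe:

```bash
#!/bin/bash
#SBATCH --job-name=psiblast_uniref50
#SBATCH --partition=long        # 30-day limit instead of the 1-day default of "fast"
#SBATCH --time=10-00:00:00      # optional: kill the job after 10 days (must stay below the partition limit)
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# The module name is an assumption; check what is actually available on the IFB cluster
module load blast

# One long run against a pre-formatted UniRef50 BLAST database (database name assumed)
psiblast -query my_query.fasta -db uniref50 \
         -num_iterations 3 -num_threads "$SLURM_CPUS_PER_TASK" \
         -out my_query.psiblast.out
```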

florianecoulmance commented 4 years ago

It did not work, but I found a solution:

#SBATCH --partition=long

I put this line in the header of my script, so I guess now it is up to 30 days :)

Thank you,

Floriane

elolaine commented 4 years ago

10 days for a PSI-BLAST?! That sounds a bit crazy... Is it because you launch all the queries one after the other? Alternatively, you could launch N jobs in parallel for N queries...?

florianecoulmance commented 4 years ago

Can I run 400 jobs at the same time on the cluster? One query against UniRef50 takes 40 minutes to run with PSI-BLAST.

elolaine commented 4 years ago

Well, the 400 jobs are probably not all going to run at the same time... But you can submit your 400 jobs independently to the cluster queue, and at least some of them will run in parallel. The whole point of using the cluster is to be able to run jobs on several CPUs at the same time! (I'll ask the IFB support team if there's a cleverer way to submit the 400 queries.)
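As a sketch of the "one job per query" idea (the queries/ directory, file naming, and the submit_one.sh script are assumptions; submit_one.sh would be a single-query version of the PSI-BLAST script, taking the query file as its first argument):

```bash
# Submit one independent job per query file; Slurm then schedules
# as many of them in parallel as free CPUs allow.
for query in queries/*.fasta; do
    sbatch submit_one.sh "$query"
done
```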

florianecoulmance commented 4 years ago

Great, thank you!

I found some advice on the internet, but let me know what the IFB support team advises; I do not want to break the cluster...

elolaine commented 4 years ago

Ok, so, in case you have any question regarding the usage of the cluster, you can post it here: https://community.cluster.france-bioinformatique.fr.

For this particular problem, you should try and use Slurm's job array mode. You can find the full documentation about it here: https://slurm.schedmd.com/job_array.html.

Here is an example of a job array launching 30 FastQC runs on 30 different sequences: https://ifb-elixirfr.gitlab.io/cluster/trainings/slurm/ebai2019.html#56. This should be pretty similar to what you want to do.

When sbatch sees the --array option, it launches the job as many times as there are values in the indicated range (for instance from 0 to 400). In each job, a variable holding the list of files to be analyzed is loaded; the processing is then launched on one of those files, using the environment variable $SLURM_ARRAY_TASK_ID, which takes as its value the index of the current task (0, 1, 2, 3, and so on up to 400).
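For this particular case, a hedged sketch of such a job array script could look like the following (the queries/ and results/ directories, the 0-399 index range, the module line and the database name are assumptions to adapt to the real data):

```bash
#!/bin/bash
#SBATCH --job-name=psiblast_array
#SBATCH --partition=fast        # each task runs one ~40 min query, so the 1-day limit of "fast" is enough
#SBATCH --array=0-399           # 400 tasks, one per query file
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G

# The module name is an assumption; check what is actually available on the IFB cluster
module load blast

# Build the list of query files and pick the one matching this task's index
QUERIES=(queries/*.fasta)
QUERY=${QUERIES[$SLURM_ARRAY_TASK_ID]}

mkdir -p results
psiblast -query "$QUERY" -db uniref50 -num_iterations 3 \
         -out "results/$(basename "$QUERY" .fasta).psiblast.out"
```

As a gentler variant, --array=0-399%50 would also cap the number of tasks running at the same time to 50, which avoids monopolising the queue.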