giovtorres / docker-centos7-slurm

Slurm Docker Container on CentOS 7
MIT License

sacct not producing output due to missing mysql tables #3

Closed percyfal closed 6 years ago

percyfal commented 6 years ago

I'm using the slurm container for various tests and would like to monitor the status of jobs using the sacct command. I fire up the container:

docker run -it -h ernie giovtorres/docker-centos7-slurm:latest

and submit a simple job:

[root@ernie /]# sbatch --wrap "sleep 60"

Submitted batch job 2

[root@ernie /]# squeue -l               
Fri Dec  8 09:41:47 2017
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
                 2    normal     wrap     root  RUNNING       0:08 5-00:00:00      1 c1

scontrol works fine:

[root@ernie /]# scontrol show job 2
JobId=2 JobName=wrap
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4294901759 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:12 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2017-12-08T09:41:39 EligibleTime=2017-12-08T09:41:39
   StartTime=2017-12-08T09:41:39 EndTime=2017-12-13T09:41:39 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2017-12-08T09:41:39
   Partition=normal AllocNode:Sid=ernie:1
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c1
   BatchHost=localhost
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=500M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/
   StdErr=//slurm-2.out
   StdIn=/dev/null
   StdOut=//slurm-2.out
   Power=

However, sacct produces no output; the slurmdbd log shows that the table 'slurm_acct_db.linux_job_table' doesn't exist:

[root@ernie /]# sacct         
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
[root@ernie /]# cat /var/log/slurm/slurmdbd.log |tail
[2017-12-08T09:42:11.526] debug4: This could happen often and is expected.
mysql_query failed: 1146 Table 'slurm_acct_db.linux_job_table' doesn't exist
insert into "linux_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (2, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 0, 1, 0, 0, 'c1', 0, 7200, 1512726099, 1512726099, 1512726099, 'wrap', 0, 1, 4294901759, 1, 1, 9223372036854776308, 'normal', '0', NULL, 0, '1=1,2=500,3=18446744073709551614,4=1,5=1', '1=1,2=500,4=1', '/') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=0, id_user=0, id_group=0, nodelist='c1', id_resv=0, timelimit=7200, time_submit=1512726099, time_eligible=1512726099, time_start=1512726099, mod_time=UNIX_TIMESTAMP(), job_name='wrap', track_steps=0, id_qos=1, state=greatest(state, 1), priority=4294901759, cpus_req=1, nodes_alloc=1, mem_req=9223372036854776308, id_array_job=0, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, `partition`='normal', node_inx='0', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=500,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=500,4=1', work_dir='/'
[2017-12-08T09:42:11.526] error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist
[2017-12-08T09:42:11.526] error: It looks like the storage has gone away trying to reconnect
[2017-12-08T09:42:11.526] debug4: This could happen often and is expected.
mysql_query failed: 1146 Table 'slurm_acct_db.linux_job_table' doesn't exist
insert into "linux_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (2, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 0, 1, 0, 0, 'c1', 0, 7200, 1512726099, 1512726099, 1512726099, 'wrap', 0, 1, 4294901759, 1, 1, 9223372036854776308, 'normal', '0', NULL, 0, '1=1,2=500,3=18446744073709551614,4=1,5=1', '1=1,2=500,4=1', '/') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=0, id_user=0, id_group=0, nodelist='c1', id_resv=0, timelimit=7200, time_submit=1512726099, time_eligible=1512726099, time_start=1512726099, mod_time=UNIX_TIMESTAMP(), job_name='wrap', track_steps=0, id_qos=1, state=greatest(state, 1), priority=4294901759, cpus_req=1, nodes_alloc=1, mem_req=9223372036854776308, id_array_job=0, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, `partition`='normal', node_inx='0', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=500,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=500,4=1', work_dir='/'
[2017-12-08T09:42:11.526] error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist
[2017-12-08T09:42:11.526] DBD_JOB_START: cluster not registered

I cloned the repo and modified some settings in slurm.conf, to no avail. I have little experience setting up Slurm, so I'm unsure what changes need to be applied.
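For reference, these are the accounting-related settings in slurm.conf I've been looking at. I'm assuming the config lives at /etc/slurm/slurm.conf in the image, and the commented values are what I'd expect them to look like, not necessarily what the image actually ships with:

# Show the accounting-related settings in slurm.conf
# (the path and the expected values below are my assumptions)
grep -Ei 'ClusterName|AccountingStorage|JobAcctGather' /etc/slurm/slurm.conf
# Expected to be along the lines of:
#   ClusterName=linux
#   AccountingStorageType=accounting_storage/slurmdbd
#   AccountingStorageHost=localhost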

The issue has been reported before (e.g. http://thread.gmane.org/gmane.comp.distributed.slurm.devel/6333 and https://bugs.schedmd.com/show_bug.cgi?id=1943), and one proposed solution is to register the cluster and accounts with sacctmgr, which should make slurmdbd create the missing tables (the job table is named after the cluster, hence linux_job_table):

sacctmgr add cluster linux

sacctmgr add account none,test Cluster=linux \
  Description="none" Organization="none"

sacctmgr add user da DefaultAccount=test

However, the first command hangs in the container.
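In case it helps with debugging, this is how I've been checking whether the daemons are up inside the container (assuming slurmdbd and slurmctld are managed by supervisord, which seems to be the case from the repo):

# List the supervisord-managed processes and their state
supervisorctl status
# slurmdbd and slurmctld should show up as RUNNING here

# slurmdbd logs to this file, per the output above
tail /var/log/slurm/slurmdbd.log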

Do you have any idea for a solution?

Cheers,

Per

giovtorres commented 6 years ago

Hi @percyfal. I'll have a look. In the meantime, I have another project that splits the Slurm components into their own containers. It contains a script to add the cluster to the database (see its README), and sacct appears to work as expected there. You could give that a try.
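The cluster registration there boils down to something like the following, run from the host against the controller container (a sketch only; the container name is illustrative, and the project's README has the actual script):

# Add the cluster to the accounting database so slurmdbd creates its tables
# ("slurmctld" is an illustrative container name, not necessarily the real one)
docker exec slurmctld sacctmgr --immediate add cluster name=linux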

giovtorres commented 6 years ago

Hi @percyfal, I refactored the supervisor config to get the start order right. Inside the container, I can now run those commands without issue:

[root@ernie ~]# sacctmgr --immediate add cluster name=linux
 Adding Cluster(s)
  Name           = linux
[root@ernie ~]# 
[root@ernie ~]# 
[root@ernie ~]# supervisorctl restart slurmdbd
slurmdbd: stopped
slurmdbd: started
[root@ernie ~]# supervisorctl restart slurmctld
slurmctld: stopped
slurmctld: started
[root@ernie ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      5    unk c[1-5]
debug        up 5-00:00:00      5    unk c[6-10]
[root@ernie ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      5   idle c[1-5]
debug        up 5-00:00:00      5   idle c[6-10]
[root@ernie ~]#        
[root@ernie ~]# sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
     linux       127.0.0.1         6817  8192         1                                                                                           normal                  
[root@ernie ~]#     
[root@ernie ~]# sacctmgr add account none,test Cluster=linux Description="none" Organization="none"
 Adding Account(s)
  none
  test
 Settings
  Description     = none
  Organization    = none
 Associations
  A = none       C = linux     
  A = test       C = linux     
 Settings
  Parent        = root
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
[root@ernie ~]#      
[root@ernie ~]# sacctmgr show account 
   Account                Descr                  Org 
---------- -------------------- -------------------- 
      none                 none                 none 
      root default root account                 root 
      test                 none                 none 
[root@ernie ~]#
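To recap, the fix can be run as a one-shot inside the container, using the same commands in the same order as above (slurmdbd is restarted before slurmctld, so the controller registers against a database that already knows about the cluster):

# Register the cluster in the accounting database
sacctmgr --immediate add cluster name=linux

# Restart the daemons: slurmdbd first, then slurmctld
supervisorctl restart slurmdbd
supervisorctl restart slurmctld

# Verify: nodes should come back idle, and sacct should now record jobs
sinfo
sbatch --wrap "sleep 60"
sacct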

Could you give it a try now?

Thanks.

percyfal commented 6 years ago

Hi @giovtorres , I can confirm that it now works like a charm! Thanks!

/P