giovtorres / slurm-docker-cluster

A Slurm cluster using docker-compose
MIT License
319 stars, 188 forks

slurmstepd zombie process remains after running job on slurm cluster #36

Closed zhaohui714 closed 2 months ago

zhaohui714 commented 1 year ago

After deploying the Slurm cluster, I entered the slurmctld container and submitted a job as follows.

[root@slurmctld /]# cd /data
[root@slurmctld data]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 5-00:00:00 2 idle c[1-2]
[root@slurmctld data]# sbatch --wrap="uptime"
Submitted batch job 1
[root@slurmctld data]# ls
slurm-1.out

After the job has run, searching for Slurm processes on the host machine shows a leftover slurmstepd zombie:

$ ps aux | grep slurm
990 43820 0.0 0.1 243368 7236 ? Ssl 20:23 0:00 /usr/sbin/slurmdbd -Dvvv
990 43981 0.2 0.2 904004 11036 ? Ssl 20:23 0:02 /usr/sbin/slurmctld -i -Dvvv
root 44128 0.0 0.1 130864 5172 ? Ss 20:23 0:00 /usr/sbin/slurmd -Dvvv
root 44209 0.0 0.1 131892 5348 ? Ss 20:23 0:00 /usr/sbin/slurmd -Dvvv
990 44563 0.0 0.0 21600 2728 ? S 20:24 0:00 slurmctld: slurmscriptd
root 44575 0.0 0.1 8920 5492 pts/0 S+ 20:24 0:00 sudo docker exec -ti slurmctld bash
root 44576 0.0 0.0 8920 884 pts/2 Ss 20:24 0:00 sudo docker exec -ti slurmctld bash
root 44577 0.0 0.7 1328360 31112 pts/2 Sl+ 20:24 0:00 docker exec -ti slurmctld bash
root 44659 0.0 0.0 0 0 ? Z 20:25 0:00 [slurmstepd]

giovtorres commented 2 months ago

Were you able to troubleshoot this? I was not able to replicate it. Perhaps you could get clues from the log files?

Closing for now. Please feel free to respond and/or reopen the issue. Thanks.

jonsv322 commented 1 month ago

Adding "init: true" to the c1 and c2 services in your docker-compose file will solve the issue.
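
For anyone landing here later, the change might look roughly like the fragment below. This is only a sketch: the service names c1 and c2 come from this thread, while the placeholder comments stand in for whatever settings your compose file already has, and it assumes your Compose file format version supports the init key. With init enabled, Docker runs a small init process as PID 1 inside the container, and that process reaps exited children such as slurmstepd instead of leaving them as zombies.

```yaml
# Fragment of docker-compose.yml; only the added key is shown.
# Everything else under c1 and c2 is assumed to stay unchanged.
services:
  c1:
    # ... existing c1 settings stay as they are ...
    init: true   # run an init process as PID 1 so exited children get reaped
  c2:
    # ... existing c2 settings stay as they are ...
    init: true   # same setting for the second compute node
```

The containers need to be recreated for the setting to take effect. Submitting a job again and rerunning ps aux | grep slurm on the host should then show no [slurmstepd] entry in state Z.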