lyulyul / shine-cluster

Simple High performance Infrastructure for Neural network Experiments
GNU General Public License v3.0

Hoth went into 'drain' status because of "Kill task failed" #166

Open luoyuqi-lab opened 1 year ago

luoyuqi-lab commented 1 year ago

During a routine check, I found that hoth had gone into drain status and could not be used.

Checking with sinfo -R, I got the reason 'Kill task failed' for hoth, drained on 2022-11-07.

I checked the controller log (/var/log/slurmctld.log), which shows hoth failing to kill a job.
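For reference, the checks above boil down to something like this (a sketch only; it assumes the node name hoth and the default Slurm log path mentioned in this issue):

```bash
# List drained/down nodes together with the reason Slurm recorded for them.
sinfo -R

# Show hoth's full state, including the drain flag and Reason= field.
scontrol show node hoth | grep -Ei 'state|reason'

# Look for the failed kill in the controller log (message wording may differ
# between Slurm versions).
sudo grep -i 'kill task failed' /var/log/slurmctld.log | tail -n 20
```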

I'm suspicious of hoth, because a few days ago a shared-path mount failure also suddenly occurred on hoth. Maybe something went wrong when we rebuilt it a few months ago.

luoyuqi-lab commented 1 year ago

Now hoth is back to normal after running sudo scontrol update nodename=hoth state=resume.

hoth still needs more checking, though.
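For the record, resuming and then verifying the node looks roughly like this (the sinfo check is just an assumed sanity check, not part of the original report):

```bash
# Clear the drain flag on hoth.
sudo scontrol update nodename=hoth state=resume

# hoth should now report idle/alloc/mix instead of drain.
sinfo -n hoth
```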

Reference: slurmctld

gqqnbig commented 1 year ago

OK, just make sure the mount will not fail again.

gqqnbig commented 1 year ago

Well, so has the suspicious task been killed or not? You don't wanna leave trash processes in the system.

Can #107 solve this problem? I guess even if you don't check the servers, after a number of days the server reboots and the problem should go away.

Or would you like to write a script so that whenever a server goes into the drained state, the script runs your scontrol command?

In that case, the drained state seems useless to us. Can slurm completely ignore the nuisance drain state?
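If we do go the script route, a minimal sketch of that auto-resume idea could look like the one below; the reason string to match, the script path, and the cron schedule are all assumptions, not something that exists in this repo yet:

```bash
#!/usr/bin/env bash
# Hypothetical auto-resume script (a sketch of the idea discussed above).
# Run it from root's crontab, e.g.:
#   */10 * * * * /usr/local/sbin/auto-resume-drained.sh
set -euo pipefail

# sinfo -R lists one line per drain/down reason; print reason and nodelist
# separated by '|' so reasons containing spaces parse cleanly.
sinfo -R --noheader --format='%E|%N' |
while IFS='|' read -r reason nodelist; do
    # Only resume nodes drained by the known-benign "Kill task failed" case,
    # so genuinely broken nodes stay drained for a human to look at.
    if [[ "$reason" == *"Kill task failed"* ]]; then
        echo "Resuming $nodelist (reason: $reason)"
        scontrol update nodename="$nodelist" state=resume
    fi
done
```

Matching only that one reason is deliberate: it keeps the drained state meaningful for every other failure mode.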

Rather than going down the road of addition, I would prefer subtraction. I mean, instead of adding a script to solve the problem, can we compile a version of slurm that doesn't have the stupid drain status?


gqqnbig commented 1 year ago

Before digging into this, remember to take care of the high-priority tasks first.

luoyuqi-lab commented 1 year ago

Found the reason why the compute nodes keep draining:

  1. Checked the logs on the login node aha (/var/log/slurmctld.log) and on the compute node dagobah (/var/log/slurmd.log). As mentioned before, on the compute node the kill task fails, which causes the node to go into drain; the task was stuck because of an oom-kill.

  2. On the newly resumed dagobah there are now no tasks, but Swp is full and many processes in status 'D' (uninterruptible sleep) are occupying VIRT. We cannot even kill them manually (see the sketch after this list).

  3. We speculate that buggy/improper user code causes some tasks to become unkillable, and Slurm then gets stuck in the drain status.

  4. Possible solutions: (1) Manually resume a drained node. (2) Write a script to auto-resume drained nodes. (3) Regularly reboot nodes to clear the uninterruptible-sleep threads. (4) Ask users to stop submitting tasks with code bugs/flaws to the cluster.
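A rough sketch of the diagnostics behind items 1 and 2 (the log path is the default one mentioned above; the exact kernel oom-killer wording is an assumption):

```bash
# 1. Evidence of the oom-kill that triggered "Kill task failed".
sudo grep -i 'oom' /var/log/slurmd.log | tail -n 20
dmesg -T | grep -i 'killed process' | tail -n 20

# 2. Processes stuck in uninterruptible sleep ('D' state); SIGKILL cannot
#    remove these, which is why only a reboot clears them.
ps -eo pid,stat,vsz,user,cmd | awk '$2 ~ /^D/'

# Swap usage at a glance (the "Swp is full" observation above).
free -h
```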


BTW, why does conda list --json make the node get stuck...... :cold_sweat:

gqqnbig commented 1 year ago

Well, interesting findings. 

Perhaps send out an email telling users not to run that conda command, which has been run by quite a number of users. I don't even know how they found this peculiar command.

Also, ask them to check for bugs in their python code.

Do it right now, so Dagobah doesn't go drain again. What do you think?

luoyuqi-lab commented 1 year ago


I agree, but first I will look for other commands that get stuck in memory; I don't think it's only conda list --json.