luoyuqi-lab opened 1 year ago
OK, just make sure the mounting will not fail again
Well, so has the suspicious task been killed or not? You don't wanna leave trash processes in the system.
Can #107 solve this problem? I guess even if you don't check the servers, after a number of days the server reboots and the problem should go away.
Or would you like to make a script so that whenever a server goes into the drained state, the script runs your scontrol command?
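A minimal sketch of such a watchdog, run e.g. from cron. This is an assumption-laden illustration, not a tested tool: on a real cluster the node list would come from `sinfo -h -N -o "%N %t"`, and `DRY_RUN` would be unset so `scontrol` actually executes; here the `sinfo` output is stubbed and the command is only printed.

```shell
#!/usr/bin/env bash
# Sketch: auto-resume nodes that Slurm marked "drain".
# Assumption: SINFO_OUT would really come from: sinfo -h -N -o "%N %t"
DRY_RUN=1
SINFO_OUT="dagobah drain
hoth idle"

RESULT=""
while read -r node state; do
  if [ "$state" = "drain" ]; then
    cmd="scontrol update nodename=$node state=resume"
    if [ -n "$DRY_RUN" ]; then
      # Dry run: show what would be executed instead of running it.
      RESULT="would run: $cmd"
      echo "$RESULT"
    else
      $cmd
    fi
  fi
done <<< "$SINFO_OUT"
```

Note that `state=resume` only clears the drain flag; it does not fix whatever made the kill task fail, so a watchdog like this can loop forever on a genuinely sick node.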
In that case, the drained state seems useless to us. Can Slurm completely ignore the nuisance drain state?
Rather than addition, I would prefer subtraction. I mean, instead of adding a script to solve the problem, can we compile a version of Slurm that doesn't have the stupid drain status at all?
Before digging into this, remember to finish the high-priority tasks first.
Check the logs on the login node aha (/var/log/slurmctld.log) and on the compute node dagobah (/var/log/slurmd.log). As mentioned before, on the compute node the kill task fails, which causes the node to go into the drain state. The task was stuck by the oom-killer.
On the newly resumed dagobah we found there are now no tasks, but swap is full and many processes in status 'D' (uninterruptible sleep) occupy the VIRT memory. We cannot even kill them manually.
We speculate that wrong/improper code from users causes some tasks to become unkillable, and Slurm then gets stuck in the drain status.
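One way to spot those stuck processes is to filter `ps` output on the state column; a small sketch below, with the `ps` output stubbed so the filter is visible (on a live node it would come from `ps -eo pid,stat,comm --no-headers`):

```shell
#!/usr/bin/env bash
# Sketch: list processes in uninterruptible sleep (state D).
# Assumption: PS_OUT would really come from: ps -eo pid,stat,comm --no-headers
PS_OUT="101 S bash
202 D python
303 D+ conda
404 R slurmd"

# D-state processes ignore signals (even SIGKILL) until the blocked
# I/O returns, which is why they cannot be killed manually.
DSTATE=$(awk '$2 ~ /^D/ {print $1, $3}' <<< "$PS_OUT")
echo "$DSTATE"
```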
Solutions: (1) Manually resume a drained node. (2) Write a script to auto-resume drained nodes. (3) Reboot nodes regularly to clear the uninterruptible-sleep threads. (4) Ask users to stop submitting tasks with buggy/flawed code to the cluster.
BTW, why does conda list --json make the node get stuck...... :cold_sweat:
Well, interesting findings.
Perhaps send out an email telling users not to run that conda command, which quite a number of users have been running. I don't even know how they found this peculiar command.
Also, ask them to check for bugs in their Python code.
Do it right now, so Dagobah doesn't go into drain again. What do you think?
I agree, but first I will look for other commands that get stuck in memory; I suspect it's not only conda list --json.
During a regular check, I found hoth had become drained and could not be used.
Using sinfo -R to check, I got 'Kill task failed'. I checked the log in /var/log/slurmctld.log.
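To see how often this has been happening, one could count the 'Kill task failed' entries in that log; a small sketch, with the log contents stubbed (these sample lines are illustrative only, not verbatim slurmctld output — on the login node the input would be /var/log/slurmctld.log):

```shell
#!/usr/bin/env bash
# Sketch: count 'Kill task failed' entries in the slurmctld log.
# Assumption: LOG_SAMPLE stands in for /var/log/slurmctld.log;
# the sample lines below are illustrative, not real slurmctld output.
LOG_SAMPLE="error: node hoth: Kill task failed
update_node: node hoth reason set to: Kill task failed
sched: allocated JobId=43 to dagobah"

MATCHES=$(grep -c 'Kill task failed' <<< "$LOG_SAMPLE")
echo "$MATCHES"
```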
I'm suspicious of hoth, because a few days ago a shared-path mount failure also suddenly occurred on hoth. Maybe some flaws were introduced when we rebuilt it a few months ago.