Open o0mahan0o opened 1 year ago
For all admins: before digging into this, remember to handle the high-priority ones.
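I saw your PD (pending) jobs.
Two of them are pending because you requested hoth, but all of hoth's GPUs are already in use. The other one is pending because dagobah is in the drng state and will not accept new jobs.
Server availability has been unreliable recently; coruscant and eureka have also been offline for a long time.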
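To check why jobs are pending, as in the explanation above, `squeue` can print the scheduler's reason for each job. The following is a minimal sketch, not the admins' actual procedure; the helper name `count_pending_reasons` is hypothetical, and it assumes input lines of the form `<jobid> <reason>`.

```shell
# Hypothetical helper: reads "<jobid> <reason>" lines on stdin and
# prints each distinct reason with the number of jobs that have it.
count_pending_reasons() {
    awk 'NF { $1 = ""; sub(/^ +/, ""); count[$0]++ }
         END { for (r in count) print r, count[r] }' | sort
}
```

Usage, assuming Slurm's `squeue` is on the path: `squeue -h -u "$USER" -t PD -o "%i %r" | count_pending_reasons`. Here `%i` is the job ID and `%r` is the pending reason.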
Recently, hoth and dagobah have frequently gone into the 'drain' state, more than 5 times. I don't know why; over the last year they went into 'drain' maybe only 1 or 2 times. Maybe a regular reboot script could solve it, I guess.
I recommend doing the following.
It looks like the node dagobah sometimes changes its state from mix to drng. In other words, regardless of whether resources are available, the nodes cannot stay in operation for long periods.
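One way to spot nodes that have slipped into a draining state is to filter the per-node output of `sinfo`. This is a rough sketch under assumptions: the helper name `draining_nodes` is hypothetical, and it expects `<node> <state>` lines, matching both the long (`draining`, `drained`) and compact (`drng`, `drain`) state names.

```shell
# Hypothetical helper: reads "<node> <state>" lines on stdin and prints
# the names of nodes whose state starts with "drain" or "drng".
draining_nodes() {
    awk '$2 ~ /^(drain|drng)/ { print $1 }' | sort -u
}
```

Usage, assuming Slurm's `sinfo` is available: `sinfo -N -h -o "%N %T" | draining_nodes`. Running `sinfo -R` may also help, since it lists the reason recorded for down or drained nodes, which could hint at why the state keeps changing.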
Where it happened
The problem occurred on node ada when submitting a task.
What Happened
The node dagobah cannot be used.
check status
![image](https://user-images.githubusercontent.com/39305294/230545600-82096c53-1835-4aa5-ac49-c966ee59329a.png)
sinfoF
What I've Tried
restore temporarily (requires administrator privileges)
![image](https://user-images.githubusercontent.com/39305294/230545632-96e113c4-4bdc-4ab8-96b7-ba487faa309e.png)
sudo scontrol update nodename=dagobah state=resume
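If several nodes need the same temporary fix, the resume command above could be wrapped in a small loop. This is only a sketch: the function name `resume_nodes` and the `SCONTROL` override variable are hypothetical, added so the loop can be dry-run with `echo` before touching the cluster.

```shell
# SCONTROL is a hypothetical override; by default it runs the real command
# with administrator privileges. Set SCONTROL=echo for a dry run.
SCONTROL="${SCONTROL:-sudo scontrol}"

# Hypothetical helper: reads one node name per line on stdin and
# issues the same resume command as above for each of them.
resume_nodes() {
    while read -r node; do
        [ -n "$node" ] || continue   # skip blank lines
        $SCONTROL update nodename="$node" state=resume
    done
}
```

Usage: `printf 'dagobah\nhoth\n' | resume_nodes`. Resuming only clears the state; if a node is draining because of an underlying fault, it will probably drain again, so this does not replace finding the cause.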