lyulyul / shine-cluster

Simple High performance Infrastructure for Neural network Experiments
GNU General Public License v3.0

Node dagobah changes its status from mix to drng. #168

Open o0mahan0o opened 1 year ago

o0mahan0o commented 1 year ago

It looks like the node dagobah sometimes changes its status from mix to drng. Whether or not resources are available, the node cannot stay in service for long.

Where It Happened

The problem occurred on node ada while submitting a task.

What Happened

Node dagobah cannot be used. [image]

Status check with sinfoF: [image]

What I've Tried

Temporarily restored the node (requires administrator privileges):

sudo scontrol update nodename=dagobah state=resume

[image]
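Before resuming a node, it may help to note the state and drain reason that Slurm reports for it. Below is a minimal Python sketch, assuming `scontrol show node` prints `State=` and `Reason=` fields as it does in common Slurm releases. The helper names `parse_node_state` and `show_node` are made up for illustration and are not part of shine-cluster.

```python
import re
import subprocess


def parse_node_state(text):
    """Extract the State and Reason fields from `scontrol show node` output."""
    state = re.search(r"\bState=(\S+)", text)
    # The reason may contain spaces, so capture up to the end of its line.
    reason = re.search(r"\bReason=(.*)", text)
    return (
        state.group(1) if state else None,
        reason.group(1).strip() if reason else None,
    )


def show_node(name):
    """Query one node through scontrol (requires Slurm commands on PATH)."""
    result = subprocess.run(
        ["scontrol", "show", "node", name],
        capture_output=True, text=True, check=True,
    )
    return parse_node_state(result.stdout)


# Hypothetical usage on a login node:
# state, reason = show_node("dagobah")
```

If `Reason` comes back as `None`, the node most likely has no drain reason recorded at that moment.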

gqqnbig commented 1 year ago

#166. I bet my five cents that other nodes will get their own dedicated issues about being drained.

gqqnbig commented 1 year ago

To all admins: before digging into this, remember to finish the high-priority issues first.

Lu-233 commented 1 year ago

I saw your PD (pending) tasks.

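Two of them are pending because you requested hoth, but all of hoth's GPUs are already in use. The other one is pending because dagobah is in the drng state and no longer accepts new tasks.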
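Users can look up the pending reason themselves. Here is a rough Python sketch, assuming `squeue` is on PATH and using its `%i` (job ID) and `%r` (reason) format fields. The function names and the `|` separator are choices made for this example only.

```python
import subprocess


def parse_pending(text):
    """Parse `squeue -h -o "%i|%r"` output into (job_id, reason) pairs."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first "|" only; the reason itself may contain any text.
        job_id, _, reason = line.partition("|")
        jobs.append((job_id, reason))
    return jobs


def pending_jobs(user):
    """List the pending (PD) jobs of one user together with Slurm's reason."""
    result = subprocess.run(
        ["squeue", "-h", "-t", "PD", "-u", user, "-o", "%i|%r"],
        capture_output=True, text=True, check=True,
    )
    return parse_pending(result.stdout)


# Hypothetical usage:
# for job_id, reason in pending_jobs("o0mahan0o"):
#     print(job_id, reason)
```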

Lu-233 commented 1 year ago

Recently the availability of the servers has not been reliable. coruscant and eureka have also been offline for a long time.

luoyuqi-lab commented 1 year ago

Recently, hoth and dagobah have frequently gone into the 'drain' status, more than 5 times. I don't know why; in the last year they drained maybe only 1 or 2 times. I guess a regular reboot script might solve it.

gqqnbig commented 1 year ago

I recommend doing the following.

  1. From the system log, check who or which job causes the drain status.
  2. Get the program from that person. Run it, and confirm that it causes the drain status.
  3. Create a GitHub action that runs the culprit program. The test result should be ❌.
  4. Figure out a solution.
  5. Commit your solution. The test result should be ✅.
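The pass/fail check in steps 3 and 5 could be sketched roughly as below. It assumes the CI runner can call `sinfo`. The function names are hypothetical, and the drain test is a plain substring match on the state names seen in this thread (`drng`, `drain`) and their long forms.

```python
import subprocess


def is_draining(state):
    """True for drain-related states, e.g. 'drng', 'drained', 'MIXED+DRAIN'."""
    s = state.lower()
    return "drain" in s or "drng" in s


def node_state(name):
    """Ask sinfo for the state of one node (requires Slurm commands on PATH)."""
    result = subprocess.run(
        ["sinfo", "-h", "-n", name, "-o", "%T"],
        capture_output=True, text=True, check=True,
    )
    # A node that belongs to several partitions may be listed more than once.
    states = result.stdout.split()
    return states[0] if states else ""


def check(name, get_state=node_state):
    """Return an exit code: 1 if the node is drained (test fails), else 0."""
    return 1 if is_draining(get_state(name)) else 0
```

A CI step could run the culprit program first and then exit with `check("dagobah")`, so the job shows ❌ while the bug is present and ✅ once it is fixed.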