lyulyul / shine-cluster

Simple High performance Infrastructure for Neural network Experiments
GNU General Public License v3.0

ReqNodeNotAvail, UnavailableNodes:hoth #170

Closed by o0mahan0o 1 year ago

o0mahan0o commented 1 year ago

As titled.

Where it happened

Submitting a Slurm script on aha.

What Happened

ReqNodeNotAvail, UnavailableNodes:hoth

What I've Tried

sit -w hoth -g works, but sit -w hoth -g -p normal reports an error. Details:

  1. At first, NODELIST(REASON) shows: image
  2. After a few seconds, NODELIST(REASON) changes to: image

o0mahan0o commented 1 year ago

An answer that might help:

We recently changed the max. time limit for our most commonly used partitions to be only 48 hours (down from 8 days), so I wasn't thinking about the smaller, less used partitions that still had longer time limits that could intersect with the maintenance window. https://bugs.schedmd.com/show_bug.cgi?id=5138
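The quoted answer describes one possible cause: a job whose time limit reaches into a maintenance reservation cannot start on the reserved nodes until the reservation ends. The sketch below only illustrates that overlap check. The function name, the dates, and the 8-hour maintenance window are hypothetical, and Slurm's real scheduling logic is more involved than this.

```python
from datetime import datetime, timedelta


def overlaps_reservation(start: datetime, time_limit: timedelta,
                         resv_start: datetime, resv_end: datetime) -> bool:
    """Return True if a job starting at `start` with the given time limit
    could still be running while the reservation is active."""
    job_end = start + time_limit
    return start < resv_end and job_end > resv_start


# Hypothetical numbers: maintenance begins 3 days from "now" and lasts 8 hours.
now = datetime(2024, 1, 1, 9, 0)
maintenance_start = now + timedelta(days=3)
maintenance_end = maintenance_start + timedelta(hours=8)

# A 48-hour limit ends before maintenance starts; an 8-day limit does not.
short_job_blocked = overlaps_reservation(
    now, timedelta(hours=48), maintenance_start, maintenance_end)
long_job_blocked = overlaps_reservation(
    now, timedelta(days=8), maintenance_start, maintenance_end)
print(short_job_blocked, long_job_blocked)
```

With these assumed numbers, only the job with the 8-day limit would be held back, which matches the situation the quote describes for partitions that still had longer time limits.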

o0mahan0o commented 1 year ago

Solved. After waiting in the queue, I was able to run a task on the normal partition. My guess is that the normal queue was busy while the speedy queue was idle.

o0mahan0o commented 1 year ago

At the beginning, NODELIST(REASON) showed Priority. This means that one or more higher-priority jobs were ahead of this one in the partition, so resources were being allocated to them first.
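When watching a pending job, it can help to split the NODELIST(REASON) value into the reason code and any listed nodes. The following is a minimal sketch, assuming the field looks like the strings quoted in this thread. The helper name parse_reason and the REASON_HINTS table are hypothetical, and the hints are short paraphrases rather than official wording.

```python
import re
from typing import Optional, Tuple

# Short paraphrases of the two pending-reason codes seen in this thread.
REASON_HINTS = {
    "Priority": "higher-priority jobs are ahead of this one in the partition",
    "ReqNodeNotAvail": "a specifically requested node is not currently available",
}


def parse_reason(field: str) -> Tuple[str, Optional[str]]:
    """Split a value such as '(ReqNodeNotAvail, UnavailableNodes:hoth)'
    into (reason_code, unavailable_node_expression_or_None)."""
    text = field.strip()
    # Pending jobs show the reason wrapped in parentheses.
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    reason, _, rest = text.partition(",")
    match = re.search(r"UnavailableNodes:(\S+)", rest)
    nodes = match.group(1) if match else None
    return reason.strip(), nodes


reason, nodes = parse_reason("(ReqNodeNotAvail, UnavailableNodes:hoth)")
print(reason, nodes, REASON_HINTS.get(reason, "unknown reason"))
```

The node expression is returned as a single string because Slurm can abbreviate node lists with bracket ranges, and expanding those is outside the scope of this sketch.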