usegalaxy-no / galaxyadmin

A repository for managing the work of the usegalaxy.no GalaxyAdmin team

slurm jobs stuck in CG state #74

Closed: tothuhien closed this issue 1 year ago

tothuhien commented 2 years ago

Some Slurm jobs have finished or were deleted by the user, but they are still shown in the CG (completing) state in the Slurm queue.
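
For reference, a minimal way to list the jobs stuck in the completing state and the nodes they sit on, assuming the standard Slurm client tools are available on the submit host:

    squeue --states=CG --format="%i %u %T %N"   # job id, user, state, node list for completing jobs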

kjetilkl commented 2 years ago

I tried restarting slurmctld on the main node and slurmd on slurm.usegalaxy.no. That did not fix things, but sometime later the slurm.usegalaxy.no node went down for some reason, and that seems to have cleared up the CG jobs, which had all been running on this node.
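
For the record, the restarts above amount to roughly the following (a sketch assuming the daemons run as systemd services, which may differ from how the playbooks actually manage them):

    # on the main node (Slurm controller)
    sudo systemctl restart slurmctld
    # on the compute node slurm.usegalaxy.no
    sudo systemctl restart slurmd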

kjetilkl commented 2 years ago

The slurm.usegalaxy.no node is "not responding" at the moment, however

kjellp commented 2 years ago

The slurm.usegalaxy.no host is answering ping from usegalaxy.no (logged in as galaxyadmin).

I also tried to ssh into slurm.usegalaxy.no, and the ssh prompt responded immediately, so it seems like the slurm.usegalaxy.no node is responsive now?

Are there still issues, and do we need assistance from Stanislav, who has access to the server from the hypervisor side?

K.

kjetilkl commented 2 years ago

Yes, everything seems to be in order with slurm.usegalaxy.no, but it still has an asterisk after the node state reported by sinfo, which means that it is not responding. Assistance from someone with better Slurm skills would be nice!
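
A sketch of how to inspect that state from the controller with standard Slurm commands (nothing here is specific to our setup):

    sinfo -N -l                            # a '*' after the state means slurmctld gets no response from slurmd
    scontrol show node slurm.usegalaxy.no  # State, Reason and SlurmdStartTime give more detail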

sveinugu commented 2 years ago

Assistance from someone with better Slurm skills would be nice!

So who would that be? @emrobe? If I remember correctly, @nemyxun was also supposed to increase his slurm skills. How urgent is this?

nemyxun commented 2 years ago

Hello All,

@sveinugu Sveinung, I only have SU access on host test.usegalaxy.no, via SSH key.

IMHO the fastest, but not the best, solution is to restart the SLURM master host. But it would be nice to find the root of the problem.

THX

kjellp commented 2 years ago

Don't forget that Teshome from NMBU has joined and may be able to help on Slurm matters. He is probably not seeing these messages, though!

kjetilkl commented 2 years ago

I have restored sysadmin access to the production stack for Lev, Sveinung and Hien. I had probably just hacked the authorized_keys file by hand last time and forgot to do it properly via the playbooks, so it was overwritten again during the summer.

sveinugu commented 2 years ago

Don't forget that Teshome from NMBU has joined and may be able to help on Slurm matters. He is probably not seeing these messages, though!

@teshomem?

teshomem commented 2 years ago

CG state means the node that runs the job is not responding; in many cases you can't even SSH to the node. The solution that worked for me is to first restart the node and then restart slurmd on the node. The (*) indicates that the node is unreachable, which happens when the slurmd daemon on the node isn't started.
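
Under that assumption, the sequence would look roughly like this (a hedged sketch; the node name is taken from the thread and the systemd unit names are assumed):

    # on the affected node, after rebooting it
    sudo systemctl restart slurmd
    # on the controller, once slurmd answers again, clear the not-responding/DOWN flag
    sudo scontrol update nodename=slurm.usegalaxy.no state=resume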

emrobe commented 2 years ago

I've just checked and restarted all services involved. This seems to be something else.

scontrol: error: slurm_slurmd_info: Connection refused

teshomem commented 2 years ago

If you are using Munge, make sure it is started. If that doesn't help, share the slurmctld log from the slurmctld server.
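
A minimal check along those lines, assuming munge runs under systemd and the standard tools are installed:

    systemctl status munge                           # must be active on the controller and on every node
    munge -n | unmunge                               # round-trips a credential to verify the local munge setup
    scontrol show config | grep -i SlurmctldLogFile  # shows where the controller log is written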

emrobe commented 2 years ago

Slurmctld is filled with these:

    [2022-08-26T13:36:17.521] error: Nodes ecc3.usegalaxy.no not responding
    [2022-08-26T13:38:01.718] error: Node ecc1.usegalaxy.no appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
    [2022-08-26T13:38:01.718] error: Node nrec2.usegalaxy.no appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
    [2022-08-26T13:38:01.720] error: Node ecc2.usegalaxy.no appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
    [2022-08-26T13:41:17.003] error: Nodes ecc3.usegalaxy.no not responding

... and I am 90% sure it stems from the fact that ecc somehow behaves in a non-deterministic fashion, which screws up the various .conf files involved on the slurm nodes as it adds/removes nodes. We should make fixing/looking into this a priority (or even consider whether we need this extra layer of complexity in our system).
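
One way to confirm the slurm.conf drift the log complains about (a sketch; the config path /etc/slurm/slurm.conf is an assumption, and the host names are the ones from the log above):

    # compare slurm.conf checksums between the controller and the complaining nodes
    md5sum /etc/slurm/slurm.conf
    ssh ecc1.usegalaxy.no md5sum /etc/slurm/slurm.conf
    ssh nrec2.usegalaxy.no md5sum /etc/slurm/slurm.conf
    ssh ecc2.usegalaxy.no md5sum /etc/slurm/slurm.conf
    # after syncing the files, make the daemons re-read their configuration
    sudo scontrol reconfigure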

teshomem commented 2 years ago

From the log file, I am fairly sure that slurm.conf isn't in sync between the slurm daemon and the nodes. That explains why the slurmd daemon cannot connect to the node. This can be solved by using a shared slurm.conf on a shared file system.
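
Two possible directions along those lines, neither of which I have verified on our stack: point every node's /etc/slurm/slurm.conf at one copy on a shared filesystem, or use Slurm's "configless" mode (available since Slurm 20.02), in which slurmd fetches the config from slurmctld. Roughly (paths and the controller host name are placeholders):

    # option 1: every node symlinks its config to one shared copy (the NFS path is a made-up example)
    ln -sf /shared/slurm/slurm.conf /etc/slurm/slurm.conf

    # option 2: configless mode
    # in slurm.conf on the controller:
    #   SlurmctldParameters=enable_configless
    # and start slurmd on the compute nodes with:
    #   slurmd --conf-server <controller-host>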