openhpc / ohpc

OpenHPC Integration, Packaging, and Test Repo
http://openhpc.community
Apache License 2.0

During installation of OHPC, running pdsh -w c[1] systemctl start gives an error #1918

Open farrukhndm opened 8 months ago

farrukhndm commented 8 months ago

Dear Team, can anyone help with the error below, which I am facing when starting munge on c1?

The commands below run without any error:

# systemctl enable munge   
# systemctl enable slurmctld
# systemctl start munge
# systemctl start slurmctld
# systemctl restart php-fpm
# pdsh -w c[1] systemctl start slurmd

The command below gives the following error. Can anyone help me?

[root@master ~]# pdsh -w c[1] systemctl start munge
c1: Job for munge.service failed because the control process exited with error code.
c1: See "systemctl status munge.service" and "journalctl -xe" for details.
pdsh@master: c1: ssh exited with exit code 1
adrianreber commented 8 months ago

@farrukhndm You need to check the error messages on the compute node. Please run systemctl status munge.service or journalctl -xe on c1.

martin-g commented 8 months ago

It might be a copy/paste thingy but pdsh -w c[1] systemctl start slurmd should really be pdsh -w $c[1] systemctl start slurmd, note the $ in $c[1]

farrukhndm commented 8 months ago

> It might be a copy/paste thingy but pdsh -w c[1] systemctl start slurmd should really be pdsh -w $c[1] systemctl start slurmd, note the $ in $c[1]

I ran it again with c[1], without the $, and it completed without any error. Does that mean it is OK, or should I verify anything on c1?

[root@master ~]# pdsh -w c[1] systemctl start slurmd
[root@master ~]#

farrukhndm commented 8 months ago

> @farrukhndm You need to check the error messages on the compute node. Please run systemctl status munge.service or journalctl -xe on c1.

Here is output 1:

[root@c1 ~]# systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor prese>
   Active: failed (Result: exit-code) since Tue 2023-12-19 21:09:12 EST; 8h ago
     Docs: man:munged(8)
  Process: 1160 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)

Dec 19 21:09:12 c1 systemd[1]: Starting MUNGE authentication service...
Dec 19 21:09:12 c1 munged[1174]: munged: Error: Failed to check logfile "/var/l>
Dec 19 21:09:12 c1 systemd[1]: munge.service: Control process exited, code=exit>
Dec 19 21:09:12 c1 systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 19 21:09:12 c1 systemd[1]: Failed to start MUNGE authentication service.
[root@c1 ~]#

Here is output 2:

`[root@c1 ~]# journalctl -xe
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit session-3.scope has finished starting up.
--
-- The start-up result is done.
Dec 20 06:06:03 c1 systemd[1]: Started Session 5 of user root.
-- Subject: Unit session-5.scope has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit session-5.scope has finished starting up.
--
-- The start-up result is done.
Dec 20 06:06:03 c1 systemd-logind[1141]: New session 5 of user root.
-- Subject: A new session 5 has been created for user root
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- Documentation: https://www.freedesktop.org/wiki/Software/systemd/multiseat
--
-- A new session with the ID 5 has been created for the user root.
--
-- The leading process of the session is 1546.
`
martin-g commented 8 months ago

> I ran it again with c[1], without the $, and it completed without any error. Does that mean it is OK, or should I verify anything on c1?

Why without again? Your first message was already without the $. This time you were supposed to try with it!

martin-g commented 8 months ago

> Dec 19 21:09:12 c1 munged[1174]: munged: Error: Failed to check logfile "/var/l>

I think this is the cause. The message is truncated, so it is not clear which file exactly is problematic.
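The trailing `>` marks where the status output was cut to fit the terminal width. A minimal sketch for getting the full message on c1, assuming standard systemd tooling is installed there; the function name `show_munge_error` is only for illustration:

```shell
# Print the munge unit status and recent journal entries without truncation.
# show_munge_error is a hypothetical helper name, not part of OpenHPC or munge.
show_munge_error() {
    # --full disables shortening of long lines; --no-pager skips the pager
    systemctl status --full --no-pager munge.service
    # -u restricts output to the munge unit; -n 20 shows the last 20 entries
    journalctl --no-pager -u munge.service -n 20
}
```

Running `show_munge_error` as root on c1 should show the complete logfile path and the reason munged rejected it.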

farrukhndm commented 8 months ago

> > I ran it again with c[1], without the $, and it completed without any error. Does that mean it is OK, or should I verify anything on c1?
>
> Why without again? Your first message was already without the $. This time you were supposed to try with it!

Below is the output with $:

[root@master ~]# pdsh -w $c[1] systemctl start slurmd
1: ssh: connect to host 0.0.0.1 port 22: Invalid argument
pdsh@master: 1: ssh exited with exit code 255
[root@master ~]#

After this, I logged in on the c1 node:

login as: root
root@192.168.1.253's password:
[root@c1 ~]# systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor prese>
   Active: failed (Result: exit-code) since Wed 2023-12-20 12:40:00 EST; 4min 2>
     Docs: man:munged(8)
  Process: 1152 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)

Dec 20 12:39:59 c1 systemd[1]: Starting MUNGE authentication service...
Dec 20 12:39:59 c1 munged[1167]: munged: Error: Failed to check logfile "/var/l>
Dec 20 12:40:00 c1 systemd[1]: munge.service: Control process exited, code=exit>
Dec 20 12:40:00 c1 systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 20 12:40:00 c1 systemd[1]: Failed to start MUNGE authentication service.

martin-g commented 8 months ago

How do you define the c array? Do you use the input.local templates? For example, at https://github.com/openhpc/ohpc/blob/79ad004f5f7f491cf1265257b92b9e60c62ae578/docs/recipes/install/rocky9/input.local.template#L95-L98 you can see how c_ip is defined. It seems you use something custom, because the templates use c_name and c_ip as array names.

About the actual error - Dec 20 12:39:59 c1 munged[1167]: munged: Error: Failed to check logfile "/var/l>: I think it is caused by a permissions issue with /var/log/munge/munged.log. What is the output of ls -alR /var/log/munge*? What is the error if you try to run munged manually, i.e. without systemd?

martin-g commented 8 months ago

Actually, I was mistaken about pdsh! It is smart enough to deal with c[1], so there is no need for the $. You can focus only on the munged failure on the compute node.
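The difference between the two spellings comes from what the shell does before pdsh ever sees the argument. A small sketch, assuming no variable named `c` is set in the shell; `show_target` is a hypothetical helper used only for this illustration:

```shell
# Show what pdsh would receive as its -w argument for each spelling.
show_target() {
    printf '%s\n' "$1"
}

unset c                                 # assumption: no variable named c exists

without_dollar=$(show_target 'c[1]')    # quoted so the shell cannot glob it
with_dollar=$(show_target "$c[1]")      # $c expands to nothing, "[1]" remains

echo "without \$: $without_dollar"      # -> without $: c[1]
echo "with \$:    $with_dollar"         # -> with $:    [1]
```

In the first case pdsh receives `c[1]` and expands the bracket range itself to `c1`. In the second case it receives only `[1]`, which would explain the attempt to reach host `1` (0.0.0.1) shown earlier in this thread.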

martin-g commented 8 months ago

Please also try:

[root@master ~]# pdsh -l root -w c[1] systemctl start slurmd

farrukhndm commented 8 months ago

> Actually, I was mistaken about pdsh! It is smart enough to deal with c[1], so there is no need for the $. You can focus only on the munged failure on the compute node.

It's OK, it worked fine without the $, as shown below. I will now check the munge error and update you.

[root@master ~]# pdsh -w c[1] systemctl start slurmd
[root@master ~]#

farrukhndm commented 8 months ago

> How do you define the c array? Do you use the input.local templates? For example, at https://github.com/openhpc/ohpc/blob/79ad004f5f7f491cf1265257b92b9e60c62ae578/docs/recipes/install/rocky9/input.local.template#L95-L98 you can see how c_ip is defined. It seems you use something custom, because the templates use c_name and c_ip as array names.
>
> About the actual error - Dec 20 12:39:59 c1 munged[1167]: munged: Error: Failed to check logfile "/var/l>: I think it is caused by a permissions issue with /var/log/munge/munged.log. What is the output of ls -alR /var/log/munge*? What is the error if you try to run munged manually, i.e. without systemd?

Here is the output:


[root@master ~]# ls -alR /var/log/munge*
/var/log/munge:
total 8
drwx------   2 munge munge   51 Dec 20 06:45 .
drwxr-xr-x. 21 root  root  4096 Dec 21 13:34 ..
-rw-r-----   1 munge munge    0 Dec 20 06:45 munged.log
-rw-r-----   1 munge munge 1736 Dec 20 06:45 munged.log-20231220

The error is still the same on c1:

root@192.168.1.253's password:
Last login: Wed Dec 20 12:43:27 2023 from 192.168.1.200
[root@c1 ~]# systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor prese>
   Active: failed (Result: exit-code) since Wed 2023-12-20 12:40:00 EST; 14min >
     Docs: man:munged(8)
  Process: 1152 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)

Dec 20 12:39:59 c1 systemd[1]: Starting MUNGE authentication service...
Dec 20 12:39:59 c1 munged[1167]: munged: Error: Failed to check logfile "/var/l>
Dec 20 12:40:00 c1 systemd[1]: munge.service: Control process exited, code=exit>
Dec 20 12:40:00 c1 systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 20 12:40:00 c1 systemd[1]: Failed to start MUNGE authentication service.
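Note that the `ls -alR` listing in this comment carries the `[root@master ~]` prompt, while the failing munged runs on c1, so the same check is worth repeating on the compute node. Because the error mentions a failed check of the logfile, the ownership and mode of each directory leading to it may matter as well as the file itself. A minimal sketch, assuming GNU `stat` is available on c1; `report_path_perms` is a hypothetical helper name:

```shell
# Print owner:group, octal mode and name for a path and each parent directory.
# Assumes GNU stat (the -c format option); not part of munge or OpenHPC.
report_path_perms() {
    path=$1
    while :; do
        # %U owner, %G group, %a octal permissions, %n file name
        stat -c '%U:%G %a %n' "$path"
        # Stop at the filesystem root, or at "." for a relative path
        case $path in
            /|.) break ;;
        esac
        path=$(dirname "$path")
    done
}
```

On c1 this could be called as `report_path_perms /var/log/munge/munged.log`, and the result compared with the same call on master, where munge starts without error.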

github-actions[bot] commented 1 month ago

A friendly reminder that this issue had no activity for 30 days.