Closed caicairay closed 6 years ago
Checked that icc passes the license check on cluster1 and cluster3.
Checked that maui has not been deleted, so hopefully it does not need to be reconfigured.
[zcao@mu01 ~]$ sudo yum install torque-server.x86_64
Loaded plugins: aliases, changelog, downloadonly, kabi, presto, product-id, refresh-packagekit, security, subscription-manager, tmprepo,
: verify, versionlock
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Loading support for Red Hat kernel ABI
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package torque-server.x86_64 0:4.2.10-9.el6 will be installed
--> Processing Dependency: torque-libs = 4.2.10-9.el6 for package: torque-server-4.2.10-9.el6.x86_64
--> Processing Dependency: munge for package: torque-server-4.2.10-9.el6.x86_64
--> Processing Dependency: libtorque.so.2()(64bit) for package: torque-server-4.2.10-9.el6.x86_64
--> Processing Dependency: libmunge.so.2()(64bit) for package: torque-server-4.2.10-9.el6.x86_64
--> Running transaction check
---> Package munge.x86_64 0:0.5.10-1.el6 will be installed
---> Package munge-libs.x86_64 0:0.5.10-1.el6 will be installed
---> Package torque-libs.x86_64 0:4.2.10-9.el6 will be installed
--> Processing Dependency: torque = 4.2.10-9.el6 for package: torque-libs-4.2.10-9.el6.x86_64
--> Running transaction check
---> Package torque.x86_64 0:4.2.10-9.el6 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
============================================================================================================================================= Package Arch Version Repository Size
=============================================================================================================================================Installing:
torque-server x86_64 4.2.10-9.el6 epel 303 k
Installing for dependencies:
munge x86_64 0.5.10-1.el6 epel 111 k
munge-libs x86_64 0.5.10-1.el6 epel 32 k
torque x86_64 4.2.10-9.el6 epel 82 k
torque-libs x86_64 4.2.10-9.el6 epel 127 k
Transaction Summary
=============================================================================================================================================
Install 5 Package(s)
Total download size: 655 k
Installed size: 1.5 M
Is this ok [y/N]: y
Downloading Packages:
Setting up and reading Presto delta metadata
Processing delta metadata
Package(s) data still to download: 655 k
(1/5): munge-0.5.10-1.el6.x86_64.rpm | 111 kB 00:00
(2/5): munge-libs-0.5.10-1.el6.x86_64.rpm | 32 kB 00:00
(3/5): torque-4.2.10-9.el6.x86_64.rpm | 82 kB 00:00
(4/5): torque-libs-4.2.10-9.el6.x86_64.rpm | 127 kB 00:00
(5/5): torque-server-4.2.10-9.el6.x86_64.rpm | 303 kB 00:00
---------------------------------------------------------------------------------------------------------------------------------------------
Total 1.8 MB/s | 655 kB 00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : munge-libs-0.5.10-1.el6.x86_64 1/5
Installing : munge-0.5.10-1.el6.x86_64 2/5
Installing : torque-4.2.10-9.el6.x86_64 3/5
Installing : torque-libs-4.2.10-9.el6.x86_64 4/5
Installing : torque-server-4.2.10-9.el6.x86_64 5/5
Verifying : torque-server-4.2.10-9.el6.x86_64 1/5
Verifying : munge-libs-0.5.10-1.el6.x86_64 2/5
Verifying : torque-4.2.10-9.el6.x86_64 3/5
Verifying : torque-libs-4.2.10-9.el6.x86_64 4/5
Verifying : munge-0.5.10-1.el6.x86_64 5/5
Installed:
torque-server.x86_64 0:4.2.10-9.el6
Dependency Installed:
munge.x86_64 0:0.5.10-1.el6 munge-libs.x86_64 0:0.5.10-1.el6 torque.x86_64 0:4.2.10-9.el6 torque-libs.x86_64 0:4.2.10-9.el6
Complete!
[zcao@mu01 torque]$ rpm -ql torque
/etc/rc.d/init.d/trqauthd
/etc/torque/pbs_environment
/etc/torque/server_name
/usr/sbin/trqauthd
/usr/share/doc/torque-4.2.10
/usr/share/doc/torque-4.2.10/CHANGELOG
/usr/share/doc/torque-4.2.10/PBS_License.txt
/usr/share/doc/torque-4.2.10/PBS_License_2.3.txt
/usr/share/doc/torque-4.2.10/README.Fedora
/usr/share/doc/torque-4.2.10/README.torque
/usr/share/doc/torque-4.2.10/Release_Notes
/usr/share/doc/torque-4.2.10/torque.setup
/usr/share/man/man1/pbs.1.gz
/var/lib/torque
/var/lib/torque/aux
/var/lib/torque/checkpoint
/var/lib/torque/pbs_environment
/var/lib/torque/server_name
/var/lib/torque/spool
/var/lib/torque/undelivered
[zcao@mu01 ~]$ sudo yum install torque-mom.x86_64
Loaded plugins: aliases, changelog, downloadonly, kabi, presto, product-id, refresh-packagekit, security, subscription-manager, tmprepo, verify, versionlock
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Loading support for Red Hat kernel ABI
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package torque-mom.x86_64 0:4.2.10-9.el6 will be installed
--> Processing Dependency: torque-libs = 4.2.10-9.el6 for package: torque-mom-4.2.10-9.el6.x86_64
--> Processing Dependency: munge for package: torque-mom-4.2.10-9.el6.x86_64
--> Processing Dependency: libtorque.so.2()(64bit) for package: torque-mom-4.2.10-9.el6.x86_64
--> Processing Dependency: libmunge.so.2()(64bit) for package: torque-mom-4.2.10-9.el6.x86_64
--> Running transaction check
---> Package munge.x86_64 0:0.5.10-1.el6 will be installed
---> Package munge-libs.x86_64 0:0.5.10-1.el6 will be installed
---> Package torque-libs.x86_64 0:4.2.10-9.el6 will be installed
--> Processing Dependency: torque = 4.2.10-9.el6 for package: torque-libs-4.2.10-9.el6.x86_64
--> Running transaction check
---> Package torque.x86_64 0:4.2.10-9.el6 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
==============================================================================================================================================================================================================
Package Arch Version Repository Size
==============================================================================================================================================================================================================
Installing:
torque-mom x86_64 4.2.10-9.el6 epel 253 k
Installing for dependencies:
munge x86_64 0.5.10-1.el6 epel 111 k
munge-libs x86_64 0.5.10-1.el6 epel 32 k
torque x86_64 4.2.10-9.el6 epel 82 k
torque-libs x86_64 4.2.10-9.el6 epel 127 k
Transaction Summary
==============================================================================================================================================================================================================
Install 5 Package(s)
Total download size: 605 k
Installed size: 1.4 M
Is this ok [y/N]: y
Downloading Packages:
Setting up and reading Presto delta metadata
epel/prestodelta | 2.2 kB 00:00
Processing delta metadata
Package(s) data still to download: 605 k
(1/5): munge-0.5.10-1.el6.x86_64.rpm | 111 kB 00:00
(2/5): munge-libs-0.5.10-1.el6.x86_64.rpm | 32 kB 00:00
(3/5): torque-4.2.10-9.el6.x86_64.rpm | 82 kB 00:00
(4/5): torque-libs-4.2.10-9.el6.x86_64.rpm | 127 kB 00:00
(5/5): torque-mom-4.2.10-9.el6.x86_64.rpm | 253 kB 00:00
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 831 kB/s | 605 kB 00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : munge-libs-0.5.10-1.el6.x86_64 1/5
Installing : munge-0.5.10-1.el6.x86_64 2/5
Installing : torque-libs-4.2.10-9.el6.x86_64 3/5
Installing : torque-4.2.10-9.el6.x86_64 4/5
Installing : torque-mom-4.2.10-9.el6.x86_64 5/5
Verifying : torque-mom-4.2.10-9.el6.x86_64 1/5
Verifying : torque-4.2.10-9.el6.x86_64 2/5
Verifying : torque-libs-4.2.10-9.el6.x86_64 3/5
Verifying : munge-libs-0.5.10-1.el6.x86_64 4/5
Verifying : munge-0.5.10-1.el6.x86_64 5/5
Installed:
torque-mom.x86_64 0:4.2.10-9.el6
Dependency Installed:
munge.x86_64 0:0.5.10-1.el6 munge-libs.x86_64 0:0.5.10-1.el6 torque.x86_64 0:4.2.10-9.el6 torque-libs.x86_64 0:4.2.10-9.el6
Complete!
[root@cu01 local]# rpm -ql torque-mom
/etc/rc.d/init.d/pbs_mom
/etc/torque/mom
/etc/torque/mom/config
/usr/bin/pbs_track
/usr/sbin/momctl
/usr/sbin/pbs_demux
/usr/sbin/pbs_mom
/usr/sbin/qnoded
/usr/share/man/man8/pbs_mom.8.gz
/var/lib/torque/mom_logs
/var/lib/torque/mom_priv
/var/lib/torque/mom_priv/config
/var/lib/torque/mom_priv/jobs
/var/log/torque
/var/log/torque/mom_logs
torque-mom failed to install on cu30 and cu39.
The cause was that the EPEL repository was not configured for yum on those two nodes.
EPEL has now been installed on cu30 and cu39, following https://www.tecmint.com/how-to-enable-epel-repository-for-rhel-centos-6-5/
torque-mom is now installed on all the compute nodes.
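A quick way to spot nodes like cu30 and cu39 before installing is to look for an epel entry in the yum repo list. The sketch below is only an illustration: has_epel is a made-up helper name, and it assumes the repo id printed by yum starts with "epel".

```shell
# Succeeds if the `yum repolist` output read from stdin contains a repo
# whose id starts with "epel" (optionally prefixed by yum's "*" or "!" marker).
# has_epel is a hypothetical helper, not part of yum or TORQUE.
has_epel() {
  grep -Eq '^[*!]?epel([/[:space:]]|$)'
}

# Possible use on a single node:
#   yum repolist enabled | has_epel || echo "EPEL missing on $(hostname)"
```

Run through pdsh, the same check would list every node that still lacks the repository.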
[zcao@mu01 ~]$ sudo pbs_server -t create
You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?y
[zcao@mu01 server_priv]$ sudo service trqauthd restart
Shutting down TORQUE Authorization Daemon: [ OK ]
Starting TORQUE Authorization Daemon: hostname: localhost
Active server name: localhost pbs_server port is: 15001
trqauthd daemonized - port /tmp/trqauthd-unix [ OK ]
[zcao@mu01 server_priv]$ sudo service pbs_server restart
Shutting down TORQUE Server: munge_encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory (6)
Unable to communicate with localhost(127.0.0.1)
Communication failure.
qterm: could not connect to server '' (15009) munge executable not found, unable to authenticate
[ OK ]
pbs_server is already running.
Found Error:
Shutting down TORQUE Server: munge_encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory (6)
[zcao@mu01 usr]$ /etc/init.d/munge status
munged is stopped
Message from Kin Fai
Dear Zhuo,
I have solved this. It was a permission problem that kept munge from starting; all the files named in the error message need to be chowned to munge:munge
https://github.com/dun/munge/issues/32
Afterwards, /etc/torque/server_name and /var/lib/torque/server_name also need to be edited so that both match $HOSTNAME
https://bugzilla.redhat.com/show_bug.cgi?id=1227003
and then both pbs_server and trqauthd need to be restarted.
I think you can try to continue the configuration.
Bests, Kin Fai
Problem Solved.
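For the record, the server_name half of the fix can be sketched as below. sync_server_name and its ROOT argument are made up for illustration (ROOT lets the function be tried against a scratch directory); only the two file paths come from the message above.

```shell
# Write NAME into both TORQUE server_name files under ROOT.
# Pass ROOT="" to act on the real system (needs root privileges).
# sync_server_name is a hypothetical helper, not a TORQUE command.
sync_server_name() {
  root="$1"
  name="$2"
  for f in "$root/etc/torque/server_name" "$root/var/lib/torque/server_name"; do
    printf '%s\n' "$name" > "$f" || return 1
  done
}

# Possible use on the head node, as root:
#   sync_server_name "" "$(hostname)"
```

After that, pbs_server and trqauthd are restarted as described in the message. Whether munged itself is answering can be checked with munge -n | unmunge.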
[zcao@mu01 ~]$ sudo pdsh -w cu[03-44] 'sed -ie 's/localhost/mu01/g' /var/lib/torque/mom_priv/config'
# The same change was made manually on cu01 and cu02.
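Note on the sed call above: the nested single quotes only work because the shell concatenates the pieces, and GNU sed reads -ie as -i with backup suffix e, so a copy named confige is left next to each config. A sketch that avoids both, assuming GNU sed (point_mom_at_server is a made-up name for illustration):

```shell
# Replace every "localhost" in FILE ($1) with SERVER ($2), editing in place.
# -i and -e are passed separately so that no backup file is created.
# point_mom_at_server is a hypothetical helper; GNU sed is assumed.
point_mom_at_server() {
  sed -i -e "s/localhost/$2/g" "$1"
}

# Equivalent one-liner over pdsh, with the quoting made explicit:
#   sudo pdsh -w 'cu[03-44]' "sed -i -e 's/localhost/mu01/g' /var/lib/torque/mom_priv/config"
```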
[root@cu01 mom_priv]# service pbs_mom restart
Shutting down TORQUE Mom: pbs_mom already stopped [ OK ]
Starting TORQUE Mom: pbs_mom: LOG_ERROR::No such file or directory (2) in read_layout_file, Unable to read the layout file in /var/lib/torque/mom_priv/mom.layout
pbs_mom: LOG_ERROR::setup_nodeboards, Could not read layout file!
[FAILED]
Found Error:
Starting TORQUE Mom: pbs_mom: LOG_ERROR::No such file or directory (2) in read_layout_file, Unable to read the layout file in /var/lib/torque/mom_priv/mom.layout
Similar issues: https://bugzilla.redhat.com/show_bug.cgi?id=1321154 https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=LCG-ROLLOUT;bad59b20.1604
echo 'nodes=0,1' > /var/lib/torque/mom_priv/mom.layout
Problem Solved
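Since the same layout file is needed on every compute node, the echo above can be wrapped so that it is safe to rerun. This is only a sketch: ensure_mom_layout is a made-up name, and the nodes=0,1 line is taken from the command above as-is, not checked against each node's hardware.

```shell
# Create DIR/mom.layout containing the single line used above,
# unless a layout file is already present (existing files are left alone).
# ensure_mom_layout is a hypothetical helper, not a TORQUE command.
ensure_mom_layout() {
  layout="$1/mom.layout"
  [ -e "$layout" ] || echo 'nodes=0,1' > "$layout"
}

# Possible use on a compute node, as root:
#   ensure_mom_layout /var/lib/torque/mom_priv
```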
Below are logs recording the output of the service pbs_mom restart and pbsnodes commands.
The TSCE daemon is still running on some of the nodes, so pbs_mom failed to restart there.
TSCE needs to be killed before restarting pbs_mom.
[zcao@mu01 ~]$ sudo pdsh -w cu[01-44] 'service pbs_mom restart'
cu02: Shutting down TORQUE Mom: pbs_mom already stopped[ OK ]
cu02: server port = 15002, errno = 98 (Address already in use), already in use
cu02: Starting TORQUE Mom: [FAILED]
pdsh@mu01: cu02: ssh exited with exit code 1
# Same for 3-22, 24-37, 39-44
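errno 98 on port 15002 means something else is already bound to the pbs_mom service port; the notes above suspect the TSCE daemon. To confirm which process it is on a given node, the netstat -tlnp output can be filtered as in this sketch (port_owner is a made-up helper name):

```shell
# Read `netstat -tlnp` output on stdin and print the PID/program column
# for every TCP socket listening on port $1.
# port_owner is a hypothetical helper, not a standard tool.
port_owner() {
  awk -v port="$1" '$1 ~ /^tcp/ && $4 ~ (":" port "$") { print $7 }'
}

# Possible use on a compute node, as root so that -p can show the program:
#   netstat -tlnp | port_owner 15002
```

The PID it prints is the process to stop before running service pbs_mom restart again.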
[zcao@mu01 ~]$ pbsnodes
cu01-0
state = free
np = 24
ntype = cluster
status = rectime=1534840541,varattr=,jobs=,state=free,netload=? 0,gres=,loadave=0.00,ncpus=24,physmem=134059420kb,availmem=130532752kb,totmem=134059420kb,idletime=19885,nusers=0,nsessions=0,uname=Linux cu01 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
cu02-0
state = down
np = 24
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
# cu[03-22]
# state = down
# np = 24
# ntype = cluster
# mom_service_port = 15002
# mom_manager_port = 15003
cu23-0
state = free
np = 24
ntype = cluster
status = rectime=1534840540,varattr=,jobs=,state=free,netload=? 0,gres=,loadave=0.00,ncpus=24,physmem=134059420kb,availmem=130467624kb,totmem=134059420kb,idletime=9489312,nusers=0,nsessions=0,uname=Linux cu23 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
# cu[24-37]
# state = down
# np = 24
# ntype = cluster
# mom_service_port = 15002
# mom_manager_port = 15003
cu38-0
state = free
np = 24
ntype = cluster
status = rectime=1534840542,varattr=,jobs=,state=free,netload=? 0,gres=,loadave=0.00,ncpus=24,physmem=268277148kb,availmem=254353684kb,totmem=268277148kb,idletime=9489290,nusers=0,nsessions=0,uname=Linux cu38 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
# cu[39-44]
# state = down
# np = 24 (48 for cu39)
# ntype = cluster
# mom_service_port = 15002
# mom_manager_port = 15003
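To avoid reading the full listing by eye, the down nodes can be pulled out of the pbsnodes output with a small filter. A sketch, assuming the output layout shown above (a node name on its own line, followed by its attribute lines); down_nodes is a made-up helper name:

```shell
# Read full `pbsnodes` output on stdin and print the name of every node
# whose state contains "down".
# down_nodes is a hypothetical helper, not a TORQUE command.
down_nodes() {
  awk '
    /^[[:space:]]*state = / { if ($3 ~ /down/) print node; next }
    NF == 1 { node = $1 }   # a line with a single word is a node name
  '
}

# Possible use on the head node:
#   pbsnodes | down_nodes
```

pbsnodes -l should give a similar list directly; the filter is mainly useful when the full output has already been saved to a log.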
It seems the anaconda3 installer cleans up the install directory if one is specified with -p /opt
List of TODOs
Installs
Configs
Items to check
Log of the incident