sjpb closed 3 months ago
Checks in Azimuth @ 05c29ce, non-manila cluster using image openhpc-RL9-240313-1057-15f9ab38
hpctests: OK
syslogs: OK
[root@rl9-v4-control-0 rocky]# grep -rF "cgroupv2 manager" /var/log/messages
[root@rl9-v4-control-0 rocky]#
OOD shell: OK
Monitoring: OK
OOD desktop: OK
OOD jupyter: OK
Tested an upgrade from RL8 to RL9, which worked fine:
At 4ec5332 created RL8 cluster in Azimuth with manila project/home and hpctests ON:
# is RL8:
[azimuth@slurm-v7-login-0 ~]$ cat /etc/redhat-release
Rocky Linux release 8.9 (Green Obsidian)
# is OHPCv2:
[azimuth@slurm-v7-login-0 ~]$ grep baseurl /etc/yum.repos.d/OpenHPC.repo
baseurl = http://repos.openhpc.community/OpenHPC/2/CentOS_8
baseurl = http://repos.openhpc.community/OpenHPC/2/updates/CentOS_8
# uses manila:
[azimuth@slurm-v7-login-0 ~]$ findmnt -t ceph -o TARGET,FSTYPE
TARGET FSTYPE
/home ceph
/project ceph
# show ohpc modules, ignoring unspecific
[azimuth@slurm-v7-login-0 ~]$ module --terse spider | grep -v '/$'
boost/1.81.0
dimemas/5.4.2
extrae/3.8.3
gnu12/12.3.0
hwloc/2.7.2
imb/2021.3
libfabric/1.19.0
likwid/5.2.2
omb/6.1
openblas/0.3.21
openmpi4/4.1.6
os
papi/6.0.0
pdtoolkit/3.25.1
prun/2.2
scalasca/2.5
scorep/7.1
sionlib/1.7.7
tau/2.31.1
ucx/1.15.0
[azimuth@slurm-v7-login-0 ~]$ module load gnu12 openmpi4
[azimuth@slurm-v7-login-0 ~]$ gcc --version
gcc (GCC) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[azimuth@slurm-v7-login-0 ~]$ mpirun --version
mpirun (Open MPI) 4.1.6
...
Patched it to RL9 and hit an OOD sshkeys problem. Patching to 963f641 solved that problem. Checks:
# is RL9:
[azimuth@slurm-v7-login-0 ~]$ cat /etc/redhat-release
Rocky Linux release 9.3 (Blue Onyx)
[azimuth@slurm-v7-login-0 ~]$ srun -N2 cat /etc/redhat-release
Rocky Linux release 9.3 (Blue Onyx)
Rocky Linux release 9.3 (Blue Onyx)
[azimuth@slurm-v7-login-0 ~]$ grep baseurl /etc/yum.repos.d/OpenHPC.repo
baseurl = http://repos.openhpc.community/OpenHPC/3/EL_9
baseurl = http://repos.openhpc.community/OpenHPC/3/updates/EL_9
[azimuth@slurm-v7-login-0 ~]$ findmnt -t ceph -o TARGET,FSTYPE
TARGET FSTYPE
/home ceph
/project ceph
[azimuth@slurm-v7-login-0 ~]$ module --terse spider | grep -v '/$'
boost/1.81.0
dimemas/5.4.2
extrae/3.8.3
gnu12/12.2.0
hwloc/2.9.0
imb/2021.3
libfabric/1.18.0
likwid/5.2.2
omb/6.1
openblas/0.3.21
openmpi4/4.1.5
os
papi/6.0.0
pdtoolkit/3.25.1
pmix/4.2.6
prun/2.2
scalasca/2.5
scorep/7.1
sionlib/1.7.7
tau/2.31.1
ucx/1.14.0
[azimuth@slurm-v7-login-0 ~]$ module load gnu12 openmpi4
[azimuth@slurm-v7-login-0 ~]$ gcc --version
gcc (GCC) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[azimuth@slurm-v7-login-0 ~]$ mpirun --version
mpirun (Open MPI) 4.1.5
...
[azimuth@slurm-v7-login-0 ~]$ slurmctld -V
slurm 22.05.11
[azimuth@slurm-v7-login-0 ~]$ slurmd -V
slurm 22.05.11
Also checked that the /home/hpctests/pingpong directory (including the xhpl binary) from an RL8 cluster worked when copied onto the RL9 cluster:
- ldd showed binary linked OK
- ran without errors
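The ldd check above could be scripted roughly as follows. This is only a sketch: the helper name `links_ok` is made up for illustration, and the exact location of the xhpl binary inside /home/hpctests/pingpong is an assumption, so pass the real path as the first argument.

```shell
#!/bin/sh
# links_ok is a hypothetical helper: it succeeds when ldd reports no
# unresolved ("not found") shared libraries for the given binary.
links_ok() {
    [ -f "$1" ] || return 2          # no such file: cannot check
    ! ldd "$1" 2>&1 | grep -q 'not found'
}

# Assumed location of the binary copied over from the RL8 cluster.
binary=${1:-/home/hpctests/pingpong/xhpl}

if links_ok "$binary"; then
    echo "$binary: all shared libraries resolved"
else
    echo "$binary: missing file or unresolved libraries" >&2
fi
```

A clean ldd result only shows that the libraries resolve on RL9; the binary still has to be run, as was done here, to confirm it works.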
Checked that, after the upgrade from RL8 to RL9, previously-run jobs (and new jobs) are shown in the dashboard. Checked that OOD desktop, shell and jupyter work.
caas environment.
persist_hostkeys role now enable-able in any environment.
caas environment where the OpenOndemand shell aborted after a patch which reimaged the nodes due to hostkeys changing.
Notes:
Slurm version is unchanged at 22.05.11.
Some of the (default) openhpc package installs which are available via lmod in caas have changed between OpenHPC v2 and v3:
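The changed modules can be derived from the two `module --terse spider` listings captured earlier. The sketch below copies those listings by hand (RL8/OpenHPC v2 and RL9/OpenHPC v3) and compares them with comm; the function names are illustrative only.

```shell
#!/bin/sh
# Module lists copied from the RL8 and RL9 `module --terse spider` output above.
rl8_modules="boost/1.81.0 dimemas/5.4.2 extrae/3.8.3 gnu12/12.3.0 hwloc/2.7.2
  imb/2021.3 libfabric/1.19.0 likwid/5.2.2 omb/6.1 openblas/0.3.21
  openmpi4/4.1.6 os papi/6.0.0 pdtoolkit/3.25.1 prun/2.2 scalasca/2.5
  scorep/7.1 sionlib/1.7.7 tau/2.31.1 ucx/1.15.0"
rl9_modules="boost/1.81.0 dimemas/5.4.2 extrae/3.8.3 gnu12/12.2.0 hwloc/2.9.0
  imb/2021.3 libfabric/1.18.0 likwid/5.2.2 omb/6.1 openblas/0.3.21
  openmpi4/4.1.5 os papi/6.0.0 pdtoolkit/3.25.1 pmix/4.2.6 prun/2.2
  scalasca/2.5 scorep/7.1 sionlib/1.7.7 tau/2.31.1 ucx/1.14.0"

# Print one module per line, sorted so that comm(1) can compare the lists.
# $1 is deliberately unquoted so the shell splits it on whitespace.
as_sorted_lines() {
    printf '%s\n' $1 | LC_ALL=C sort
}

# Print the entries that appear in the first list but not in the second.
only_in_first() {
    first_file=$(mktemp)
    second_file=$(mktemp)
    as_sorted_lines "$1" > "$first_file"
    as_sorted_lines "$2" > "$second_file"
    LC_ALL=C comm -23 "$first_file" "$second_file"
    rm -f "$first_file" "$second_file"
}

echo "Only on RL8 / OpenHPC v2:"
only_in_first "$rl8_modules" "$rl9_modules"
echo "Only on RL9 / OpenHPC v3:"
only_in_first "$rl9_modules" "$rl8_modules"
```

On a live cluster the two lists could instead be captured directly from `module --terse spider | grep -v '/$'` before and after the upgrade.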
As OHPCv3 provides pmix and builds openmpi against it, the srun launcher can now be used again (early OpenHPC v2.x could use it with pmi2, also see https://github.com/stackhpc/ansible-slurm-appliance/issues/190):
At this time hpctests has not been modified to make use of this.
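The launch command itself is not shown here, so what follows is a sketch of what an srun launch might look like, not the command actually used. It assumes srun's standard `--mpi` option, a hypothetical MPI binary `./mpi_hello` built with the gnu12 and openmpi4 modules loaded, and the same two-node allocation as the `srun -N2` check above. The helper `lists_pmix` is made up for illustration.

```shell
#!/bin/sh
# lists_pmix is a hypothetical helper: it succeeds if the text on stdin
# (for example the output of `srun --mpi=list`) mentions a pmix plugin.
lists_pmix() {
    grep -q 'pmix'
}

# Hypothetical MPI binary; replace with a real one.
mpi_binary=./mpi_hello

if command -v srun >/dev/null 2>&1 && srun --mpi=list 2>&1 | lists_pmix; then
    # Launch directly with srun instead of mpirun.
    srun -N2 --mpi=pmix "$mpi_binary"
else
    echo "srun with pmix support is not available on this host" >&2
fi
```

The guard keeps the script harmless on hosts without Slurm or without the pmix plugin; whether `--mpi=pmix` has to be passed explicitly depends on the cluster's Slurm configuration.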