stackhpc / ansible-slurm-appliance

A Slurm-based HPC workload management environment, driven by Ansible.

Use RL9 for caas environment #380

Closed sjpb closed 3 months ago

sjpb commented 3 months ago

Notes:

sjpb commented 3 months ago

Checks in Azimuth @ 05c29ce, non-manila cluster using image openhpc-RL9-240313-1057-15f9ab38

sjpb commented 3 months ago

Tested an upgrade from RL8 to RL9; it worked fine:

  1. At 4ec5332 created RL8 cluster in Azimuth with manila project/home and hpctests ON:

    # is RL8:
    [azimuth@slurm-v7-login-0 ~]$ cat /etc/redhat-release 
    Rocky Linux release 8.9 (Green Obsidian)
    # is OHPCv2:
    [azimuth@slurm-v7-login-0 ~]$ grep baseurl /etc/yum.repos.d/OpenHPC.repo 
    baseurl = http://repos.openhpc.community/OpenHPC/2/CentOS_8
    baseurl = http://repos.openhpc.community/OpenHPC/2/updates/CentOS_8
    # uses manila:
    [azimuth@slurm-v7-login-0 ~]$ findmnt -t ceph -o TARGET,FSTYPE
    TARGET   FSTYPE
    /home    ceph
    /project ceph
    # show ohpc modules, ignoring unspecific
    [azimuth@slurm-v7-login-0 ~]$ module --terse spider | grep -v '/$'
    boost/1.81.0
    dimemas/5.4.2
    extrae/3.8.3
    gnu12/12.3.0
    hwloc/2.7.2
    imb/2021.3
    libfabric/1.19.0
    likwid/5.2.2
    omb/6.1
    openblas/0.3.21
    openmpi4/4.1.6
    os
    papi/6.0.0
    pdtoolkit/3.25.1
    prun/2.2
    scalasca/2.5
    scorep/7.1
    sionlib/1.7.7
    tau/2.31.1
    ucx/1.15.0
    [azimuth@slurm-v7-login-0 ~]$ module load gnu12 openmpi4
    [azimuth@slurm-v7-login-0 ~]$ gcc --version
    gcc (GCC) 12.3.0
    Copyright (C) 2022 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    [azimuth@slurm-v7-login-0 ~]$ mpirun --version
    mpirun (Open MPI) 4.1.6
    ...
  2. Patched it to RL9. Hit OOD sshkeys problem. Patched to 963f641, solved that problem. Checks:

    
    # is RL9:
    [azimuth@slurm-v7-login-0 ~]$ cat /etc/redhat-release 
    Rocky Linux release 9.3 (Blue Onyx)
    [azimuth@slurm-v7-login-0 ~]$ srun -N2 cat /etc/redhat-release
    Rocky Linux release 9.3 (Blue Onyx)
    Rocky Linux release 9.3 (Blue Onyx)

    # is OHPC v3:
    [azimuth@slurm-v7-login-0 ~]$ grep baseurl /etc/yum.repos.d/OpenHPC.repo
    baseurl = http://repos.openhpc.community/OpenHPC/3/EL_9
    baseurl = http://repos.openhpc.community/OpenHPC/3/updates/EL_9
    # uses ceph:
    [azimuth@slurm-v7-login-0 ~]$ findmnt -t ceph -o TARGET,FSTYPE
    TARGET   FSTYPE
    /home    ceph
    /project ceph
    # check modules
    [azimuth@slurm-v7-login-0 ~]$ module --terse spider | grep -v '/$'
    boost/1.81.0
    dimemas/5.4.2
    extrae/3.8.3
    gnu12/12.2.0
    hwloc/2.9.0
    imb/2021.3
    libfabric/1.18.0
    likwid/5.2.2
    omb/6.1
    openblas/0.3.21
    openmpi4/4.1.5
    os
    papi/6.0.0
    pdtoolkit/3.25.1
    pmix/4.2.6
    prun/2.2
    scalasca/2.5
    scorep/7.1
    sionlib/1.7.7
    tau/2.31.1
    ucx/1.14.0
    [azimuth@slurm-v7-login-0 ~]$ module load gnu12 openmpi4
    [azimuth@slurm-v7-login-0 ~]$ gcc --version
    gcc (GCC) 12.2.0
    Copyright (C) 2022 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    [azimuth@slurm-v7-login-0 ~]$ mpirun --version
    mpirun (Open MPI) 4.1.5
    ...
    [azimuth@slurm-v7-login-0 ~]$ slurmctld -V
    slurm 22.05.11
    [azimuth@slurm-v7-login-0 ~]$ slurmd -V
    slurm 22.05.11
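The RL8 and RL9 module listings differ in a few versions (gnu12, hwloc, libfabric, openmpi4, ucx) and the RL9 listing additionally shows pmix. As a rough sketch of how that comparison could be made mechanical, assuming each `module --terse spider | grep -v '/$'` listing has been saved to a file with one module per line; the function name `diff_modules` is hypothetical and not part of the appliance:

```shell
# Hypothetical helper (not part of the appliance): compare two saved module
# listings, one "name/version" entry per line.
# Prints "- entry" for entries only in the first file and
# "+ entry" for entries only in the second file.
diff_modules() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -Fxv -f "$2" "$1" | sed 's/^/- /'
    grep -Fxv -f "$1" "$2" | sed 's/^/+ /'
}
```

It could then be called as `diff_modules rl8-modules.txt rl9-modules.txt`, where both file names are placeholders for wherever the listings were saved.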



Also checked that the /home/hpctests/pingpong directory (including the xhpl binary) from an RL8 cluster worked when copied onto the RL9 cluster:
- ldd showed the binary linked OK
- it ran without errors
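
The ldd check above was done by eye. A minimal sketch of how it could be scripted, assuming the usual glibc `ldd` output format in which an unresolved library is printed as `name => not found`; the helper name `missing_libs` is hypothetical:

```shell
# Hypothetical helper: read `ldd` output on stdin and print the name of each
# shared library that the dynamic loader could not resolve, one per line.
# Prints nothing when every library resolved.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

For example, `ldd path/to/binary | missing_libs` would print nothing for a binary whose libraries all resolve on the RL9 nodes; the path is a placeholder.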

sjpb commented 3 months ago

Checked that, after an upgrade from RL8 to RL9, previously-run jobs (and new jobs) are shown in the dashboard. Checked that the OOD desktop, shell and Jupyter apps work.