stackhpc / ansible-slurm-appliance

A Slurm-based HPC workload management environment, driven by Ansible.

LTS OFED 23.10 doesn't install rdma-core-devel from MOFED repos #461

Open sjpb opened 1 month ago

sjpb commented 1 month ago

Older builds used OFED 24.04. This included the rdma-core-devel package from Mellanox.

https://github.com/stackhpc/ansible-slurm-appliance/pull/427 changed OFED to the LTS version 23.10, now that it is supported for RL9. However, this install uses rdma-core-devel from appstream, which doesn't feel right:

[rocky@rl9-login-0 ~]$ cat /var/lib/image/image.json 
{
    "branch": "fix/packer-sentinel-file",
    "build": "openhpc-rl9-241022-0038-a5affa58",
    "cuda": "-",
    "kernel": "5.14.0-427.40.1.el9_4.x86_64",
    "ofed": "23.10",
    "os": "Rocky 9.4",
    "slurm-ohpc": "23.11.6"
}
[root@rl9-login-0 rocky]# dnf list --installed rdma*
Installed Packages
rdma-core.x86_64          2307mlnx47-1.2310322    @System
rdma-core-devel.x86_64    48.0-1.el9              @appstream

Furthermore, adding the undocumented OFED repos for 23.10 shows there is a Mellanox rdma-core-devel package :-(

sjpb commented 1 month ago

Note that on the client build, installing lustre via something similar to #447 removed the rdma-core-devel package entirely.

sjpb commented 1 month ago

So it turns out that our "nightly" build, which installs OFED, does install the Mellanox rdma-core-devel package, but during the fatimage build, installing the OHPC packages replaces it with the @appstream one.
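
A rough way to catch this kind of replacement after a build is to filter the `dnf list --installed` output for rdma packages that do not look like Mellanox builds. This is only a sketch: the helper name `flag_non_mlnx` is made up for illustration, and it assumes that Mellanox-built packages carry `mlnx` in their version string, as `rdma-core` does in the output above. That may not hold for every MOFED package or release.

```shell
# Hypothetical helper (not part of the appliance): reads `dnf list --installed`
# output on stdin and prints each rdma* package whose version string does not
# contain "mlnx". ASSUMPTION: Mellanox builds are identifiable by "mlnx" in
# the version, as seen for rdma-core in this issue.
flag_non_mlnx() {
    awk '$1 ~ /^rdma/ && $2 !~ /mlnx/ { print $1, $2, $3 }'
}

# Usage on a built node (not run here):
#   dnf list --installed 'rdma*' | flag_non_mlnx
```

Any line printed would be a candidate for a package that has been swapped for a distro one, which could then be checked by hand.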