rocky-linux / rocky-tools

MIT License

migration failure with MLNX_OFED_LINUX 4.9 (LTS) installed #158

Open tcooper opened 2 years ago

tcooper commented 2 years ago

System is installed with the following which are related to this issue:

This is the second migration attempt on the secondary head node of a development system, made after a re-image that followed an initial failed migration with manual resolution.

The previous migration was eventually completed, and the system ran Rocky 8.5 with BCM 9.0-17 and MLNX_OFED_LINUX 4.9 (LTS) without issues. In an attempt to confirm that all issues were resolved, the system was restored to its pre-migration state, the previous issues (extra installed kernels and remaining rhel8u0 kmods with no matching kernel and missing deps) were resolved, and the migration was re-attempted.

It seems possible that adding --setopt=<reponame>.excludepkgs= options would resolve this (I may be able to investigate), and that could be supported in a future version of migrate2rocky.sh.
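As a rough sketch of what such options could look like, the helper below prints one --setopt=<reponame>.excludepkgs=... argument per repository. The function name build_setopt_excludes, the repo ids and the package list are illustrative only; none of this is part of migrate2rocky.sh, and the helper only prints arguments rather than running dnf.

```shell
# Hypothetical helper (not part of migrate2rocky.sh).
# $1 = comma-separated package list, remaining args = repo ids.
# Prints one --setopt argument per repo id, one per line.
build_setopt_excludes() {
    bse_pkgs=$1
    shift
    for bse_repo in "$@"; do
        printf '%s\n' "--setopt=${bse_repo}.excludepkgs=${bse_pkgs}"
    done
}

# Example use (shown as a comment so nothing is executed here):
#   dnf distro-sync $(build_setopt_excludes 'rdma-core,libibverbs' baseos appstream)
```

Passing the excludes on the command line would leave the repo files and /etc/dnf/dnf.conf untouched, which may make a failed attempt easier to retry.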

Configuring dnf via /etc/dnf/dnf.conf to exclude the MLNX_OFED_LINUX packages may also resolve this (for example)...

# dnf check -v
Loaded plugins: builddep, changelog, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, groups-manager, needs-restarting, playground, repoclosure, repodiff, repograph, repomanage, reposync
DNF version: 4.7.0
cachedir: /var/cache/dnf
User-Agent: constructed: 'libdnf (CentOS Linux 8; generic; Linux.x86_64)'
Excludes in dnf.conf: tog-pegasus-devel mpi-selector.x86_64 mlnx-ofa_kernel kmod-mlnx-ofa_kernel mlnx-ofa_kernel-devel kmod-kernel-mft-mlnx knem kmod-knem ofed-scripts rdma-core libibverbs librdmacm libibumad infiniband-diags rdma-core-devel libibverbs-utils ibsim ibacm librdmacm-utils opensm-libs opensm opensm-devel opensm-static dapl dapl-devel dapl-devel-static dapl-utils perftest mstflint mft srp_daemon ibutils2 dump_pr ar_mgr qperf ucx ucx-devel sharp ucx-cma ucx-ib ucx-rdmacm ucx-knem hcoll openmpi mlnx-ethtool mlnx-iproute2 mlnxofed-docs libmthca-static compat-dapl-static compat-dapl-static-1.2.5 dapl-static libibverbs-rocee libibverbs-rocee-devel libibverbs-rocee-devel-static

...and this will be attempted again via manual resolution with existing Rocky repository configuration in place...

(DEV - 3HZVN23 - PASSIVE) [root@devmgr2 : migrate2rocky]# dnf repolist
repo id                                  repo name
appstream                                Rocky Linux 8 - AppStream
baseos                                   Rocky Linux 8 - BaseOS
devel                                    Rocky Linux 8 - Devel WARNING! FOR BUILDROOT AND KOJI USE
epel                                     Extra Packages for Enterprise Linux 8 - x86_64
epel-modular                             Extra Packages for Enterprise Linux Modular 8 - x86_64
extras                                   Rocky Linux 8 - Extras
powertools                               Rocky Linux 8 - PowerTools

Any additional recommendations for resolving manually would be appreciated.

migrate2rocky.log

tcooper commented 2 years ago

@pajamian With packages excluded...

# dnf -v check
Loaded plugins: builddep, changelog, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, groups-manager, needs-restarting, playground, repoclosure, repodiff, repograph, repomanage, reposync
DNF version: 4.7.0
cachedir: /var/cache/dnf
User-Agent: constructed: 'libdnf (Rocky Linux 8.5; generic; Linux.x86_64)'
Excludes in dnf.conf: ar_mgr, cm-docker, cm-etcd, cm-kubernetes118, compat-dapl-static, compat-dapl-static-1.2.5, dapl, dapl-devel, dapl-devel-static, dapl-static, dapl-utils, dump_pr, hcoll, ibacm, ibsim, ibutils2, infiniband-diags, kmod-kernel-mft-mlnx, kmod-knem, kmod-mlnx-ofa_kernel, knem, libibumad, libibverbs, libibverbs-rocee, libibverbs-rocee-devel, libibverbs-rocee-devel-static, libibverbs-utils, libmthca-static, librdmacm, librdmacm-utils, mft, mlnx-ethtool, mlnx-iproute2, mlnx-ofa_kernel, mlnx-ofa_kernel-devel, mlnxofed-docs, mpi-selector.x86_64, mstflint, ofed-scripts, openmpi, opensm, opensm-devel, opensm-libs, opensm-static, perftest, qperf, rdma-core, rdma-core-devel, sdsc_gsi-openssh, sdsc_gsi-openssh-clients, sdsc_gsi-openssh-server, sharp, slurm20*, srp_daemon, tog-pegasus, tog-pegasus-devel, ucx, ucx-cma, ucx-devel, ucx-ib, ucx-knem, ucx-rdmacm

...it appears manual dnf distro-sync may be possible...

# dnf -y distro-sync 2>&1 | tee manual-dnf-distro-sync.log
Last metadata expiration check: 0:48:31 ago on Thu 10 Feb 2022 10:59:40 AM PST.
Dependencies resolved.
===============================================================================================================================
 Package                                  Arch       Version                                              Repository      Size
===============================================================================================================================
Installing:
 kernel                                   x86_64     4.18.0-348.12.2.el8_5                                baseos         7.0 M
 kernel-core                              x86_64     4.18.0-348.12.2.el8_5                                baseos          38 M
 kernel-devel                             x86_64     4.18.0-348.12.2.el8_5                                baseos          20 M
 kernel-modules                           x86_64     4.18.0-348.12.2.el8_5                                baseos          30 M
Upgrading:
 acl                                      x86_64     2.2.53-1.el8.1                                       baseos          80 k
 apr-util                                 x86_64     1.6.1-6.el8.1                                        appstream      104 k

 ...<snip>...

  zlib-devel-1.2.11-17.el8.x86_64
  zsh-5.5.1-6.el8_1.2.x86_64
  zstd-1.4.4-1.el8.x86_64

Complete!

With this knowledge it's likely the next migration will go a fair bit more smoothly, if not automatically.

Thanks for all the work you've put into migrate2rocky.sh.

tcooper commented 2 years ago

@pajamian Final update for this issue report.

Migration went more smoothly on the primary headnode of this system once many/most non-{CentOS|Fedora} packages were excluded from the migration via an exclude=... entry in /etc/dnf/dnf.conf.

The excluded packages were all installed locally or from alternate repositories that migrate2rocky.sh cannot map/manage.

After migration and reboot the repositories were re-enabled, the exclude=... entry restored to the less restrictive system default and any other packages needing update were updated (there were none in the final instance).
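A minimal sketch of that save/restore step, assuming the exclude= entry sits in a [main] section that is the last (or only) section of the file and is written without spaces around the equals sign. The function names and the .pre-migration suffix are hypothetical, and the config path is passed in so the helpers can be tried on a copy first. Note that this sketch replaces the existing entry rather than appending to it.

```shell
# Hypothetical helpers; try them on a copy of dnf.conf first.

# Save the current config, then replace any exclude= line with a
# temporary one built from the remaining arguments.
set_migration_excludes() {
    conf=$1
    shift
    cp -p "$conf" "$conf.pre-migration"
    # Drop existing exclude= lines, then append the temporary entry.
    grep -v '^exclude=' "$conf.pre-migration" > "$conf"
    printf 'exclude=%s\n' "$*" >> "$conf"
}

# Put the saved config (with the original exclude= entry) back.
restore_excludes() {
    conf=$1
    mv "$conf.pre-migration" "$conf"
}

# Example use (comments only):
#   set_migration_excludes /etc/dnf/dnf.conf rdma-core libibverbs
#   ...migrate and reboot...
#   restore_excludes /etc/dnf/dnf.conf
```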

The same general sequence was used to migrate chroots running via systemd-nspawn, although those were even easier because migrate2rocky.sh now handled vault changes for CentOS, and clones of chroot environments could be safely migrated and, if a failure occurred, thrown away.

The sequence of commands to build up this list of excludable packages was similar for the physical host and chroot and was...

chroot example

# ./migrate2rocky.sh -V

# ls -l /root/convert
total 1532
-rw-r--r-- 1 root root  101541 Feb 11 14:05 node-installer-rpm-list-begin.log
-rw-r--r-- 1 root root 1464834 Feb 11 14:05 node-installer-rpm-list-verified-begin.log

# echo "exclude=$(grep -Ev "centos|fedora" /root/convert/node-installer-rpm-list-begin.log | grep -v gpg-pubkey | column -s\| -t | awk '{print $1}' | tr '\n' ' ')" >> /etc/dnf/dnf.conf

# dnf check -v
Loaded plugins: builddep, changelog, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, groups-manager, needs-restarting, playground, repoclosure, repodiff, repograph, repomanage, reposync
DNF version: 4.7.0
cachedir: /var/cache/dnf
User-Agent: constructed: 'libdnf (CentOS Linux 8; generic; Linux.x86_64)'
Excludes in dnf.conf: MegaCli ar_mgr dapl-devel-static dapl-devel dapl-utils dapl dump_pr elrepo-release hcoll ibacm ibsim ibutils2 infiniband-diags kmod-bnxt_en kmod-elx-lpfc kmod-isert kmod-iser kmod-kernel-mft-mlnx kmod-knem kmod-megaraid_sas kmod-mlnx-ofa_kernel kmod-rshim kmod-srp knem libibumad libibverbs-utils libibverbs librdmacm-utils librdmacm lustre-client-dkms lustre-client mft mlnx-ethtool mlnx-fw-updater mlnx-iproute2 mlnx-ofa_kernel-devel mlnx-ofa_kernel mlnxofed-docs mpi-selector mstflint ofed-scripts openmpi opensm-devel opensm-libs opensm-static opensm perftest qperf rdma-core-devel rdma-core sharp srp_daemon srvadmin-argtable2 srvadmin-hapi srvadmin-idracadm7 telegraf ucx-cma ucx-devel ucx-ib ucx-knem ucx-rdmacm ucx

# ./migrate2rocky.sh -V -r
migrate2rocky - Begin logging at Fri Feb 11 14:15:00 2022.

Creating a list of RPMs installed: begin
Verifying RPMs installed against RPM database: begin

Removing dnf cache
Preparing to migrate CentOS Linux 8 to Rocky Linux 8.

Error: Failed to download metadata for repo 'appstream': Cannot prepare internal mirrorlist: No URLs in mirrorlist
Baseurl for appstream is invalid, setting to https://dl.rockylinux.org/vault/centos/8.5.2111/AppStream/x86_64/os/.
Error: Failed to download metadata for repo 'baseos': Cannot prepare internal mirrorlist: No URLs in mirrorlist
Baseurl for baseos is invalid, setting to https://dl.rockylinux.org/vault/centos/8.5.2111/BaseOS/x86_64/os/.
Determining repository names for CentOS Linux 8......

Found the following repositories which map from CentOS Linux 8 to Rocky Linux 8:
CentOS Linux 8  Rocky Linux 8
appstream       appstream
baseos          baseos
extras          extras

...<snip>...

xkeyboard-config-2.28-1.el8.noarch
zip-3.0-23.el8.x86_64
zlib-1.2.11-17.el8.x86_64
Removed:
kernel-4.18.0-147.el8.x86_64             kernel-core-4.18.0-147.el8.x86_64
kernel-modules-4.18.0-147.el8.x86_64

Complete!
Creating a list of RPMs installed: finish
Verifying RPMs installed against RPM database: finish

You may review the following files:
/root/convert/node-installer-rpm-list-begin.log
/root/convert/node-installer-rpm-list-finish.log
/root/convert/node-installer-rpm-list-verified-begin.log
/root/convert/node-installer-rpm-list-verified-finish.log

Done, please reboot your system.
A log of this installation can be found at /var/log/migrate2rocky.log
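
The exclude-list step in the sequence above can be wrapped in a small function, sketched here under an assumption taken from the pipeline itself: each line of the rpm list is '|'-delimited with the package name in the first field, and distro packages mention "centos" or "fedora" somewhere on the line. The function name non_distro_packages is hypothetical.

```shell
# Hypothetical wrapper around the exclude-list pipeline.
# Assumption: '|'-delimited lines, package name in the first field.
# $1 = path to the rpm list written by migrate2rocky.sh -V.
non_distro_packages() {
    grep -Ev 'centos|fedora' "$1" |
        grep -v '^gpg-pubkey' |
        awk -F'|' '{ gsub(/^[ \t]+|[ \t]+$/, "", $1); if ($1 != "") print $1 }' |
        sort -u |
        tr '\n' ' ' |
        sed 's/ $//'
}

# Example use (comment only):
#   echo "exclude=$(non_distro_packages /root/convert/node-installer-rpm-list-begin.log)" >> /etc/dnf/dnf.conf
```

Sorting with -u removes duplicate names, which appear when several versions of a package (kernels, for instance) are installed.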

In my experience with multiple migration attempts on these systems, some packages could be safely removed and no longer needed to be excluded explicitly, but others could not. Still others would only trigger a failure during transaction processing of dnf distro-sync ... and, currently, migrate2rocky.sh cannot be re-run if you fail at this stage.

I'm not convinced this sequence should be generalized and added explicitly to migrate2rocky.sh but perhaps it's possible something like this could be done if/when only the -V option is specified.

This might be used to alert the user to the list of packages that could break dnf distro-sync ... and to suggest they do more research before attempting migrate2rocky.sh -r when it's possible (likely?) that it will fail anyway.
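
One way such an alert could be produced, as a sketch only: query the RPM database for name and vendor, then list every package whose vendor does not match the distro. The function name third_party_packages and the vendor pattern are assumptions; vendor strings vary between builds and should be checked on the system in question.

```shell
# Hypothetical filter: reads "name|vendor" lines on stdin and prints the
# names whose vendor does not match the extended regex given in $1.
third_party_packages() {
    awk -F'|' -v distro="$1" \
        '$2 !~ distro && $1 != "gpg-pubkey" { print $1 }' |
        sort -u
}

# Example use (comment only; the vendor pattern is an assumption):
#   rpm -qa --qf '%{NAME}|%{VENDOR}\n' | third_party_packages 'CentOS|Rocky'
```

The output is only a list of candidates for review; as noted above, some of these packages can simply be removed while others need an explicit exclusion.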

In short, if you feel there is anything useful in this issue that can be added to migrate2rocky.sh then by all means let's do it. I am happy to provide more details if you need.

Otherwise, it'll be fine to close this issue and perhaps keep it in mind if others show up with similar problems. Clever folks will search the closed issues for hints and maybe stumble on this potential solution without any additional help.

Thanks again for all the work on migrate2rocky.sh.

pajamian commented 2 years ago

Well, I think running dnf check ahead of time and checking the result will help. It also makes me think that package exclusions should be copied over from the source repos to their Rocky Linux equivalents. Currently, if there are exclude= lines in, say, appstream, that repo gets replaced by the Rocky Linux appstream and the exclusions are lost. That could make the difference between a failing and a passing migration at the distro-sync stage.
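
A minimal sketch of how exclusions might be carried over, assuming single-line exclude= or excludepkgs= entries in a standard .repo file. Both function names are hypothetical and this is not how migrate2rocky.sh currently works; the second helper prints the modified file to stdout instead of editing it in place.

```shell
# Hypothetical helpers for carrying per-repo exclusions across.

# Print the exclude=/excludepkgs= value found in section [$2] of file $1.
repo_excludes() {
    awk -v want="[$2]" '
        /^\[/ { in_sec = ($0 == want) }
        in_sec && /^(exclude|excludepkgs)[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")
            print
        }
    ' "$1"
}

# Print repo file $1 with "excludepkgs=$3" added right after the [$2] header.
with_excludes() {
    awk -v want="[$2]" -v pkgs="$3" '
        { print }
        $0 == want && pkgs != "" { print "excludepkgs=" pkgs }
    ' "$1"
}

# Example use (comments only; file names are placeholders):
#   ex=$(repo_excludes old-appstream.repo appstream)
#   with_excludes new-appstream.repo appstream "$ex" > new-appstream.repo.tmp
```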

tcooper commented 2 years ago

Copying existing per-repository exclusions does sound like a good addition to migrate2rocky.sh and has the potential to prevent otherwise avoidable dnf distro-sync failures.