microsoft / azure-linux-kernel

Patches for building an Azure-tuned Linux kernel.

Cannot install Mellanox OFED driver with 4.15.0-1041-azure kernel #28

Closed abuccts closed 5 years ago

abuccts commented 5 years ago

An Azure VM running the 4.15.0-1041-azure kernel cannot install the Mellanox OFED driver (the same failure occurs with MLNX_OFED 4.3-*, 4.4-*, and 4.5-*).

Here is part of the output of ./mlnxofedinstall --force --kernel-only --without-dkms --without-fw-update --with-infiniband-diags --package-install-options -D2 -vv (mlnx_add_kernel_support.sh had already been run to add support for this kernel).

Below is the list of MLNX_OFED_LINUX packages that you have chosen
(some may have been added by the installer due to package dependencies):

libibumad
libopensm
libibmad
infiniband-diags
ofed-scripts
mlnx-ofed-kernel-utils
mlnx-ofed-kernel-modules
iser-modules
isert-modules
srp-modules
mlnx-nfsrdma-modules
mlnx-rdma-rxe-modules
kernel-mft-modules
knem-modules

This program will install the MLNX_OFED_LINUX package on your machine.
Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
Those packages are removed due to conflicts with MLNX_OFED_LINUX, do not reinstall them.

Checking SW Requirements...
Running: dpkg --configure -a --force-all
Running: apt-get install -f
Removing old packages...
Installing new packages
Installing libibumad-43.1.1.MLNX20171122.0eb0969...
Running /usr/bin/dpkg -i --force-confmiss -D2 /var/drivers/mellanox/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/DEBS/libibumad_43.1.1.MLNX20171122.0eb0969-0.1.43101_amd64.deb
Installing libopensm-5.0.0.MLNX20180219.c610c42...
Running /usr/bin/dpkg -i --force-confmiss -D2 /var/drivers/mellanox/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/DEBS/libopensm_5.0.0.MLNX20180219.c610c42-0.1.43101_amd64.deb
Installing libibmad-1.3.13.MLNX20170511.267a441...
Running /usr/bin/dpkg -i --force-confmiss -D2 /var/drivers/mellanox/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/DEBS/libibmad_1.3.13.MLNX20170511.267a441-0.1.43101_amd64.deb
Installing infiniband-diags-5.0.0.MLNX20180124.dfd2235...
Running /usr/bin/dpkg -i --force-confmiss -D2 /var/drivers/mellanox/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/DEBS/infiniband-diags_5.0.0.MLNX20180124.dfd2235-0.1.43101_amd64.deb
Installing ofed-scripts-4.3...
Running /usr/bin/dpkg -i --force-confmiss -D2 /var/drivers/mellanox/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/DEBS/ofed-scripts_4.3-OFED.4.3.1.0.1_amd64.deb
Installing mlnx-ofed-kernel-utils-4.3...
Running /usr/bin/dpkg -i --force-confnew --force-confmiss -D2 /var/drivers/mellanox/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/DEBS/mlnx-ofed-kernel-utils_4.3-OFED.4.3.1.0.1.1.g8509e41.kver.4.15.0-1041-azure_amd64.deb
Installing mlnx-ofed-kernel-modules-4.3...
Running /usr/bin/dpkg -i --force-confnew --force-confmiss -D2 /var/drivers/mellanox/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/DEBS/mlnx-ofed-kernel-modules_4.3-OFED.4.3.1.0.1.1.g8509e41.kver.4.15.0-1041-azure_all.deb

Error: mlnx-ofed-kernel-modules installation failed!
Collecting debug info...
See:
        /tmp/MLNX_OFED_LINUX.31695.logs/mlnx-ofed-kernel-modules.debinstall.log
Removing newly installed packages...

Running: /usr/sbin/ofed_uninstall.sh --force  --keep-mft

Here's part of the log file:

/usr/bin/dpkg -i --force-confnew --force-confmiss -D2 /var/drivers/mellanox/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/DEBS/mlnx-ofed-kernel-modules_4.3-OFED.4.3.1.0.1.1.g8509e41.kver.4.15.0-1041-azure_all.deb
Selecting previously unselected package mlnx-ofed-kernel-modules.
(Reading database ... 33122 files and directories currently installed.)
Preparing to unpack .../mlnx-ofed-kernel-modules_4.3-OFED.4.3.1.0.1.1.g8509e41.kver.4.15.0-1041-azure_all.deb ...
D000002: maintscript_new nonexistent preinst '/var/lib/dpkg/tmp.ci/preinst'
Unpacking mlnx-ofed-kernel-modules (4.3-OFED.4.3.1.0.1.1.g8509e41.kver.4.15.0-1041-azure) ...
D000002: process_archive tmp.ci script/file '.' contains dot
D000002: process_archive tmp.ci script/file '/var/lib/dpkg/tmp.ci/postinst' installed as '/var/lib/dpkg/info/mlnx-ofed-kernel-modules.postinst'
D000002: process_archive tmp.ci script/file '..' contains dot
D000002: process_archive tmp.ci script/file '/var/lib/dpkg/tmp.ci/control' is control
D000002: process_archive tmp.ci script/file '/var/lib/dpkg/tmp.ci/postrm' installed as '/var/lib/dpkg/info/mlnx-ofed-kernel-modules.postrm'
D000002: process_archive tmp.ci script/file '/var/lib/dpkg/tmp.ci/md5sums' installed as '/var/lib/dpkg/info/mlnx-ofed-kernel-modules.md5sums'
Setting up mlnx-ofed-kernel-modules (4.3-OFED.4.3.1.0.1.1.g8509e41.kver.4.15.0-1041-azure) ...
D000002: fork/exec /var/lib/dpkg/info/mlnx-ofed-kernel-modules.postinst ( configure  )

---------------- START OF DEBUG INFO -------------------
Install command: ./mlnxofedinstall --force --kernel-only --without-dkms --without-fw-update --with-infiniband-diags --package-install-options -D2 -vv

Vars dump:
- ofedlogs: /tmp/MLNX_OFED_LINUX.9852.logs
- MLNX_OFED_LINUX_VERSION: 4.3-1.0.1.0
- MLNX_OFED_ARCH: x86_64
- MLNX_OFED_DISTRO: ubuntu16.04
- distro: ubuntu16.04
- arch: x86_64
- kernel: 4.15.0-1041-azure
- config: /tmp/ofed.conf
- update_firmware: 0

Setup info:

- uname -r: 4.15.0-1041-azure

- uname -m: x86_64

- lsb_release -a: No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.6 LTS
Release:        16.04
Codename:       xenial

- cat /etc/issue: Ubuntu 16.04.6 LTS \n \l

- cat /proc/version: Linux version 4.15.0-1041-azure (buildd@lcy01-amd64-013) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #45-Ubuntu SMP Fri Mar 15 14:41:00 UTC 2019

- gcc --version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609

The command /usr/bin/dpkg -i --force-confnew --force-confmiss -D2 /var/drivers/mellanox/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/DEBS/mlnx-ofed-kernel-modules_4.3-OFED.4.3.1.0.1.1.g8509e41.kver.4.15.0-1041-azure_all.deb ran successfully, but the modules from mlnx-ofed-kernel-modules are still not available afterwards. The following commands produce no output:

$ depmod -a
$ lsmod | grep mlnx

The issue started after the Azure VM was automatically upgraded to the 4.15.0-1041-azure kernel.
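
For anyone reproducing this, a slightly broader check than grepping for mlnx may help. The sketch below filters lsmod-style output for module names that look Mellanox- or RDMA-related. The helper name list_rdma_modules and the name prefixes (mlx, ib_, rdma_) are assumptions for illustration and are not taken from the installer log.

```shell
# List loaded modules whose names look Mellanox/RDMA-related.
# Reads lsmod-style text on stdin, so it also works on saved output.
# Assumption: relevant modules start with mlx, ib_ or rdma_.
list_rdma_modules() {
    # NR > 1 skips the "Module  Size  Used by" header line.
    awk 'NR > 1 && $1 ~ /^(mlx|ib_|rdma_)/ { print $1 }'
}

# Usage:
#   lsmod | list_rdma_modules
```

Empty output from this filter as well would suggest that no such driver is loaded as a module on the running kernel.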

abuccts commented 5 years ago

The InfiniBand drivers are built into the vmlinux image of this kernel instead of being built as loadable modules, so the Mellanox OFED installation does not work.

Corresponding config in /boot/config-4.15.0-1041-azure:

CONFIG_MLX4_CORE=y
CONFIG_MLX5_CORE=y
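
To check other kernels for the same condition, something like the following sketch could be used. It reads a kernel config file and reports whether each option is set to y (built into vmlinux), m (loadable module), or not set at all. The helper name kconfig_state is hypothetical; the config path and option names in the usage comment are the ones quoted above.

```shell
# Report how each given kernel config option is set in a config file:
#   =y -> built into the kernel image, =m -> loadable module,
#   anything else (including "# ... is not set") -> not set.
# kconfig_state is a hypothetical helper name, not part of any tool.
kconfig_state() {
    config_file=$1
    shift
    for opt in "$@"; do
        # Print only the value after "OPTION=" for an exact option match.
        value=$(sed -n "s/^${opt}=//p" "$config_file" | head -n 1)
        case $value in
            y) echo "$opt: built-in" ;;
            m) echo "$opt: module" ;;
            *) echo "$opt: not set" ;;
        esac
    done
}

# Usage:
#   kconfig_state /boot/config-4.15.0-1041-azure CONFIG_MLX4_CORE CONFIG_MLX5_CORE
```

With the two config lines quoted above, both options would be reported as built-in.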