Closed ydye closed 5 years ago
Leaving this issue pending while we keep investigating why the installation fails after upgrading OFED to the latest version. @fanyangCS @scarlett2018 @sterowang
There are some compiler errors.
CC [M] /tmp/mlnx_iso.50/mlnx-ofed-kernel/mlnx-ofed-kernel-4.2/drivers/net/ethernet/mellanox/mlx5/core/debugfs.o
/tmp/mlnx_iso.50/mlnx-ofed-kernel/mlnx-ofed-kernel-4.2/drivers/net/ethernet/mellanox/mlx4/catas.c: In function 'mlx4_start_catas_poll':
/tmp/mlnx_iso.50/mlnx-ofed-kernel/mlnx-ofed-kernel-4.2/drivers/net/ethernet/mellanox/mlx4/catas.c:282:2: error: implicit declaration of function 'init_timer' [-Werror=implicit-function-declaration]
init_timer(&priv->catas_err.timer);
^
/tmp/mlnx_iso.50/mlnx-ofed-kernel/mlnx-ofed-kernel-4.2/drivers/net/ethernet/mellanox/mlx4/catas.c:298:23: error: 'struct timer_list' has no member named 'data'
priv->catas_err.timer.data = (unsigned long) dev;
^
/tmp/mlnx_iso.50/mlnx-ofed-kernel/mlnx-ofed-kernel-4.2/drivers/net/ethernet/mellanox/mlx4/catas.c:299:33: error: assignment from incompatible pointer type [-Werror=incompatible-pointer-types]
priv->catas_err.timer.function = poll_catas;
^
cc1: all warnings being treated as errors
scripts/Makefile.build:332: recipe for target '/tmp/mlnx_iso.50/mlnx-ofed-kernel/mlnx-ofed-kernel-4.2/drivers/net/ethernet/mellanox/mlx4/catas.o' failed
make[5]: [/tmp/mlnx_iso.50/mlnx-ofed-kernel/mlnx-ofed-kernel-4.2/drivers/net/ethernet/mellanox/mlx4/catas.o] Error 1
make[5]: Waiting for unfinished jobs....
This is due to a Linux kernel update that changed the interface. Upgrading the OFED version could solve this issue, but after upgrading OFED, another problem occurs. That is the problem I described in this issue.
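The three errors in the log (missing init_timer, no data member in struct timer_list, and an incompatible callback pointer) are consistent with the kernel timer API change in which timer_setup() replaced init_timer(). As a rough sketch, a build log could be scanned for these signatures before deciding whether an OFED upgrade is needed. The function name and the signature table are hypothetical helpers for illustration, not part of OpenPAI or OFED; the patterns are copied from the log quoted above.

```python
import re
from typing import List

# Error signatures copied from the build log quoted in this thread.
# The dictionary keys are illustrative labels, not official names.
TIMER_API_SIGNATURES = {
    "init_timer": re.compile(r"implicit declaration of function 'init_timer'"),
    "timer_list.data": re.compile(r"'struct timer_list' has no member named 'data'"),
}


def find_timer_api_errors(log_text: str) -> List[str]:
    """Return the labels of the timer-API error signatures found in a build log."""
    return [
        label
        for label, pattern in TIMER_API_SIGNATURES.items()
        if pattern.search(log_text)
    ]


sample_log = (
    "catas.c:282:2: error: implicit declaration of function 'init_timer' "
    "[-Werror=implicit-function-declaration]\n"
    "catas.c:298:23: error: 'struct timer_list' has no member named 'data'\n"
)
print(find_timer_api_errors(sample_log))
```

If either label is reported, the driver source was probably written against the older timer interface than the one in the running kernel's headers.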
Mitigation PR: #2514
Shall we put the mitigation in Mid April or End April? @sterowang @ydye
@scarlett2018 @sterowang I think it should be in.
cc @Gerhut @qfyin - could we put this in Mid April? Since this only affects installation, it should not have a big impact on the overall release plan.
How much test effort does this fix need? We have only one day for testing in the Mid April plan.
1) Rebuild the image.
2) Before deployment, configure the cluster to disable IB installation:
drivers:
  skip-ib-installation: true
3) If the drivers step passes, the fix is OK.
@ydye - could you take this test case on the Mid April test date? If so, I guess it will not have much impact on the Mid April release. cc @Gerhut @qfyin
ok
One thing that can lower the chance of hitting the problem is to turn the setting off by default. On the other hand, we need to consider users who are already using IB when working on the design.
For Azure VMs:
The InfiniBand kernel modules are built into the vmlinux image, so there is NO need to install Mellanox OFED.
If there is a Mellanox device in the lspci output, the IB device can be used directly. Otherwise, there is no IB in the VM.
Corresponding config in /boot/config-$(uname -r):
CONFIG_MLX4_CORE=y
CONFIG_MLX5_CORE=y
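The two checks described above (a Mellanox device in lspci, and built-in mlx options in the kernel config) could be sketched as follows. This is a minimal illustration under the assumptions stated in this comment; the function names are hypothetical, and the helpers take text as input so they can be tried without an Azure VM.

```python
import platform
import subprocess
from pathlib import Path
from typing import List


def has_mellanox_device(lspci_output: str) -> bool:
    """True if any line of lspci output mentions a Mellanox device."""
    return any("mellanox" in line.lower() for line in lspci_output.splitlines())


def builtin_mlx_options(kernel_config: str) -> List[str]:
    """Return the mlx core options that are built in (=y) in a kernel config."""
    wanted = ("CONFIG_MLX4_CORE", "CONFIG_MLX5_CORE")
    found = []
    for line in kernel_config.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key in wanted and value == "y":
            found.append(key)
    return found


def ib_usable_without_ofed(lspci_output: str, kernel_config: str) -> bool:
    """Both conditions from the comment above: device present and driver built in."""
    return has_mellanox_device(lspci_output) and bool(builtin_mlx_options(kernel_config))


def check_this_host() -> bool:
    """Run the check on the local machine (assumes lspci is installed)."""
    lspci = subprocess.run(["lspci"], capture_output=True, text=True, check=False).stdout
    # platform.release() returns the same string as `uname -r`.
    config = Path(f"/boot/config-{platform.release()}").read_text()
    return ib_usable_without_ofed(lspci, config)
```

Calling check_this_host() on a VM returns True only when both conditions hold; options set to =m (loadable module) are deliberately not counted as built in.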
Disabled IB driver (OFED) installation in #2595.
To use InfiniBand in a non-privileged Docker container, we need two flags:
--cap-add=IPC_LOCK
--device=/dev/infiniband
Added in #2657.
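For illustration, the two flags can be assembled into a docker run command line as below. This is only a sketch: the image name and the command are placeholders, and the helper is hypothetical rather than the code added in #2657.

```python
import shlex
from typing import List, Sequence

# The two flags named above for InfiniBand in a non-privileged container.
IB_FLAGS = ["--cap-add=IPC_LOCK", "--device=/dev/infiniband"]


def docker_run_command(image: str, command: Sequence[str], enable_ib: bool = True) -> List[str]:
    """Build the argv for `docker run`, adding the IB flags when requested."""
    argv = ["docker", "run", "--rm"]
    if enable_ib:
        argv += IB_FLAGS
    argv.append(image)
    argv += list(command)
    return argv


# "example/ib-image" is a placeholder image; ibv_devinfo is assumed to be installed in it.
print(shlex.join(docker_run_command("example/ib-image", ["ibv_devinfo"])))
```

Keeping the flags in one list makes it easy to leave them out on hosts where /dev/infiniband does not exist, since Docker would otherwise fail to start the container.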
Important Note
A mitigation for this issue has been provided in PR #2514 (@ydye). Please refer to the following comment for instructions on applying the fix: https://github.com/Microsoft/pai/issues/2470#issuecomment-481125214
This issue tracks a more formal resolution.
Organization Name: OpenPAI
Short summary about the issue/question:
With OpenPAI's official image, the installation fails because of Ubuntu's kernel upgrade: some kernel interfaces have changed, which causes the build failure. After upgrading OFED to version 4.5-1.0.1.0, the installation still fails. The log follows.
File /tmp/MLNX_OFED_LINUX.18160.logs/mlnx-ofed-kernel-modules.debinstall.log
Output of lspci
Brief what process you are following:
How to reproduce it:
OpenPAI Environment:
uname -a: 4.15.0-1040-azure
Anything else we need to know: