MartinHilgeman opened 4 years ago
The failure comes from the driver memory registration.
Is it the pinned memory limit?
`ulimit -l` is set to 'unlimited'.
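For reference, a quick way to double-check the locked-memory limit on each node. The `limits.conf` lines below are the usual convention for lifting it cluster-wide, not something taken from this thread:

```shell
# Show the max locked-memory (pinned) limit for the current shell;
# RDMA memory registration must fit within this limit.
ulimit -l

# Typical (assumed) settings in /etc/security/limits.conf; note that the
# daemon launching batch jobs (e.g. slurmd) must inherit them as well:
#   * soft memlock unlimited
#   * hard memlock unlimited
```

Checking inside the job environment matters, since interactive shells and batch daemons can carry different limits.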
Is there any mlx5 error in dmesg from the same time the failure happens in the application? It looks like `get_user_pages()` failed for some reason. Can you please attach the full output?
This is in dmesg:

```
[12515.725004] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
[12515.730707] mlx5_core 0000:c1:00.0: mlx5_cmd_check:771:(pid 63706): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
[12515.730710] mlx5_core 0000:c1:00.0: reclaim_pages:403:(pid 63706): failed reclaiming pages: err -5
[12515.730712] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
[12515.736172] mlx5_core 0000:c1:00.0: mlx5_cmd_check:771:(pid 63706): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
[12515.736174] mlx5_core 0000:c1:00.0: reclaim_pages:403:(pid 63706): failed reclaiming pages: err -5
[12515.736177] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
[12515.741991] mlx5_core 0000:c1:00.0: mlx5_cmd_check:771:(pid 63706): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
[12515.741994] mlx5_core 0000:c1:00.0: reclaim_pages:403:(pid 63706): failed reclaiming pages: err -5
[12515.741996] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
[12515.747510] mlx5_core 0000:c1:00.0: mlx5_cmd_check:771:(pid 63706): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
[12515.747513] mlx5_core 0000:c1:00.0: reclaim_pages:403:(pid 63706): failed reclaiming pages: err -5
[12515.747515] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
[12515.753309] mlx5_core 0000:c1:00.0: mlx5_cmd_check:771:(pid 63706): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
[12515.753312] mlx5_core 0000:c1:00.0: reclaim_pages:403:(pid 63706): failed reclaiming pages: err -5
[12515.753314] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
```
I am running the latest firmware at the moment:

```
CA 'mlx5_2'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.26.1040
    Hardware version: 0
    Node GUID: 0xb8599f030024a7e0
    System image GUID: 0xb8599f030024a7e0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 3
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0xb8599f030024a7e0
        Link layer: InfiniBand
```
@MartinHilgeman this seems like a FW issue.
@MartinHilgeman Did you find a solution? Does the problem not occur if you disable HT?
I see similar page reclaim errors on a ConnectX-6 IB HCA on an AMD EPYC 7402 (unrelated to UCX, though!).
@knweiss Yes, my issue was fixed by enabling IOMMU and setting iommu=pt on the kernel boot line.
Enabling the IOMMU and then setting iommu=pt means all transactions go through the IOMMU check but are just passed through, as if no IOMMU were enabled. This only adds some minor delay, which must be hiding a race condition, or the BIOS is not providing the proper reserved-region information to the OS and some memory region used by the kernel is being stomped on by the system firmware (like a temperature/fan-monitoring firmware utility). What version of Linux are you running? Which mlx5 driver are you using: OFED, or the one that comes with your kernel?
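For anyone checking whether the same fix applies to their nodes, a small sketch for verifying the boot line and IOMMU state. The paths are standard Linux procfs/sysfs locations, nothing specific to this cluster:

```shell
# Is iommu=pt present on the running kernel's command line?
grep -q 'iommu=pt' /proc/cmdline && echo "iommu=pt is set" || echo "iommu=pt not set"

# When the IOMMU is active, the kernel exposes its groups here;
# a count of 0 suggests the IOMMU is disabled (or no groups were created).
ls /sys/kernel/iommu_groups/ 2>/dev/null | wc -l
```

On x86 the IOMMU itself is enabled via BIOS plus `intel_iommu=on` or `amd_iommu=on`; `iommu=pt` then selects passthrough mode for host devices.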
I am using UCX 1.5.1 with MLNX HDR-200. When I enable SMT on our AMD EPYC 7742 nodes, my GROMACS job crashes right after startup:
```
% module load openmpi/intel19/4.0.1
% mpirun --bind-to none --mca pml ucx --mca osc ucx --mca spml ucx \
    --mca btl ^vader,tcp,openib,uct --mca MCA_coll_hcoll_enable 0 \
    -x UCX_NET_DEVICES=mlx5_2:1 -x UCX_TLS=sm,rc_x \
    gmx_mpi mdrun -s bench.tpr ...
```
```
...
[1572531563.965616] [daytona20:41534:0] mpool.c:176 UCX ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
[1572531563.994840] [daytona19:41611:0] ib_md.c:362 UCX ERROR ibv_reg_mr(address=0x2aaac991b000, length=19034112, access=0xf) failed: Bad address
[1572531563.995077] [daytona19:41611:0] mpool.c:176 UCX ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
[1572531564.002548] [daytona19:41516:0] ib_md.c:362 UCX ERROR ibv_reg_mr(address=0x2aaac991b000, length=19034112, access=0xf) failed: Bad address
[1572531564.002777] [daytona19:41516:0] mpool.c:176 UCX ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
```
regards,
-Martin
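A note for anyone decoding the messages in this thread: "Bad address" from `ibv_reg_mr()` is EFAULT, which fits the `get_user_pages()` failure suspected earlier, and "Input/output error" (and the `err -5` in dmesg) is EIO. A quick portable lookup, using python3 here only as an errno table:

```shell
# Print the errno numbers and strings behind the two error messages
# seen in the logs (values shown are for Linux).
python3 -c 'import errno, os
for n in (errno.EIO, errno.EFAULT):
    print(n, os.strerror(n))'
```

On Linux this prints `5 Input/output error` and `14 Bad address`, matching the UCX and mlx5_core output above.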