openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCX ERROR Failed to allocate memory pool with SMT enabled #4365

Open MartinHilgeman opened 4 years ago

MartinHilgeman commented 4 years ago

I am using UCX 1.5.1 with MLNX HDR-200. When I enabled SMT on our AMD EPYC 7742 nodes, my GROMACS job crashes right after startup:

% module load openmpi/intel19/4.0.1
% mpirun --bind-to none --mca pml ucx --mca osc ucx --mca spml ucx \
    --mca btl ^vader,tcp,openib,uct --mca MCA_coll_hcoll_enable 0 \
    -x UCX_NET_DEVICES=mlx5_2:1 -x UCX_TLS=sm,rc_x \
    gmx_mpi mdrun -s bench.tpr ...

...
[1572531563.965616] [daytona20:41534:0] mpool.c:176 UCX ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
[1572531563.994840] [daytona19:41611:0] ib_md.c:362 UCX ERROR ibv_reg_mr(address=0x2aaac991b000, length=19034112, access=0xf) failed: Bad address
[1572531563.995077] [daytona19:41611:0] mpool.c:176 UCX ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
[1572531564.002548] [daytona19:41516:0] ib_md.c:362 UCX ERROR ibv_reg_mr(address=0x2aaac991b000, length=19034112, access=0xf) failed: Bad address
[1572531564.002777] [daytona19:41516:0] mpool.c:176 UCX ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error

regards,

-Martin

yosefe commented 4 years ago

The failure comes from the driver's memory registration path.

shamisp commented 4 years ago

Could it be the pinned memory limit?

MartinHilgeman commented 4 years ago

ulimit -l is set to 'unlimited'.
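The limit that matters for RDMA registration is the one inherited by the MPI-launched processes on the compute nodes, which can be lower than the login shell's. A quick way to check (the remote-check line is a hedged sketch; host names are illustrative):

```shell
# Verify the locked-memory (memlock) limit in the current shell; RDMA
# memory registration needs "unlimited" or a very large value.
ulimit -l

# Under a launcher, the limit the remote daemons inherit is what counts --
# it can differ from the login shell (host names below are illustrative):
#   mpirun --host daytona19,daytona20 sh -c 'echo "$(hostname): $(ulimit -l)"'
```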

yosefe commented 4 years ago

Is there any mlx5 error in dmesg from the same time the failure happens in the application? It looks like get_user_pages() failed for some reason. Can you please attach the full output?
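A quick way to pull the relevant driver messages with readable timestamps (run on the failing compute node; may require root):

```shell
# Show recent kernel messages from the mlx5 driver with human-readable
# timestamps; run right after the job fails to correlate with the error.
dmesg -T | grep -i mlx5 | tail -n 50
```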

MartinHilgeman commented 4 years ago

This is in dmesg:

[12515.725004] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
[12515.730707] mlx5_core 0000:c1:00.0: mlx5_cmd_check:771:(pid 63706): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
[12515.730710] mlx5_core 0000:c1:00.0: reclaim_pages:403:(pid 63706): failed reclaiming pages: err -5
[12515.730712] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
[12515.736172] mlx5_core 0000:c1:00.0: mlx5_cmd_check:771:(pid 63706): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
[12515.736174] mlx5_core 0000:c1:00.0: reclaim_pages:403:(pid 63706): failed reclaiming pages: err -5
[12515.736177] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
[12515.741991] mlx5_core 0000:c1:00.0: mlx5_cmd_check:771:(pid 63706): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
[12515.741994] mlx5_core 0000:c1:00.0: reclaim_pages:403:(pid 63706): failed reclaiming pages: err -5
[12515.741996] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
[12515.747510] mlx5_core 0000:c1:00.0: mlx5_cmd_check:771:(pid 63706): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
[12515.747513] mlx5_core 0000:c1:00.0: reclaim_pages:403:(pid 63706): failed reclaiming pages: err -5
[12515.747515] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5
[12515.753309] mlx5_core 0000:c1:00.0: mlx5_cmd_check:771:(pid 63706): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
[12515.753312] mlx5_core 0000:c1:00.0: reclaim_pages:403:(pid 63706): failed reclaiming pages: err -5
[12515.753314] mlx5_core 0000:c1:00.0: pages_work_handler:468:(pid 63706): reclaim fail -5

MartinHilgeman commented 4 years ago

I am running the latest firmware at the moment:

CA 'mlx5_2'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.26.1040
        Hardware version: 0
        Node GUID: 0xb8599f030024a7e0
        System image GUID: 0xb8599f030024a7e0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 3
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0xb8599f030024a7e0
                Link layer: InfiniBand

yosefe commented 4 years ago

@MartinHilgeman this seems like a FW issue.

knweiss commented 4 years ago

@MartinHilgeman Did you find a solution? Does the problem not occur if you disable HT?

I see similar page reclaim errors with a ConnectX-6 IB HCA in an AMD EPYC 7402 system (unrelated to UCX, though!).

MartinHilgeman commented 4 years ago

@knweiss Yes, my issue was fixed by enabling IOMMU and setting iommu=pt on the kernel boot line.
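For reference, a typical way to apply this on a GRUB-based distribution (file paths and tool names vary by distro; this is a sketch, not the exact configuration used here):

```shell
# 1) In /etc/default/grub, append the IOMMU options to the kernel line
#    (AMD system), e.g.:
#      GRUB_CMDLINE_LINUX="... amd_iommu=on iommu=pt"
# 2) Regenerate the GRUB config (path differs between distros):
#      grub2-mkconfig -o /boot/grub2/grub.cfg
# 3) Reboot, then confirm the options took effect:
cat /proc/cmdline
```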

ddutile commented 4 years ago

Enabling the IOMMU and then setting iommu=pt means all DMA transactions go through the IOMMU check but are simply passed through, as if no IOMMU were enabled. This only adds a minor delay, which must be hiding a race condition, or the BIOS is not providing the proper reserved-region information to the OS and some memory region used by the kernel is being stomped on by system firmware (e.g. a temperature or fan monitoring utility). What version of Linux are you running? Which mlx driver -- OFED, or the one that comes with your kernel?