ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/psm3: "munmap_chunk(): invalid pointer" on cleanup of fi_rdm_tagged_peek with OOB #10123

Open zachdworkin opened 3 months ago

zachdworkin commented 3 months ago

fi_rdm_tagged_peek fails to clean up on the server side and aborts with "munmap_chunk(): invalid pointer" when FI_PROVIDER="psm3" is set.

To Reproduce
server_cmd: FI_PROVIDER=psm3 fi_rdm_tagged_peek -p psm3 -E
client_cmd: FI_PROVIDER=psm3 fi_rdm_tagged_peek -p psm3 -E "server_address"
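The two reproduction commands can be wrapped in a small script, sketched here under a few assumptions: `fi_rdm_tagged_peek` is on `PATH` on both nodes, the helper names `run`, `server_cmd`, and `client_cmd` are hypothetical, and the server address is passed in by the caller because the report only gives it as a placeholder. By default the helpers only print the command; setting `RUN=1` executes it.

```shell
#!/bin/sh
# Sketch of the reproduction steps from this report.
# Assumption: fi_rdm_tagged_peek is installed and on PATH.
# The helper names below are hypothetical, not part of fabtests.

# Print the command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        env "$@"
    else
        echo "$*"
    fi
}

# Server side: this is the side that aborts during cleanup in the report.
server_cmd() {
    run FI_PROVIDER=psm3 fi_rdm_tagged_peek -p psm3 -E
}

# Client side: $1 is the server address (a placeholder in the report).
client_cmd() {
    run FI_PROVIDER=psm3 fi_rdm_tagged_peek -p psm3 -E "$1"
}
```

Start `RUN=1 server_cmd` on the server node first, then `RUN=1 client_cmd <address>` on the client node.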

Expected behavior
Test passes successfully.

Output
Server Output:
Sending 10 tagged messages
Waiting for messages to complete
munmap_chunk(): invalid pointer
Aborted (core dumped)

Server Backtrace:
(gdb) bt
#0  0x00007ffff6496aff in raise () from /lib64/libc.so.6
#1  0x00007ffff6469ea5 in abort () from /lib64/libc.so.6
#2  0x00007ffff64d9097 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff64e04ec in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff64e079c in munmap_chunk () from /lib64/libc.so.6
#5  0x00007ffff7a88e0f in psm3_free_internal (ptr=0x735a80, curloc=0x7ffff7b12953 "prov/psm3/psm3/psm_ep.c:1163") at prov/psm3/psm3/psm_utils.c:3964
#6  0x00007ffff7a63d41 in psm3_ep_close (ep=0x636ac0, mode=0, timeout_in=2000000000) at prov/psm3/psm3/psm_ep.c:1163
#7  0x00007ffff7a29b31 in psmx3_trx_ctxt_free (trx_ctxt=0x62b3a0, usage_flags=3) at prov/psm3/src/psmx3_trx_ctxt.c:223
#8  0x00007ffff7a11cea in psmx3_ep_close (fid=0x7349b0) at prov/psm3/src/psmx3_ep.c:234
#9  0x0000000000403fb1 in fi_close (fid=) at /path_to_libfabric_install/include/rdma/fabric.h:632
#10 ft_close_fids () at common/shared.c:1792
#11 0x0000000000404a9a in ft_free_res () at common/shared.c:1862
#12 0x0000000000401b2a in main (argc=, argv=) at functional/rdm_tagged_peek.c:364
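A server backtrace like this one can be captured non-interactively by running the server under gdb in batch mode. The following is a sketch, assuming gdb is installed and `fi_rdm_tagged_peek` is on `PATH`; the helper names `run` and `server_backtrace_cmd` are hypothetical. It prints the command by default and executes it only when `RUN=1`.

```shell
#!/bin/sh
# Sketch: run the server under gdb and print a backtrace when it aborts.
# Assumptions: gdb is installed and fi_rdm_tagged_peek is on PATH.
# The helper names below are hypothetical.

# Print the command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        env "$@"
    else
        echo "$*"
    fi
}

# -batch exits gdb once the -ex commands finish;
# "run" starts the program and "bt" prints the stack after the abort.
server_backtrace_cmd() {
    run FI_PROVIDER=psm3 gdb -batch -ex run -ex bt \
        --args fi_rdm_tagged_peek -p psm3 -E
}
```

The client still has to be started separately against this server for the test to reach the cleanup path.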

Client Output:
Peek for a bad msg
Peek w/ claim for a bad msg
Peek msg 1
Receive msg 1
Peek w/ claim msg 2
Receive claimed msg 2
Peek & discard msg 3
Checking to see if msg 3 was discarded
Peek w/ claim msg 4
Claim and discard msg 4
Receive msg 5
Receive msg 6
Receive msg 10
Receive msg 9
Receive msg 8
Receive msg 7

Environment: Rocky 8.7, mlnx 5.0

Additional context
Setting and unsetting FI_PROVIDER fixes this bug.
The specific free() call that fails is the one freeing the hfi_nids struct at psm_ep.c:1163.

zachdworkin commented 3 months ago

#10124 disables the fi_rdm_tagged_peek test in CI while this bug is investigated. Please revert that change once this issue is resolved.