ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/ucx: fi_rdm_tagged_peek cleanup race condition #10126

Open zachdworkin opened 3 months ago

zachdworkin commented 3 months ago

Describe the bug fi_rdm_tagged_peek has a race condition during cleanup: the process intermittently crashes with a segmentation fault while closing the endpoint.

To Reproduce
Build with UCX, then run:
server_cmd: fi_rdm_tagged_peek -p ucx -E
client_cmd: fi_rdm_tagged_peek -p ucx -E server_address

Expected behavior Test passes successfully

Output
Server output:
server_cmd: /path_to_fabtests_install/fi_rdm_tagged_peek -p "ucx" -E
server_stdout: |
  Sending 10 tagged messages
  Waiting for messages to complete
  [node:3176869:0:3176869] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
  ==== backtrace (tid:3176869) ====
   0 0x0000000000012cf0 __funlockfile() :0
   1 0x0000000000033210 ucp_ep_destroy_base() ???:0
   2 0x000000000004b3ee ucp_worker_discard_uct_ep_progress() ???:0
   3 0x000000000004b4b5 ucp_worker_destroy() ???:0
   4 0x00000000000ca7fa ucx_ep_close() ucx_ep.c:0
   5 0x0000000000404081 fi_close() /path_to_libfabric_install/include/rdma/fabric.h:632
   6 0x0000000000404081 ft_close_fids() /path_to_libfabric_source/fabtests/common/shared.c:1792
   7 0x0000000000404b6a ft_free_res() /path_to_libfabric_source/fabtests/common/shared.c:1862
   8 0x0000000000401bfa main() /hpath_to_libfabric_source/fabtests/functional/rdm_tagged_peek.c:364
   9 0x0000000000401bfa main() /path_to_libfabric_source/fabtests/functional/rdm_tagged_peek.c:365
  10 0x000000000003ad85 __libc_start_main() ???:0
  11 0x000000000040203e _start() ???:0

Client output:
client_cmd: /path_to_fabtests_install/fi_rdm_tagged_peek -p "ucx" -E server_address
client_stdout: |
  Peek for a bad msg
  Peek w/ claim for a bad msg
  Peek msg 1
  Receive msg 1
  Peek w/ claim msg 2
  Receive claimed msg 2
  Peek & discard msg 3
  Checking to see if msg 3 was discarded
  Peek w/ claim msg 4
  Claim and discard msg 4
  Receive msg 5
  Receive msg 6
  Receive msg 10
  Receive msg 9
  Receive msg 8
  Receive msg 7

Environment: Rocky Linux 8.7

Additional context The failure is intermittent, consistent with a race condition. There is no known case that reproduces it 100% of the time.
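Since the failure is intermittent, one way to chase it is to rerun the server/client pair in a loop until either side exits abnormally. Below is a minimal Python sketch of such a loop. It is not part of fabtests: the helper names (`run_pair_once`, `describe`, `repeat_until_failure`), the fixed startup delay, and the timeout are all hypothetical choices for illustration, and it assumes a crash shows up as a nonzero or signal-terminated exit status.

```python
import signal
import subprocess
import time


def run_pair_once(server_cmd, client_cmd, startup_delay=1.0, timeout=60):
    """Run one server/client pair and return (server_rc, client_rc)."""
    quiet = {"stdout": subprocess.DEVNULL, "stderr": subprocess.DEVNULL}
    server = subprocess.Popen(server_cmd, **quiet)
    client = None
    try:
        # Assumption: a fixed sleep is enough for the server to start listening.
        time.sleep(startup_delay)
        client = subprocess.Popen(client_cmd, **quiet)
        client_rc = client.wait(timeout=timeout)
        server_rc = server.wait(timeout=timeout)
    finally:
        # Never leave stray processes behind, even on timeout.
        for proc in (server, client):
            if proc is not None and proc.poll() is None:
                proc.kill()
                proc.wait()
    return server_rc, client_rc


def describe(returncode):
    """Name the signal for a negative return code, else show the exit code."""
    if returncode < 0:
        return signal.Signals(-returncode).name
    return "exit code %d" % returncode


def repeat_until_failure(server_cmd, client_cmd, max_runs=100, **kwargs):
    """Return (run, side, returncode) for the first abnormal exit, or None."""
    for run in range(1, max_runs + 1):
        server_rc, client_rc = run_pair_once(server_cmd, client_cmd, **kwargs)
        for side, rc in (("server", server_rc), ("client", client_rc)):
            if rc != 0:
                return run, side, rc
    return None


# Hypothetical usage with the commands from this report; 127.0.0.1 stands in
# for server_address and assumes both sides run on the same node:
#   hit = repeat_until_failure(
#       ["fi_rdm_tagged_peek", "-p", "ucx", "-E"],
#       ["fi_rdm_tagged_peek", "-p", "ucx", "-E", "127.0.0.1"],
#   )
#   if hit:
#       print("run %d: %s ended with %s" % (hit[0], hit[1], describe(hit[2])))
```

The loop stops at the first abnormal exit so that the failing run number can be noted. Capturing stdout instead of discarding it would also preserve the backtrace for comparison with the one above.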

zachdworkin commented 3 months ago

Revert #10124's ucx test disable when this is resolved.