openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

ucp_client_server client not stop #9694

Closed YuvalAbadi closed 8 months ago

YuvalAbadi commented 9 months ago

Describe the bug

When the message size (-s) is set above 20, the client never finishes its run. The server receives the FIN message, but the client does not return.

Steps to Reproduce

Setup and versions

tvegas1 commented 8 months ago

Tested successfully master and v1.15 with:

Also tested with -c am added on both sides. Do you also have the -s parameter on the server side?

YuvalAbadi commented 8 months ago

I didn't add -s on both sides.

Does the server reply to the client with the same message?

tvegas1 commented 8 months ago

Yes, once, so the -s parameter is also needed on the server side.
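A minimal sketch of one way to keep -s identical on both sides: derive the server and client command lines from a single variable. The variable names, the helper functions, and the binary path are illustrative assumptions; only the -s, -i, and -a flags come from this thread. The script prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical wrapper: builds both command lines from one MSG_SIZE
# so the server and client can never disagree on -s.
MSG_SIZE=1000                        # value passed to -s on both sides
ITERATIONS=100                       # value passed to -i on both sides
SERVER_ADDR=localhost                # client-only: value passed to -a
BIN=./examples/ucp_client_server     # assumed location of the example binary

# Print the server command line (no -a, so it listens).
server_cmd() {
    printf '%s -s %s -i %s\n' "$BIN" "$MSG_SIZE" "$ITERATIONS"
}

# Print the client command line (same -s and -i, plus -a).
client_cmd() {
    printf '%s -s %s -i %s -a %s\n' "$BIN" "$MSG_SIZE" "$ITERATIONS" "$SERVER_ADDR"
}

server_cmd
client_cmd
```

Run the printed server line first, then the client line in a second terminal.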

YuvalAbadi commented 8 months ago

Thanks

When I add -c am, the client and server both end with a segmentation fault:

./ucp_client_server -a localhost -i 10000 -s 1000 -c am
./ucp_client_server -s 1000 -c am -s 1000

client:

[matrix-load-load2-instance:18990:0:18990] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid: 18990) ====
0 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucs.so.0(ucs_handle_error+0x144) [0x7fa625f9e714]
1 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucs.so.0(+0x30a8c) [0x7fa625f9ea8c]
2 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucs.so.0(+0x30d04) [0x7fa625f9ed04]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10) [0x7fa625bbbf10]
4 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucp.so.0(ucp_am_handler+0x5a) [0x7fa6262037fa]
5 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libuct.so.0(+0x22848) [0x7fa625962848]
6 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libuct.so.0(+0x24f78) [0x7fa625964f78]
7 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucs.so.0(ucs_event_set_wait+0xb3) [0x7fa625fa97b3]
8 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libuct.so.0(uct_tcp_iface_progress+0x7b) [0x7fa62596501b]
9 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucp.so.0(ucp_worker_progress+0x22) [0x7fa626220eb2]
10 ./ucp_client_server(+0x2702) [0x56426282a702]
11 ./ucp_client_server(+0x276a) [0x56426282a76a]
12 ./ucp_client_server(+0x2f9a) [0x56426282af9a]
13 ./ucp_client_server(+0x35d4) [0x56426282b5d4]
14 ./ucp_client_server(+0x3bea) [0x56426282bbea]
15 ./ucp_client_server(+0x3f35) [0x56426282bf35]
16 ./ucp_client_server(+0x4141) [0x56426282c141]
17 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fa625b9ec87]
18 ./ucp_client_server(+0x170a) [0x56426282970a]

Segmentation fault (core dumped)

server:

Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 18987) ====
0 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucs.so.0(ucs_handle_error+0x144) [0x7f2c63bba714]
1 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucs.so.0(+0x30a8c) [0x7f2c63bbaa8c]
2 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucs.so.0(+0x30d04) [0x7f2c63bbad04]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10) [0x7f2c637d7f10]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x18eb00) [0x7f2c63927b00]
5 ./ucp_client_server(+0x18c0) [0x55d17b4cc8c0]
6 ./ucp_client_server(+0x2d8d) [0x55d17b4cdd8d]
7 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucp.so.0(ucp_am_handler+0x199) [0x7f2c63e1f939]
8 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libuct.so.0(+0x22848) [0x7f2c6357e848]
9 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libuct.so.0(+0x24f78) [0x7f2c63580f78]
10 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucs.so.0(ucs_event_set_wait+0xb3) [0x7f2c63bc57b3]
11 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libuct.so.0(uct_tcp_iface_progress+0x7b) [0x7f2c6358101b]
12 /home/ubuntu/yabadi/ucx/ucx-1.15.0/install/lib/libucp.so.0(ucp_worker_progress+0x22) [0x7f2c63e3ceb2]
13 ./ucp_client_server(+0x3d0a) [0x55d17b4ced0a]
14 ./ucp_client_server(+0x3e22) [0x55d17b4cee22]
15 ./ucp_client_server(+0x4126) [0x55d17b4cf126]
16 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f2c637bac87]
17 ./ucp_client_server(+0x170a) [0x55d17b4cc70a]

Segmentation fault (core dumped)

tvegas1 commented 8 months ago

I tried the similar commands below and could not reproduce the crash. Could you please try a later version?

./examples/ucp_client_server -c am -i 10000 -s 10000
./examples/ucp_client_server -c am -i 10000 -s 10000 -a x.x.x.x
YuvalAbadi commented 8 months ago

Thanks

YuvalAbadi commented 8 months ago

When I use:

./examples/ucp_client_server -s 1000 -i 100 -c am
./examples/ucp_client_server -s 1000 -i 100 -c am -a localhost

the test seems to end: there are no more prints, but the client does not return, and both the server and the client sit at 99% CPU. I compiled v1.16.x devel.

How can I configure the test to use rendezvous, with the server replying to the client (ping-pong)?

tvegas1 commented 8 months ago

"the test seem to ends, no more prints but client not return both server and client 99% CPU I compile v.16x devel"

That should be fixed by #9701.

tvegas1 commented 8 months ago

Merged #9701.