ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

fabtests: Synchronize on Initialization #10108

Closed zachdworkin closed 3 months ago

zachdworkin commented 3 months ago

fabtests: Synchronize on Initialization fabtests/functional: Add manual init sync to fi_rdm_multiclient

Some providers (verbs ud) may require the server to be fully initialized before the client process calls getinfo with the server address. Otherwise the client's fi_info call fails with a "No Data Available" error because it cannot resolve the name on the server. This is seen most often when a socket (usually OOB, out of band) is initialized before getinfo in the ft_init_fabric sequence. Adding a sync, only when an OOB socket has been initialized, orders the initialization correctly and prevents this.

The synchronization ensures the client starts getinfo only after the server has finished initializing everything.

The fi_rdm_multiclient test needs its client startup to have a manual sync since it does not follow the normal codepath through ft_init_fabric the way the server and other tests do. This sync only happens for the first client that connects, because it is only needed to give the server enough time to spin up all of its resources; later clients do not need it, since the server has already created those resources.

zachdworkin commented 3 months ago

bot:aws:retest

zachdworkin commented 3 months ago

@shijin-aws can you please tell me why the first run of AWS CI failed?

shijin-aws commented 3 months ago

One of our multi-client tests involving fi_rdm failed:

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.8.166 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_stdout:
libfabric:69633:1718843593::efa:ep_ctrl:efa_rdm_peer_destruct():56<warn> Closing EP with unacked CONNREQs in flight
Sending message...
Send completion received

client returncode: 0

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.13.34 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_stdout:

client returncode: 107

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.13.34 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -C 2 -p efa -E=9228'"'"''
server_stdout:

client_0_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.8.166 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_0_stdout:
libfabric:69633:1718843593::efa:ep_ctrl:efa_rdm_peer_destruct():56<warn> Closing EP with unacked CONNREQs in flight
Sending message...
Send completion received

client_1_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.13.34 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_1_stdout:

shijin-aws commented 3 months ago

I think the failure is related. Need to look into it further.

zachdworkin commented 3 months ago

I agree that these changes broke the test. It might need manual progression similar to what I added to fi_rdm_multiclient. This test just runs fi_rdm, which goes through the main codepath, so the second client hangs while waiting to recv on the oob socket from the server (the server only sends it once). Do you have any ideas for how to solve this problem?

j-xiong commented 3 months ago

OOB for multiple clients is handled in ft_accept_next_client, which calls ft_reset_oob(), which in turn calls ft_init_oob(). You can try adding a send inside ft_accept_next_client after ft_reset_oob is called.
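A rough sketch of where that send could land (pseudocode only: ft_accept_next_client, ft_reset_oob, and ft_init_oob are real fabtests names from this discussion, but the body, the ft_sock_send helper usage, and the variable names here are assumptions, not the actual common/shared.c code):

```c
/* Pseudocode sketch, not the actual fabtests implementation. */
static int ft_accept_next_client(void)
{
	int ret;

	/* Re-establish the OOB socket for the incoming client;
	 * ft_reset_oob() in turn calls ft_init_oob(). */
	ret = ft_reset_oob();
	if (ret)
		return ret;

	/* Suggested addition: send one sync byte here so that every
	 * new client's initial OOB recv is matched, instead of only
	 * the first client ever receiving the server's sync. */
	ret = ft_sock_send(oob_sock, &sync_byte, 1);	/* assumed helper/args */
	if (ret)
		return ret;

	/* ... continue with the normal per-client setup ... */
	return 0;
}
```

The point is symmetry: if each client's init path now begins with an OOB recv, the server must produce one matching send per accepted client, not just once at startup.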

zachdworkin commented 3 months ago

@j-xiong how is this?

j-xiong commented 3 months ago

Looks good now. Let's see if CI would find any issue.

zachdworkin commented 3 months ago

@shijin-aws can you share the AWS CI failure?

shijin-aws commented 3 months ago

The same test failed again, looking at it now

shijin-aws commented 3 months ago

So the test is run like this

server expects to get messages from 2 clients

fi_rdm -C 2 -p efa -E

Clients 1 and 2 come and go, each doing a ping-pong with the server (just run the following command twice):

fi_rdm -p efa -E <server_ip>

The server is fine when the first client comes and goes.

But when the second client comes, it starts getting this error:

ft_sock_recv(): common/shared.c:3951, ret=-107 (Transport endpoint is not connected)

shijin-aws commented 3 months ago

I can reproduce the error with TCP provider as well

ubuntu@ip-172-31-39-234:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ fi_rdm -C 2 -p tcp -E
Waiting for message from client...
Data check OK
Received data from client: Hello from Client!
ft_sock_recv(): common/shared.c:3951, ret=-107 (Transport endpoint is not connected)

So it looks like the change broke the multi-client mode of fi_rdm.