Closed zachdworkin closed 3 months ago
bot:aws:retest
@shijin-aws can you please tell me why the first run of AWS CI failed?
We have a multi-client test involving fi_rdm that failed:
client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.8.166 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_stdout:
libfabric:69633:1718843593::efa:ep_ctrl:efa_rdm_peer_destruct():56<warn> Closing EP with unacked CONNREQs in flight
Sending message...
Send completion received
client returncode: 0
client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.13.34 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_stdout:
client returncode: 107
server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.13.34 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -C 2 -p efa -E=9228'"'"''
server_stdout:
client_0_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.8.166 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_0_stdout:
libfabric:69633:1718843593::efa:ep_ctrl:efa_rdm_peer_destruct():56<warn> Closing EP with unacked CONNREQs in flight
Sending message...
Send completion received
client_1_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.13.34 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_1_stdout:
I think the failure is related. I need to look into it further.
I agree that these changes broke the test. I think it might need some manual progression like how I had to add it to fi_rdm_multi_client. It looks like this test is just doing fi_rdm which goes through the main codepath and will have the second client hang while waiting to recv the oob socket from the server (the server only sends it once). Do you have any ideas for how to solve this problem?
OOB for multiple clients is handled in ft_accept_next_client, which calls ft_reset_oob(), which in turn calls ft_init_oob(). You can try adding a send inside ft_accept_next_client after ft_reset_oob is called.
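To make the suggestion concrete, here is a minimal sketch of the pattern, not the fabtests code itself. It assumes the server must repeat its "ready" notification for every client it accepts. The helpers send_ready, wait_ready, and serve_clients are hypothetical names, and socketpair() stands in for the OOB connection that ft_reset_oob would re-establish.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: the server tells one client that it is ready. */
static int send_ready(int fd)
{
	uint8_t ready = 1;
	ssize_t n = send(fd, &ready, sizeof(ready), 0);

	return n == (ssize_t) sizeof(ready) ? 0 : -1;
}

/* Hypothetical helper: the client blocks until the ready byte arrives. */
static int wait_ready(int fd)
{
	uint8_t ready = 0;
	ssize_t n = recv(fd, &ready, sizeof(ready), MSG_WAITALL);

	return (n == (ssize_t) sizeof(ready) && ready == 1) ? 0 : -1;
}

/*
 * Serve nclients clients one after another. Every iteration gets a fresh
 * connection (socketpair stands in for re-accepting on the OOB listener)
 * and the ready notification is sent again on that new connection.
 * Returns the number of clients that received it, or -1 on socket failure.
 */
static int serve_clients(int nclients)
{
	int served = 0;

	for (int i = 0; i < nclients; i++) {
		int sv[2]; /* sv[0]: server end, sv[1]: client end */

		if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
			return -1;
		if (send_ready(sv[0]) == 0 && wait_ready(sv[1]) == 0)
			served++;
		close(sv[0]);
		close(sv[1]);
	}
	return served;
}
```

If the send happened only before the loop, the second client's wait_ready would have nothing to receive, which matches the hang described above.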
@j-xiong how is this?
Looks good now. Let's see if CI would find any issue.
@shijin-aws can you share the AWS CI failure?
The same test failed again, looking at it now
So the test is run like this:
The server expects to get messages from 2 clients:
fi_rdm -C 2 -p efa -E
Clients 1 and 2 each connect, do a ping-pong with the server, and leave (just run the following command twice):
fi_rdm -p efa -E <server_ip>
The server does OK when the first client comes and leaves.
But when the second client comes, the server starts to get this error:
ft_sock_recv(): common/shared.c:3951, ret=-107 (Transport endpoint is not connected)
I can reproduce the error with TCP provider as well
ubuntu@ip-172-31-39-234:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ fi_rdm -C 2 -p tcp -E
Waiting for message from client...
Data check OK
Received data from client: Hello from Client!
ft_sock_recv(): common/shared.c:3951, ret=-107 (Transport endpoint is not connected)
So it looks like the change broke the multi-client mode of fi_rdm.
fabtests: Synchronize on Initialization
fabtests/functional: Add manual init sync to fi_rdm_multiclient
Some providers (verbs ud) might require the server to be fully initialized before the client process calls getinfo with the server address. Otherwise the fi_info call fails during initialization with a No Data Available error because it cannot find the name on the server. This is seen most often when a socket (usually OOB, i.e. out-of-band) is initialized before getinfo in the ft_init_fabric sequence. Adding a sync, only when an OOB socket has been initialized, orders the initialization correctly and prevents this from happening.
The synchronization ensures that the client starts getinfo only after the server has finished initializing everything.
fi_rdm_multiclient test needs its client startup to have a manual sync since it does not follow the normal codepath of going through ft_init_fabric like the server and other tests do. This sync will only happen on the first client that connects because it is only necessary to give the server enough time to spin up all the resources. It is not necessary for future clients because the server has already started all of those resources.
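The ordering described in these commit messages can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: init_sync is a hypothetical helper, a negative descriptor is assumed to mean "no OOB socket was initialized", and a single byte stands in for whatever the real sync exchanges.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical init-time sync. The server calls it after it has finished
 * initializing; the client calls it before it starts getinfo.
 * When there is no OOB socket (oob_fd < 0) the sync is skipped entirely.
 * Returns 0 on success or when skipped, -1 on failure.
 */
static int init_sync(int oob_fd, int is_server)
{
	uint8_t token = 1;
	ssize_t n;

	if (oob_fd < 0)
		return 0; /* nothing to order without an OOB socket */

	if (is_server)
		n = send(oob_fd, &token, sizeof(token), 0);
	else
		n = recv(oob_fd, &token, sizeof(token), MSG_WAITALL);

	return n == (ssize_t) sizeof(token) ? 0 : -1;
}
```

Because the client side blocks in recv(), it cannot reach getinfo until the server has sent its token, which is the ordering the commit message asks for.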