Closed jhh67 closed 1 year ago
Someone from AWS will need to discuss efa-specific behavior. Are you using auto or manual progress? Is the peer making calls into libfabric (e.g. fi_cq_read) during this time?
It sounds like the handshake is handled using RMA write operations. Can you elaborate on what event you're expecting to see at the peer? You also mention that the first node sends the handshake, which is received, but the first node also continues to receive FI_EAGAIN on calls to fi_write. I don't understand the flow here.
Hi, thank you for reporting this!
One process calls fi_write to do an RMA write to the other process. This triggers a handshake exchange.
The first node sends a handshake to the second node, which is received. The second node responds with a handshake, which is successfully sent, but the first node never receives a completion event.
Which node called fi_write here, the first or the second?
Also, a handshake is triggered by requests. For example, when the application calls fi_write on node A, node A sends a request to node B, and node B sends a handshake back.
A handshake will not trigger another handshake: upon receiving a handshake, a node will not send one in response.
Thank you for getting back to me. I'm using manual progress and "delivery complete". Here is the scenario as best I can determine. The application on Node A calls fi_write. Node A sends a handshake request to Node B. The fi_write returns FI_EAGAIN because a handshake has not been received, which causes the application to call fi_cq_read and fi_write again in a loop. Node B receives the request and responds with a handshake. Node A never receives the handshake. At this point the application on Node A is stuck in a loop repeatedly calling fi_cq_read and fi_write.
Hi, John,
From your description, the problem is:
Node B receives the request and responds with a handshake. Node A never receives the handshake.
Did Node B call the function rxr_pkt_post_handshake, and did that function return 0?
Also, is it possible for you to provide a reproducer?
@jhh67 is our libfabric expert here, but I believe he started vacation today, so I'll try to provide some higher level context in case it's helpful. We work on the Chapel programming language (https://chapel-lang.org / https://github.com/chapel-lang/chapel) and as a background task we've been trying to get our libfabric communication layer working with EFA.
We thought we had things working after adjusting to some EFA differences (e.g. not being able to share address vectors), but after upgrading from libfabric 1.11 to 1.13.2 we started seeing hangs that John bisected to af1a3eaa0dddaebeaac723dc97188379c9bc46ac, which added the handshake. We've been trying to figure out if we're doing something wrong on our end or if there's a provider bug, but we've started to feel out of our depth and wanted to reach out here.
Right now our only reproducer is a full Chapel program, which we were assuming is far too big of a reproducer to be helpful. I believe John tried getting some fabtests running but the few he was able to run didn't hit this issue. From some internal notes, it looks like John was going to try modifying the fi_rma_bw fabtest, but wasn't able to get that test running with EFA (which could easily just be due to lack of familiarity with running fabtests on our part).
@ronawho
Thank you for the information!
As I mentioned before, the behavior John described ("Node B receives the request and responds with a handshake. Node A never receives the handshake") is the cause of the issue, but I cannot think of a scenario in which it would happen. My suggestion is to check that rxr_pkt_post_handshake was called and returned 0 on node B, i.e. that the handshake was successfully sent.
To use fi_rma_bw with EFA, you need two EFA instances. On the first one, run:
fi_rma_bw -p efa -E -o write -U
On the second one, run:
fi_rma_bw -p efa -E -o write -U <ip_of_first_instance>
The -U option enables delivery complete. Maybe comparing the behavior of fi_rma_bw and Chapel will help you?
Besides, the btl module of Open MPI also uses EFA's fi_write with delivery complete, but that is a much bigger code base. A benchmark like osu_put_bw will trigger that code path.
Hope it helps!
@ronawho
As I was writing this, I realized there is one scenario in which what John described ("Node B receives the request and responds with a handshake. Node A never receives the handshake") can happen:
When Node B sends the handshake, node A is not ready to receive data yet, e.g. it has not posted a receive buffer. In this case, EFA on node B will get an RNR error. The handshake packet will be queued and resent by node B's progress engine.
What I am trying to say is that for node A to receive the handshake, node B must keep calling fi_cq_read too, which drives EFA's progress engine.
We think we've identified the problem. The nodes have separate transmit and receive endpoints. When Node A calls fi_write and gets FI_EAGAIN it repeatedly calls fi_cq_read and fi_write on the transmit endpoint, but does not call fi_cq_read on the receive endpoint. As a result, the handshake is not received. Does that seem correct? Should FI_PROGRESS_AUTO solve the problem? I tried that, but it seemed to have no effect.
Can you describe your application in more detail?
There should be two nodes communicating. Does each node have 2 endpoints (so 4 endpoints in total)?
The application creates one transmit endpoint per core, and one receive endpoint for the entire application. So at a minimum there are two endpoints per process -- one for transmit and one for receive.
I see.
Yes, you will need to call fi_cq_read on the receive endpoint (in this case it is not receiving, but is the target of the write).
EFA does not support FI_PROGRESS_AUTO; the provider should have failed when you requested FI_PROGRESS_AUTO.
FWIW, if you want to get a CQ entry on the target of write, you can use the FI_REMOTE_CQ_DATA flag on the initiator of the write.
See https://ofiwg.github.io/libfabric/v1.6.1/man/fi_cq.3.html and search for FI_REMOTE_CQ_DATA.
What about the initiator of the write, does it also have to do fi_cq_read on the receive endpoint because the handshake must be received behind the scenes?
The fi_domain man page says "All providers are required to support FI_PROGRESS_AUTO". Is that incorrect? When I specified FI_PROGRESS_AUTO I didn't get an error and the provider info returned by fi_getinfo had FI_PROGRESS_AUTO set.
What about the initiator of the write, does it also have to do fi_cq_read on the receive endpoint because the handshake must be received behind the scenes?
The initiator of the write only needs to call fi_cq_read on the endpoint that issued the write (not on the receiving endpoint).
Specifically, the application called fi_write with an ep, the application has a CQ bound to that ep, and the application needs to call fi_cq_read on that CQ (the one bound to the ep that called fi_write).
For your application, it looks like it opened 2 endpoints. One endpoint is used to issue the write; the application needs to call fi_cq_read on the CQ that is bound to that endpoint.
I've determined through experimentation that it is necessary to progress a receive endpoint in order for handshakes to be received. To recap, a process calls fi_write on a transmit-only endpoint, which triggers a handshake request to the remote endpoint. If the process does not progress a receive endpoint by calling fi_cq_read on it, the handshake is never received from the remote endpoint and the fi_write returns FI_EAGAIN indefinitely. The fi_efa documentation should be updated to reflect this requirement.
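The requirement described above can be illustrated with a small toy model. This is only a sketch and does not call libfabric: every `toy_*` name is hypothetical, `toy_write` stands in for fi_write returning FI_EAGAIN until the handshake has been processed, and the two `toy_progress_*` functions stand in for fi_cq_read on the CQ of the transmit and receive endpoints. It assumes, as observed in this thread, that the handshake is only processed when the receive endpoint is progressed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for one process. None of these names are libfabric API. */
struct toy_node {
    bool handshake_on_wire; /* peer's handshake has arrived at the receive endpoint */
    bool handshake_seen;    /* the handshake has been processed by a progress call */
};

/* Stands in for fi_cq_read() on the transmit endpoint's CQ.
 * In this model it never touches packets delivered to the receive endpoint. */
void toy_progress_tx(struct toy_node *n)
{
    (void)n;
}

/* Stands in for fi_cq_read() on the receive endpoint's CQ.
 * Progressing this endpoint is what consumes the pending handshake. */
void toy_progress_rx(struct toy_node *n)
{
    if (n->handshake_on_wire)
        n->handshake_seen = true;
}

/* Stands in for fi_write(): 0 on success, -1 for "try again" (FI_EAGAIN). */
int toy_write(const struct toy_node *n)
{
    return n->handshake_seen ? 0 : -1;
}

/* Retry loop. Returns the attempt number on which the write succeeded,
 * or -1 if max_tries attempts were all rejected. */
int toy_write_with_retry(struct toy_node *n, bool progress_rx, int max_tries)
{
    assert(max_tries > 0);
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (toy_write(n) == 0)
            return attempt;
        toy_progress_tx(n);         /* what the application was already doing */
        if (progress_rx)
            toy_progress_rx(n);     /* the missing step identified above */
    }
    return -1;
}
```

In this model, a node whose handshake has arrived never completes the write when only the transmit side is progressed (`toy_write_with_retry(&node, false, 1000)` returns -1), and completes it on the second attempt once the receive side is progressed as well (`toy_write_with_retry(&node, true, 1000)` returns 2).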
The fi_efa documentation should be updated to reflect this requirement.
Thank you! We will update our documentation accordingly.
Also, regarding your earlier comment:
The fi_domain man page says All providers are required to support FI_PROGRESS_AUTO. Is that incorrect? When I specified FI_PROGRESS_AUTO I didn't get an error and the provider info returned by fi_getinfo had FI_PROGRESS_AUTO set.
The community discussed this and agreed that requiring every provider to implement FI_PROGRESS_AUTO is too high a burden, so the line you were referring to has been removed from the libfabric documentation.
Thanks for the update @wzamazon. If I'm following correctly, I was curious about this:
When I specified FI_PROGRESS_AUTO I didn't get an error and the provider info returned by fi_getinfo had FI_PROGRESS_AUTO set.
Specifically: Do you see this in your runs as well, and if so, is it a bug?
When I specified FI_PROGRESS_AUTO I didn't get an error and the provider info returned by fi_getinfo had FI_PROGRESS_AUTO set.
I have not tested this, but I believe you, and this is a bug. We will look into it.
@jhh67
We will proceed with the necessary documentation change.
Unfortunately, some existing EFA customers are (wrongly) asking for FI_PROGRESS_AUTO. Changing the code to drop support for FI_PROGRESS_AUTO would break them, so at the moment we cannot make that change. Long term, we will look into implementing true support for FI_PROGRESS_AUTO.
Is this issue blocking you from getting your application working?
It's not blocking me, we are not using FI_PROGRESS_AUTO. I tried it while debugging my problem with lack of progress.
While developing the Chapel runtime to use the efa provider I've run into a situation in which communication hangs exchanging handshake messages. I have two processes running on two different AWS nodes and I'm using the "delivery complete" semantics. One process calls fi_write to do an RMA write to the other process. This triggers a handshake exchange. The first node sends a handshake to the second node, which is received. The second node responds with a handshake, which is successfully sent, but the first node never receives a completion event. The first node continues to receive FI_EAGAIN on the calls to fi_write and continues to call fi_cq_read, but never gets the event. It appears to be a race, as it doesn't happen every time and its behavior changes depending on the amount of debugging output from both libfabric and my code. Any suggestions on how to debug this?