ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov:EFA hang exchanging handshakes #7313

Closed jhh67 closed 1 year ago

jhh67 commented 2 years ago

While developing the Chapel runtime to use the efa provider I've run into a situation in which communication hangs exchanging handshake messages. I have two processes running on two different AWS nodes and I'm using the "delivery complete" semantics. One process calls fi_write to do an RMA write to the other process. This triggers a handshake exchange. The first node sends a handshake to the second node, which is received. The second node responds with a handshake, which is successfully sent, but the first node never receives a completion event. The first node continues to receive FI_EAGAIN on the calls to fi_write and continues to call fi_cq_read, but never gets the event. It appears to be a race, as it doesn't happen every time and its behavior changes depending on the amount of debugging output from both libfabric and my code. Any suggestions on how to debug this?

shefty commented 2 years ago

Someone from AWS will need to discuss efa-specific behavior. Are you using auto or manual progress? Is the peer making calls into libfabric (e.g. fi_cq_read) during this time?

It sounds like the handshake is handled using RMA write operations. Can you elaborate on what event you're expecting to see at the peer? You also mention that the first node sends the handshake, which is received, but the first node also continues to receive FI_EAGAIN on calls to fi_write. I don't understand the flow here.

wzamazon commented 2 years ago

Hi, thank you for reporting this!

One process calls fi_write to do an RMA write to the other process. This triggers a handshake exchange.

The first node sends a handshake to the second node, which is received. The second node responds with a handshake, which is successfully sent, but the first node never receives a completion event.

Which node called fi_write here? The first or the second?

Also, a handshake is triggered by a request. For example, when the application calls fi_write on node A, node A sends a request to node B, and node B sends a handshake back.

A handshake does not trigger another handshake: upon receiving a handshake, a node will not send one in return.

jhh67 commented 2 years ago

Thank you for getting back to me. I'm using manual progress and "delivery complete". Here is the scenario as best I can determine. The application on Node A calls fi_write. Node A sends a handshake request to Node B. The fi_write returns FI_EAGAIN because a handshake has not been received, which causes the application to call fi_cq_read and fi_write again in a loop. Node B receives the request and responds with a handshake. Node A never receives the handshake. At this point the application on Node A is stuck in a loop repeatedly calling fi_cq_read and fi_write.

wzamazon commented 2 years ago

Hi, John,

From your description, the problem is:

Node B receives the request and responds with a handshake. Node A never receives the handshake.

Did Node B call the function rxr_pkt_post_handshake, and did that function return 0?

Also, is it possible for you to provide a reproducer?

ronawho commented 2 years ago

@jhh67 is our libfabric expert here, but I believe he started vacation today, so I'll try to provide some higher-level context in case it's helpful. We work on the Chapel programming language (https://chapel-lang.org / https://github.com/chapel-lang/chapel) and as a background task we've been trying to get our libfabric communication layer working with EFA.

We thought we had things working after adjusting to some EFA differences (e.g. not being able to share address vectors), but after upgrading from libfabric 1.11 to 1.13.2 we started seeing hangs that John bisected to af1a3eaa0dddaebeaac723dc97188379c9bc46ac, which added the handshake. We've been trying to figure out if we're doing something wrong on our end or if there's a provider bug, but we've started to feel out of our depth and wanted to reach out here.

Right now our only reproducer is a full Chapel program, which we were assuming is far too big of a reproducer to be helpful. I believe John tried getting some fabtests running but the few he was able to run didn't hit this issue. From some internal notes, it looks like John was going to try modifying the fi_rma_bw fabtest, but wasn't able to get that test running with EFA (which could easily just be due to lack of familiarity with running fabtests on our part.)

wzamazon commented 2 years ago

@ronawho

Thank you for the information!

As I mentioned before, the phenomenon John described, "Node B receives the request and responds with a handshake. Node A never receives the handshake", is the cause of the issue. But I cannot think of a scenario in which it would happen. My suggestion is to check that rxr_pkt_post_handshake was called and returned 0 on node B, i.e. that the handshake was successfully sent.

To use fi_rma_bw with EFA, you need two EFA instances. On the first one, run:

fi_rma_bw -p efa -E -o write -U

On the second one, run:

fi_rma_bw -p efa -E -o write -U <ip_of_first_instance>

The option -U enables delivery complete. Maybe comparing the behavior of fi_rma_bw and Chapel will help you?

Besides, the btl module of Open MPI also uses EFA's fi_write with delivery complete, but that is a much bigger code base. A benchmark like osu_put_bw will trigger that code path.

Hope it helps!

wzamazon commented 2 years ago

@ronawho

As I was writing this, I realized there is one scenario in which what John described ("Node B receives the request and responds with a handshake. Node A never receives the handshake") can happen:

When node B sends the handshake, node A is not yet ready to receive data, i.e. it has not posted a receive buffer. In this case EFA on node B gets an RNR (receiver not ready) error. The handshake packet is queued and resent by node B's progress engine.

What I am trying to say is that for node A to receive the handshake, node B must keep calling fi_cq_read too, because that call drives EFA's progress engine.
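The queue-and-resend behavior described above can be pictured with a small toy model. This is only a sketch under stated assumptions: it does not use libfabric, and every name in it (toy_sender, toy_receiver, toy_send_handshake, toy_sender_progress) is hypothetical and stands in for provider internals that are not shown in this thread. It illustrates why the sender of the handshake has to keep driving progress: a handshake that hit "receiver not ready" stays queued until the sender's progress step runs again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the two nodes; not libfabric types. */
struct toy_receiver {
    bool buffer_posted;   /* has node A posted a receive buffer? */
    bool handshake_seen;  /* has node A received the handshake? */
};

struct toy_sender {
    bool handshake_queued; /* handshake waiting to be resent by node B */
};

/* First send attempt: delivered if the receiver is ready, otherwise
 * queued for a later resend (modeling the RNR case). */
void toy_send_handshake(struct toy_sender *s, struct toy_receiver *r)
{
    assert(s != NULL && r != NULL);
    if (r->buffer_posted)
        r->handshake_seen = true;
    else
        s->handshake_queued = true;
}

/* Stand-in for the progress step that a CQ read on the sender would run.
 * A queued handshake is resent only here. */
void toy_sender_progress(struct toy_sender *s, struct toy_receiver *r)
{
    assert(s != NULL && r != NULL);
    if (s->handshake_queued && r->buffer_posted) {
        r->handshake_seen = true;
        s->handshake_queued = false;
    }
}
```

In this model, posting the receive buffer on node A is not enough by itself. The handshake arrives only once node B runs its progress step again, which in a real application corresponds to node B continuing to call fi_cq_read.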

jhh67 commented 2 years ago

We think we've identified the problem. The nodes have separate transmit and receive endpoints. When Node A calls fi_write and gets FI_EAGAIN it repeatedly calls fi_cq_read and fi_write on the transmit endpoint, but does not call fi_cq_read on the receive endpoint. As a result, the handshake is not received. Does that seem correct? Should FI_PROGRESS_AUTO solve the problem? I tried that, but it seemed to have no effect.

wzamazon commented 2 years ago

Can you describe your application in more detail?

There should be two nodes communicating. Does each node have 2 endpoints (so 4 endpoints in total)?

jhh67 commented 2 years ago

The application creates one transmit endpoint per core, and one receive endpoint for the entire application. So at a minimum there are two endpoints per process -- one for transmit and one for receive.

wzamazon commented 2 years ago

I see.

Yes, you will need to call fi_cq_read on the receive endpoint (in this case it is not receiving messages, but it is the target of the write).

EFA does not support FI_PROGRESS_AUTO; the provider should have failed when you requested FI_PROGRESS_AUTO.

FWIW, if you want to get a CQ entry on the target of write, you can use the FI_REMOTE_CQ_DATA flag on the initiator of the write.

See: https://ofiwg.github.io/libfabric/v1.6.1/man/fi_cq.3.html

and search for FI_REMOTE_CQ_DATA.

jhh67 commented 2 years ago

What about the initiator of the write, does it also have to do fi_cq_read on the receive endpoint because the handshake must be received behind the scenes?

The fi_domain man page says "All providers are required to support FI_PROGRESS_AUTO." Is that incorrect? When I specified FI_PROGRESS_AUTO I didn't get an error, and the provider info returned by fi_getinfo had FI_PROGRESS_AUTO set.

wzamazon commented 2 years ago

What about the initiator of the write, does it also have to do fi_cq_read on the receive endpoint because the handshake must be received behind the scenes?

The initiator of the write only needs to call fi_cq_read for the endpoint that issued the write (not for the receiving endpoint).

Specifically, the application called fi_write on an ep and has a CQ bound to that ep, so the application needs to call fi_cq_read on that CQ (the one bound to the ep that called fi_write).

Your application appears to open 2 endpoints. One endpoint is used to issue the write, and the application needs to call fi_cq_read on the CQ that is bound to that endpoint.

jhh67 commented 2 years ago

I've determined through experimentation that it is necessary to progress a receive endpoint in order for handshakes to be received. To recap, a process calls fi_write on a transmit-only endpoint, which triggers a handshake request to the remote endpoint. If the process does not progress a receive endpoint by calling fi_cq_read on it, the handshake is never received from the remote endpoint and the fi_write returns FI_EAGAIN indefinitely. The fi_efa documentation should be updated to reflect this requirement.
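The requirement recapped above can be sketched as a retry helper that polls every completion queue between attempts, not just the one bound to the transmit endpoint. This is a minimal sketch under stated assumptions and is not code from Chapel or libfabric: retry_with_progress, the callback types, the toy_* model, and OP_AGAIN (a stand-in for -FI_EAGAIN so the block compiles without libfabric) are all hypothetical. In a real program the try_op callback would wrap fi_write and each poll callback would wrap fi_cq_read on one CQ.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for -FI_EAGAIN so this sketch compiles without libfabric. */
#define OP_AGAIN (-11)

/* Hypothetical callback types (see the paragraph above). */
typedef int (*try_op_fn)(void *ctx);
typedef void (*poll_fn)(void *ctx);

struct poller {
    poll_fn poll;
    void *ctx;
};

/* Retry an operation, polling every registered CQ between attempts.
 * Returns the operation's result, or OP_AGAIN if max_attempts runs out. */
int retry_with_progress(try_op_fn try_op, void *op_ctx,
                        const struct poller *pollers, size_t n_pollers,
                        unsigned max_attempts)
{
    assert(try_op != NULL);
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        int rc = try_op(op_ctx);
        if (rc != OP_AGAIN)
            return rc;
        for (size_t i = 0; i < n_pollers; i++)
            pollers[i].poll(pollers[i].ctx);
    }
    return OP_AGAIN;
}

/* Toy model of the hang: the write succeeds only after the handshake has
 * been picked up, and only polling the receive side picks it up. */
struct toy_node {
    int handshake_received;
};

int toy_write(void *ctx)
{
    struct toy_node *n = ctx;
    return n->handshake_received ? 0 : OP_AGAIN;
}

void toy_poll_tx(void *ctx)
{
    (void)ctx; /* the transmit CQ never sees the handshake in this model */
}

void toy_poll_rx(void *ctx)
{
    struct toy_node *n = ctx;
    n->handshake_received = 1;
}

/* Polls only the transmit side: the write never gets past OP_AGAIN. */
int demo_tx_only(void)
{
    struct toy_node node = { 0 };
    struct poller pollers[] = { { toy_poll_tx, &node } };
    return retry_with_progress(toy_write, &node, pollers, 1, 1000);
}

/* Polls both sides: the write succeeds on the second attempt. */
int demo_tx_and_rx(void)
{
    struct toy_node node = { 0 };
    struct poller pollers[] = {
        { toy_poll_tx, &node },
        { toy_poll_rx, &node },
    };
    return retry_with_progress(toy_write, &node, pollers, 2, 1000);
}
```

The attempt limit is a design choice for the sketch: it turns an indefinite hang into a visible OP_AGAIN result, which makes a missing poller easier to spot during debugging.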

wzamazon commented 2 years ago

The fi_efa documentation should be updated to reflect this requirement.

Thank you! We will update our documentation accordingly.

Also, regarding your earlier comment:

The fi_domain man page says "All providers are required to support FI_PROGRESS_AUTO." Is that incorrect? When I specified FI_PROGRESS_AUTO I didn't get an error, and the provider info returned by fi_getinfo had FI_PROGRESS_AUTO set.

The community discussed this and agreed that requiring every provider to implement FI_PROGRESS_AUTO is too high a burden, so the line you were referring to has been removed from the libfabric documentation.

bradcray commented 2 years ago

Thanks for the update @wzamazon. If I'm following correctly, I was curious about this:

When I specified FI_PROGRESS_AUTO I didn't get an error, and the provider info returned by fi_getinfo had FI_PROGRESS_AUTO set.

Specifically: Do you see this in your runs as well, and if so, is it a bug?

wzamazon commented 2 years ago

When I specified FI_PROGRESS_AUTO I didn't get an error, and the provider info returned by fi_getinfo had FI_PROGRESS_AUTO set.

I have not tested this, but I believe you, and this is a bug. We will look into it.

wzamazon commented 2 years ago

@jhh67

We will proceed with the necessary documentation change.

Unfortunately, some existing EFA customers are (wrongly) requesting FI_PROGRESS_AUTO. Changing the code to drop support for FI_PROGRESS_AUTO would break them, so at this moment we cannot make that change. Long term, we will look into implementing true support for FI_PROGRESS_AUTO.

Is this issue blocking you from getting your application working?

jhh67 commented 2 years ago

It's not blocking me; we are not using FI_PROGRESS_AUTO. I tried it while debugging my problem with lack of progress.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.