ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/psm3: av:psmx3_epid_to_epaddr crash #7741

Closed · hzhou closed this issue 1 year ago

hzhou commented 2 years ago

Describe the bug We (MPICH) just switched to psm3 as the default provider, and we started to see this test failure:

libfabric:82619:psm3:av:psmx3_epid_to_epaddr():231<warn> psm2_ep_connect returned error Operation timed out, remote epid=a00018a03.Try setting FI_PSM3_CONN_TIMEOUT to a larger value (current: 10 seconds)

Details https://github.com/pmodels/mpich/issues/5975

To Reproduce On the MPICH main branch (after build, install, and ./configure in test/mpi), run mpirun -n 5 test/mpi/coll/p_red with the following environment:

export MPIR_CVAR_ODD_EVEN_CLIQUES=1
export MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE=0
export MPIR_CVAR_IREDUCE_INTRA_ALGORITHM=tsp_tree
export MPIR_CVAR_IREDUCE_TREE_TYPE=kary
export MPIR_CVAR_IREDUCE_TREE_KVAL=3

Environment: Linux CentOS 7

Additional context It appears to be a compiler optimization bug, since it only shows up with the default gcc (4.8) compiler at the default -O2.

timothom64 commented 2 years ago

Cornelis Networks doesn't manage psm3, only psm2.

Assigning to @acgoldma for triage

timothom64 commented 2 years ago

Looks like I can't manage who this is assigned to. @hzhou, can you assign this to @acgoldma?

hzhou commented 2 years ago

I can't assign tickets either. cc @j-xiong @shefty

j-xiong commented 2 years ago

@timothom64 Sorry, I was thinking about psm2 ...... reassigned.

hzhou commented 2 years ago

I think I found the root cause. P1 sends a message to P0 (using fi_inject), sees it complete, goes on to Finalize, and exits. Meanwhile, P0 issues fi_trecv, which calls psmx3_epid_to_epaddr, which tries to connect to P1; but since P1 is gone, the connection can't complete. @acgoldma, can you confirm that's the case, and can you suggest a workaround? It is common for one process to send a small message and exit before another process tries to receive the message.

hzhou commented 2 years ago

Alright, this is actually the same issue that we encountered with the sockets provider: fi_tinject does not generate a CQ entry, and fi_close does not ensure injected messages are flushed out before closing. Applying the same workaround as we did for the sockets provider -- forcing a round of fi_tsenddata and completing it, which flushes out all the pending injections -- seems to work.

This is ugly and I believe it really should be the provider's job to make sure injected messages are flushed out before closing.
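For reference, here is a minimal sketch of that workaround, assuming a tagged endpoint `ep` bound to a transmit CQ `txcq`, a peer address `dest`, and a reserved `FLUSH_TAG` (all hypothetical names; this is not the actual MPICH patch):

```c
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>
#include <rdma/fi_cq.h>

#define FLUSH_TAG 0x7f7f7f7fULL  /* hypothetical tag reserved for the flush message */

/* Send a zero-byte fi_tsenddata (which, unlike fi_tinject, generates a CQ
 * entry) and wait for it to complete.  The idea, per the discussion above,
 * is that with send-after-send ordering its completion flushes out the
 * injected sends queued before it.  Error handling beyond EAGAIN omitted. */
static int flush_pending_injects(struct fid_ep *ep, struct fid_cq *txcq,
                                 fi_addr_t dest)
{
    struct fi_cq_tagged_entry wc;
    ssize_t ret;

    do {
        ret = fi_tsenddata(ep, NULL, 0, NULL, 0, dest, FLUSH_TAG, NULL);
    } while (ret == -FI_EAGAIN);
    if (ret)
        return (int) ret;

    do {
        ret = fi_cq_read(txcq, &wc, 1);   /* also drives progress */
    } while (ret == -FI_EAGAIN);

    return ret < 0 ? (int) ret : 0;
}
```

The receiving side would need a matching fi_trecv posted for the flush tag, and, as noted further down in the thread, this approach did not turn out to be reliable for psm3.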

shefty commented 2 years ago

Applications cannot rely on data being delivered when using inject calls. Yes, a provider can try to flush the data, but ultimately, the buffering may be outside the control of the provider, such as in the kernel. And because of flow control, there's no way to ensure that the data will be sent when an endpoint is closed.

Fabtest handles this by using delivery complete as part of its finalize processing.

hzhou commented 2 years ago

Looks like my workaround does not work reliably -- the error psmx3_epid_to_epaddr():231<warn> psm2_ep_connect returned error Operation timed out still occurs.

@shefty

And because of flow control, there's no way to ensure that the data will be sent when an endpoint is closed.

If the provider can't ensure that, how would we -- the MPI library -- ensure it? If we can't ensure message delivery, then injection is useless to us. Every provider knows how its injection works and what is necessary to ensure message delivery. If a provider cannot ensure it locally, libfabric needs to provide a query interface that tells us exactly what we need to do.

Fabtest handles this by using delivery complete as part of its finalize processing.

Are you suggesting an all-to-all fi_sendmsg with FI_DELIVERY_COMPLETE at finalize?

shefty commented 2 years ago

I'm suggesting that there needs to be some finalize step that relies on delivery complete semantics. Transmit complete might be acceptable in practice, and transmit complete is what most verbs devices adhere to. Inject complete will rarely work, nor will the inject call that never even generates a completion. Inject does not guarantee message delivery; you need a completion for that, with a strong completion semantic.
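For concreteness, the completion semantic can be requested per operation through the *msg call variants. A minimal sketch, assuming a tagged endpoint `ep`, a peer address `dest`, and a hypothetical `FIN_TAG` for a zero-byte finalize message:

```c
#include <sys/uio.h>
#include <rdma/fi_tagged.h>

#define FIN_TAG 0x7e7e7e7eULL   /* hypothetical tag for the finalize message */

/* Post a zero-byte finalize send that asks for delivery-complete semantics:
 * the completion is not generated until the data has landed in a receive
 * buffer at the target, rather than the weaker inject/transmit-complete
 * semantics.  The caller still has to reap the completion from its CQ. */
static ssize_t send_finalize_msg(struct fid_ep *ep, fi_addr_t dest)
{
    struct iovec iov = { .iov_base = NULL, .iov_len = 0 };
    struct fi_msg_tagged msg = {
        .msg_iov   = &iov,
        .iov_count = 1,
        .addr      = dest,
        .tag       = FIN_TAG,
    };

    return fi_tsendmsg(ep, &msg, FI_COMPLETION | FI_DELIVERY_COMPLETE);
}
```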

hzhou commented 2 years ago

Let's be clear that we are not discussing guaranteeing delivery of any single injected message, because that would defeat the point of using injected messages in the first place, right?

The question is what we should do before a process exits to ensure previously injected messages do not get lost because of the exit. It seems different providers and different networks all have different tricks. What do you suggest? @shefty

A side note: we are using a PMI_Barrier for the verbs provider to ensure all processes arrive at Finalize before exiting. This is very undesirable, especially when dynamic processes are involved. I'm also worried that a PMI_Barrier may not work for some providers if they require active progress to complete message delivery.

PS: it is important to note that we are not just dealing with 2 processes. Say we have 1 million processes; what kind of "finalize step" using delivery complete semantics should we do?

shefty commented 2 years ago

I'm only talking about the finalize step.

I think it will help to highlight the problem. Imagine 2 peers sending to each other using inject. They insert each other's addresses, call inject, then close their resources. Locally, each provider queues the data transfer, but before the transfer can be placed on the network, the app closes the connection. Imagine if peer 1 is running slightly ahead of peer 2 in this case. Before closing the connection, peer 1 flushes its send queue. The transfer reaches peer 2 and is acked. At that point, peer 1 is done, so all resources are cleaned up. The provider has no knowledge that peer 2 has sent a message its way, and as a result, peer 2's message gets dropped since there's no active connection at peer 1 when it arrives. We've seen this problem consistently in practice.

To simplify things, let's assume that we have send-after-send ordering. Then one way to handle this is for each peer to send a message using delivery complete semantics, and wait until both the send completes and it receives a message from the other peer prior to closing its resources. (Note that even in this case, there's the potential of losing the 'last ack'.)

At scale, we should not need to do all-to-all communication. But locally, each peer needs to know if it has received all messages targeting it. I believe some sort of ring or tree based finalize step would work if a peer has this data.
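A rough sketch of that handshake for the two-peer case, assuming SAS ordering, a tagged endpoint `ep` bound to separate transmit/receive CQs `txcq`/`rxcq`, the peer's address `peer`, and a hypothetical `FIN_TAG` (this sketches the idea only; it is not the actual fabtests code):

```c
#include <sys/uio.h>
#include <rdma/fi_tagged.h>
#include <rdma/fi_cq.h>

#define FIN_TAG 0x7e7e7e7eULL   /* hypothetical finalize tag */

/* Each peer sends a zero-byte finalize message with delivery-complete
 * semantics and waits for both (a) its own send completion and (b) the
 * arrival of the peer's finalize message before closing any resources.
 * Error handling beyond EAGAIN is omitted for brevity. */
static int finalize_handshake(struct fid_ep *ep, struct fid_cq *txcq,
                              struct fid_cq *rxcq, fi_addr_t peer)
{
    struct iovec iov = { .iov_base = NULL, .iov_len = 0 };
    struct fi_msg_tagged msg = {
        .msg_iov = &iov, .iov_count = 1, .addr = peer, .tag = FIN_TAG,
    };
    struct fi_cq_tagged_entry wc;
    ssize_t ret;
    int sent = 0, recvd = 0;

    /* Post the receive for the peer's finalize message first. */
    do {
        ret = fi_trecv(ep, NULL, 0, NULL, peer, FIN_TAG, 0, NULL);
    } while (ret == -FI_EAGAIN);
    if (ret)
        return (int) ret;

    /* Send our own finalize message with delivery-complete semantics. */
    do {
        ret = fi_tsendmsg(ep, &msg, FI_COMPLETION | FI_DELIVERY_COMPLETE);
    } while (ret == -FI_EAGAIN);
    if (ret)
        return (int) ret;

    /* Spin on both CQs (which also drives progress) until both events arrive. */
    while (!sent || !recvd) {
        if (!sent && fi_cq_read(txcq, &wc, 1) == 1)
            sent = 1;
        if (!recvd && fi_cq_read(rxcq, &wc, 1) == 1)
            recvd = 1;
    }
    return 0;
}
```

As noted above, even this leaves the usual "last ack" uncertainty, and it assumes each side knows to expect exactly one finalize message from the peer.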

hzhou commented 2 years ago

At scale, we should not need to do all-to-all communication.

Good. But we need a sure, promised way from libfabric that it will work consistently. "I believe" is not going to cut it.

But locally, each peer needs to know if it has received all messages targeting it.

In our case, it is not the receiver going away. The receiver is in a progress loop waiting for the message to arrive; it knows it has not received all messages targeting it. The problem is the sender going away, leaving the receiver hanging or crashing.

I believe some sort of ring or tree based finalize step would work if a peer has this data.

We have a finalize step using a ring-like flushing send at finalize. It works for the sockets provider, but it does not work for verbs or psm3. In the case shown in this issue, P0 and P1 never connected. Having P1 send a ring message to P2 does not fix the issue: P0 is still in its receive progress loop, not at finalize yet. My question is: at the provider level, P1 knows it has an outgoing message to P0 and knows it has not connected to P0 yet -- correct me if I am off -- so why is P1 allowed to exit?

EDIT: on double-checking, our finalize flush only sends a self message, since that is enough for the sockets provider. I guess a ring-like flush send may work, since that will ensure every process arrives at the final step. Nevertheless, my question about a better solution remains. With dynamic processes, or when each process has multiple endpoints, ring messaging is difficult to arrange. Thus we need a libfabric solution.

EDIT2: Another difficulty is that only the sender can keep stats on how many injected messages went toward how many processes -- either we or the libfabric provider can keep notes (see the sketch at the end of this comment). But a receiver does not know whether a message came from an injected send or a normal send; at least we don't know, but maybe the libfabric provider does?

EDIT3: update in https://github.com/pmodels/mpich/pull/5997; it seems in this case -- psm3 over verbs -- simply having a PMI_Barrier at finalize takes care of the issue.
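To illustrate the bookkeeping mentioned in EDIT2, here is a sketch of what sender-side tracking could look like if MPICH (or a provider) kept it; the wrapper and names are purely hypothetical:

```c
#include <stdint.h>
#include <rdma/fi_tagged.h>

#define MAX_PEERS 4096              /* hypothetical upper bound on peers */

/* Per-peer count of injected sends, bumped by a wrapper around fi_tinject
 * and consulted at finalize: only peers with a non-zero count would need an
 * explicit flush/handshake, avoiding all-to-all finalize traffic. */
static int64_t injects_to_peer[MAX_PEERS];

static ssize_t tinject_tracked(struct fid_ep *ep, const void *buf, size_t len,
                               fi_addr_t dest, uint64_t tag, int peer_rank)
{
    ssize_t ret = fi_tinject(ep, buf, len, dest, tag);
    if (ret == 0)
        injects_to_peer[peer_rank]++;
    return ret;
}
```

The open question in EDIT2 remains: only the sender has this information, so the receiver cannot tell on its own whether an arriving message was injected.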

shefty commented 2 years ago

There is no one single solution that will be optimized for all communication models.

If a sender wants to know that a receiver has a message, it needs to ask for at least FI_TRANSMIT_COMPLETE. That guarantees that the message has made it to the remote NIC. Increasing the completion semantic to FI_DELIVERY_COMPLETE guarantees the message has landed in a memory buffer owned by the process.

If SAS ordering is set, a completion guarantees that previously sent messages have completed to that same level. If SAS is not set, then the last message must use FI_FENCE to give this guarantee.

This means that if SAS is set, a sender needs to know whether the last message sent used inject. If so, it likely needs a finalize message. The unknown here is that the receiver could itself notify the sender that the message was received, which would make the finalize message unnecessary. If SAS is not set, then the sender likely needs a finalize message if any previous message was sent using inject. The unknown here is the same as before; the receiver could notify the sender that a sent message was received, though in this case the receiver likely needs to report the total number received.

If a receiver does not know if it will receive a message, then the responsibility is on the sender to ensure that it arrives.
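Putting those pieces together, a short sketch of how the sender could pick flags for its last (finalize) message. It reuses the zero-byte tagged `msg` idea from the sketch above, assumes `info` is the fi_info the endpoint was opened with, and assumes the FI_FENCE capability was requested:

```c
#include <rdma/fi_tagged.h>

/* If send-after-send ordering is in effect, a delivery-complete completion on
 * the last message covers everything sent before it; otherwise fence the last
 * message so it cannot start until prior sends to this peer have completed. */
static ssize_t send_last_msg(struct fid_ep *ep, const struct fi_info *info,
                             const struct fi_msg_tagged *msg)
{
    uint64_t flags = FI_COMPLETION | FI_DELIVERY_COMPLETE;

    if (!(info->tx_attr->msg_order & FI_ORDER_SAS))
        flags |= FI_FENCE;

    return fi_tsendmsg(ep, msg, flags);
}
```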

hzhou commented 2 years ago

Regarding how to finalize without losing injected messages: libfabric has all the information we have, plus more internal mechanism details. We are a communication library just as libfabric is. What do you think we can do at finalize that libfabric cannot do upon closing an endpoint? Otherwise, I think your answer is essentially that this is the user's responsibility. Unfortunately, we cannot pass that responsibility on to our users in the same way.

PS: we can arrange a barrier at finalize, which libfabric can't do. If that is what's needed, we'd like a clear statement from libfabric on whether a barrier at finalize is sufficient. Does it matter if we do a PMI_Barrier -- one that doesn't invoke libfabric progress?

shefty commented 2 years ago

libfabric does not have full knowledge of the application. It can only track point-to-point transfers. It also should not set a policy for what should be done with pending data transfers when the application decides to close an endpoint. That could result in close taking an arbitrarily long time.

libfabric cannot make any claims about an application implementation of some higher level collective operation, such as barrier. If the data progress model is manual, then progress is not guaranteed unless the app calls into libfabric.

hzhou commented 2 years ago

libfabric does not have full knowledge of the application. It can only track point to point transfers.

MPI does not have full knowledge of the application either. Collectives all break down into point-to-point messages; it is not practical for us to control/track collective algorithms either.

It also should not set a policy for what should be done with pending data transfers when the application decides to close an endpoint. That could result in close taking an arbitrary long time.

If not as a policy, it can be provided as an option. We can then choose between correctness and a controlled finalize time. And in the case of an arbitrarily long wait, we would need to understand what it is waiting for.

libfabric cannot make any claims about an application implementation of some higher level collective operation, such as barrier. If the data progress model is manual, then progress is not guaranteed unless the app calls into libfabric.

We are not asking for such claims. We are asking what you need. If you need progress calls into libfabric, then that's what we have to do. If you need a barrier, then we'll do that. Currently, with the verbs provider, it seems a barrier without libfabric progress is enough. We don't have much of a clue why some providers/hardware require extra measures (beyond what the libfabric man pages describe) while other providers/hardware do fine. Libfabric is supposed to manage these provider specifics for us, right? I mean, provider-specific behavior is fine, but there should be a standard way to query/manage it, right?

==== We are not making progress. One of the reasons is that I don't understand why a provider does not know whether it can tear down a resource without losing an in-flight message, injected or not. You insist this is unsolvable, but you cannot convince me by simply repeating the statement. Can you pick a particular provider, explain its mechanism for implementing injected messages, and explain why it can't possibly check and tell whether a resource (endpoint) can be safely closed?

hzhou commented 2 years ago

Just to note that we have worked around this issue (psm3 over verbs) by forcing a PMI_Barrier at MPI_Finalize. Although we still wish for a cleaner solution from the provider, we are okay with closing this issue if a consensus cannot be reached in the near term.
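For completeness, the workaround amounts to this ordering at teardown (a sketch with a hypothetical endpoint `ep`; PMI_Barrier is the PMI-1 call and does not drive libfabric progress):

```c
#include <pmi.h>
#include <rdma/fabric.h>

/* At MPI_Finalize, after draining local completions: do not start tearing
 * down the endpoint until every process has reached finalize.  Per the
 * discussion above this was sufficient for psm3 over verbs, but it is not
 * a general libfabric guarantee. */
static int finalize_teardown(struct fid_ep *ep)
{
    int rc = PMI_Barrier();
    if (rc != PMI_SUCCESS)
        return rc;
    return fi_close(&ep->fid);
}
```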

ToddRimmer commented 2 years ago

@hzhou - you may also want to look at OpenMPI and check with some of the Intel MPICH experts on how this is handled by those MPIs.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.