ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/shm: In-progress send via CMA(iov protocol) blocks following sends #9853

Open wenduwan opened 4 months ago

wenduwan commented 4 months ago

Describe the bug We have observed an MPI application hang in shm between 2 processes where: Receiver:

Sender:

While the application waits for all requests to complete, we observed:

Upon investigation, we found Send2 was using prov/shm's iov protocol via CMA, and it was stuck in progress due to the absence of a matching recv. Furthermore, it blocked subsequent send operations, i.e. Send3.

To Reproduce I will provide a simpler reproducer later.

Expected behavior I'm under the assumption that send operations should be progressed and completed independently, and not block subsequent sends.

In this case, we should at least ensure that Recv1, Recv2, Send1 and Send3 all complete.

Environment: Multiple OS including Amazon Linux 2 and Ubuntu 22.04

hppritcha commented 4 months ago

Can you check whether you see this issue with the 4.1.x release stream? The CMA usage has been around for a long time, so I'm surprised the btl/vader (aka sm) BTL isn't handling this sort of case correctly.

shijin-aws commented 4 months ago

I talked with @aingerson and she pointed out that changing the statement at https://github.com/ofiwg/libfabric/blob/main/prov/shm/src/smr_progress.c#L221 to continue may work. But IIUC we would need a further change to advance the pointer to the next cirque entry. Currently the loop always polls the head of the cirque and discards the progressed entry to move forward, so an entry that cannot progress keeps everything behind it from being polled.
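To make the head-of-queue problem concrete, here is a minimal toy model in C. It is not libfabric's code: `struct pending_send`, `struct send_queue`, `queue_push`, `progress_head_only`, and `progress_with_cursor` are all hypothetical names, and the `can_complete` flag just stands in for "a matching recv exists". The first progress function stops at a stalled head entry; the second walks past it with a cursor and only retires finished entries from the head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 8

/* Hypothetical pending-send entry; not libfabric's real structure. */
struct pending_send {
    int  id;
    bool can_complete; /* false models a CMA send with no matching recv yet */
    bool done;
};

/* Hypothetical circular queue of pending sends. */
struct send_queue {
    struct pending_send entries[QUEUE_CAP];
    size_t head;  /* index of the oldest entry */
    size_t count; /* number of entries still queued */
};

void queue_push(struct send_queue *q, int id, bool can_complete)
{
    assert(q->count < QUEUE_CAP);
    size_t tail = (q->head + q->count) % QUEUE_CAP;
    q->entries[tail] = (struct pending_send){
        .id = id, .can_complete = can_complete, .done = false
    };
    q->count++;
}

/* Head-only progress: stop at the first entry that cannot complete. */
size_t progress_head_only(struct send_queue *q)
{
    size_t completed = 0;
    while (q->count > 0) {
        struct pending_send *e = &q->entries[q->head];
        if (!e->can_complete)
            break; /* every entry behind this one is now stuck too */
        e->done = true;
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        completed++;
    }
    return completed;
}

/* Cursor-based progress: skip stalled entries instead of stopping. */
size_t progress_with_cursor(struct send_queue *q)
{
    size_t completed = 0;
    for (size_t i = 0; i < q->count; i++) {
        struct pending_send *e = &q->entries[(q->head + i) % QUEUE_CAP];
        if (e->done || !e->can_complete)
            continue; /* move the cursor on; do not touch the head */
        e->done = true;
        completed++;
    }
    /* Retire only the contiguous run of finished entries at the head. */
    while (q->count > 0 && q->entries[q->head].done) {
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
    }
    return completed;
}
```

With three queued sends where only the second is stalled, `progress_head_only` completes the first and leaves the third waiting, while `progress_with_cursor` completes the first and third and keeps the second at the head. The toy ignores any ordering guarantees the real provider has to honor, which a real change would need to consider.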

shijin-aws commented 4 months ago

> Can you check if you see this issue with the 4.1.x release stream? the cma usage has been around for a long time so i'm surprised the btl/vader (aka sm) BTL isn't handling this sort of case correctly.

I don't think it's related to the Open MPI version. It's a restriction in the libfabric shm provider's CMA protocol implementation, and it can currently be exposed by running Open MPI with the EFA provider in libfabric >= 1.19 (which uses the shm provider as a peer provider and offloads unexpected message handling to shm completely).

hppritcha commented 4 months ago

> I don't think it's related to Ompi version. It's a restriction in Libfabric shm provider's CMA protocol implementation and currently can be exposed by running OMPI with EFA provider in libfabric >= 1.19 version (which uses shm provider as peer provider and offload the unexpected message handling to shm completely)

Oh sorry, I missed that.

shijin-aws commented 4 months ago

The customer is currently not blocked by this issue after modifying their application so that it does not wait on send completions before posting receives. But this should be fixed after switching to the new shm developed by @aingerson, likely in libfabric 1.22 or 2.0.

wenduwan commented 4 months ago

I should mention that the user is lucky in this case to be able to mitigate by modifying the application. I wouldn't be surprised if they run into other scenarios and get stuck, so we should fix this soon.

aingerson commented 4 months ago

@wenduwan Thanks for the context. We will work on getting the new shm ready. We just need to root-cause the inline performance issues, and then we should be ready to move.

shefty commented 4 months ago

Can you disable the offending protocol to avoid the hang?

wenduwan commented 4 months ago

> Can you disable the offending protocol to avoid the hang?

We can. Switching to SAR is an alternative, but it comes with a (slight) performance penalty, so we did not recommend it to the user.
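The thread does not say how the protocol would be disabled. One possibility, offered here as an assumption, is the `FI_SHM_DISABLE_CMA` environment variable described in the fi_shm(7) man page. A minimal sketch of setting it from C follows; `disable_shm_cma` is a hypothetical helper name, and whether the setting takes effect when shm runs as EFA's peer provider has not been verified here.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: request that the shm provider skip CMA.
 * Assumption: FI_SHM_DISABLE_CMA (see fi_shm(7)) is the relevant knob.
 * It has to be set before the provider is initialized, e.g. before
 * MPI_Init() or the first fi_getinfo() call.
 * Returns 0 on success and -1 on failure, as setenv() does.
 */
int disable_shm_cma(void)
{
    return setenv("FI_SHM_DISABLE_CMA", "1", 1 /* overwrite */);
}
```

Exporting the same variable in the job's launch environment would have the same effect without touching the application.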