open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Protect UCX endpoints against concurrent creation #12863

Closed bosilca closed 1 month ago

bosilca commented 1 month ago

If multiple threads initialize the same destination ep simultaneously, we end up in a situation where we either segfault or use resources that we will never release (because they are overwritten by another thread). Thus, protect the ep creation with an atomic update, allowing a single thread to create the ep for a particular peer.

Fixes #12833.
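A minimal sketch of the approach described above, using C11 atomics; the type names and helper are hypothetical illustrations, not the actual patch (Open MPI uses its own opal atomics rather than `<stdatomic.h>`):

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct mca_pml_ucx_ep mca_pml_ucx_ep_t;      /* opaque, illustrative */

extern mca_pml_ucx_ep_t *create_ep_for_peer(void);   /* hypothetical helper */

/* Per-peer endpoint slot: NULL = not created, EP_PENDING = creation in flight. */
static _Atomic(mca_pml_ucx_ep_t *) ep_slot;
#define EP_PENDING ((mca_pml_ucx_ep_t *)(intptr_t)1)

static mca_pml_ucx_ep_t *get_or_create_ep(void)
{
    mca_pml_ucx_ep_t *ep = atomic_load(&ep_slot);
    if (NULL == ep) {
        mca_pml_ucx_ep_t *expected = NULL;
        /* Exactly one thread wins the CAS and creates the endpoint. */
        if (atomic_compare_exchange_strong(&ep_slot, &expected, EP_PENDING)) {
            ep = create_ep_for_peer();
            atomic_store(&ep_slot, ep);       /* publish the finished ep */
        } else {
            /* Another thread is creating it; wait for the publication. */
            do {
                ep = atomic_load(&ep_slot);
            } while (EP_PENDING == ep);
        }
    }
    return ep;
}
```

This guarantees the ep is created once and never overwritten, which closes both failure modes mentioned above (the segfault and the leaked, overwritten resources).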

rhc54 commented 1 month ago

FWIW: I will be fixing the segfault as that shouldn't happen. What remains unexplained is why the key is not being found, given that it was "put" prior to the "fence" in MPI_Init and circulated during that operation. This to me is very disturbing. Were you able to track down a reason for it?

bosilca commented 1 month ago

@rhc54 I have not looked into the details of why PMIx_Get does not find the key. I noticed that concurrent PMIx_Get calls on the same proc happen because the UCX PML lacked proper protection when creating endpoints, and I fixed that issue specifically. Thus, this patch fixes only the case where these calls are issued concurrently from our messaging layer while trying to extract information about the same peer; it does not prevent concurrent calls to PMIx_Get targeting different peers, nor concurrent calls to PMIx_Get targeting the same peer but from different layers.
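For context, the retrieval that raced is a blocking PMIx_Get on the peer's modex data, roughly as sketched below; the PMIx_Get signature is the real PMIx client API, but the key name and the wrapper are illustrative:

```c
#include <pmix.h>

/* Fetch a peer's transport address from the modex. "pml.ucx.address" is an
 * illustrative key, not the actual key used by the UCX PML. */
static pmix_status_t fetch_peer_address(const pmix_proc_t *peer,
                                        pmix_value_t **val)
{
    /* Blocking lookup, handed off to the PMIx progress thread internally.
     * Two threads creating an endpoint for the same peer both land here,
     * which is the concurrency this patch serializes. */
    pmix_status_t rc = PMIx_Get(peer, "pml.ucx.address", NULL, 0, val);
    if (PMIX_SUCCESS != rc) {
        return rc;               /* the "not found" path discussed above */
    }
    /* Caller reads (*val)->data.bo and releases it with PMIX_VALUE_RELEASE. */
    return PMIX_SUCCESS;
}
```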

rhc54 commented 1 month ago

Yeah, it's puzzling. Clearly, we do find the key when OMPI is single-threaded, so we know it is being properly published and circulated (otherwise nothing would work), and properly retrieved in that situation. Somehow, when OMPI is multi-threaded, we cannot find the key - which is really strange as PMIx_Get just pushes you into the PMIx progress thread (which serializes all requests) to find the value.

It's almost like you are somehow getting to the ep formation point before we complete the fence and before the data has been stored where PMIx_Get can find it. Not sure if/how that would be possible. I'll try to create a PMIx-only reproducer to see if this is something in PMIx itself, or whether something deeper is going on in multi-threaded OMPI.

rhc54 commented 1 month ago

Ahhh...I see what's going on. You might not need this after all - the problem lies in PMIx. Put simply, when we do a "fence", we cache the collected data at the server. This helps control the memory footprint, as not every piece of data will be required by every local client. The first time you ask for a piece of data for a particular proc, we return all the data "put" by that proc, since it is likely you'll be asking for more keys.

So you do indeed get "not found" on the first call to PMIx_Get - and subsequent calls need to be cached until the server can respond. In this multi-threaded case, you generate another request before the server's response is received - and that hits a bug in PMIx.
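A hedged sketch of the failure mode being described here (not PMIx source code): on a cache miss, the first Get sends a request to the server, and any Get arriving before the response must be parked on that outstanding request rather than failing or issuing a duplicate:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* One record per caller blocked on data for a given proc. */
typedef struct pending_get {
    struct pending_get *next;
    void (*cbfunc)(int status, void *cbdata);
    void *cbdata;
} pending_get_t;

/* Per-proc cache state on the client side. */
typedef struct {
    pthread_mutex_t lock;
    bool request_outstanding;   /* a request to the server is in flight */
    pending_get_t *waiters;     /* everyone waiting for that response */
} proc_cache_t;

extern void send_request_to_server(proc_cache_t *pc);  /* hypothetical */

/* Called for each Get that misses the local cache. */
static void handle_cache_miss(proc_cache_t *pc, pending_get_t *req)
{
    pthread_mutex_lock(&pc->lock);
    req->next   = pc->waiters;  /* always park the requester first */
    pc->waiters = req;
    if (!pc->request_outstanding) {
        pc->request_outstanding = true;
        pthread_mutex_unlock(&pc->lock);
        send_request_to_server(pc);   /* first miss: ask the server once */
        return;
    }
    /* Concurrent miss: the response is already on its way; just wait.
     * Mishandling this branch is the kind of bug described above. */
    pthread_mutex_unlock(&pc->lock);
}
```

When the server's response arrives, every entry on the waiters list gets its callback invoked with the now-cached data, so the second thread never observes a spurious "not found".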

I'll get that fixed and it should resolve the observed issue. Totally up to you if you want to retain this serialization - I'm not sure if there are negative repercussions and/or what impact it might have.

rhc54 commented 1 month ago

Okay, one-line fix - adding it to the PMIx master branch now. You can track it here: https://github.com/openpmix/openpmix/pull/3415

bosilca commented 1 month ago

Thanks for your insight. After trying with your patch, I was not able to replicate the initial issue. Thus, with the serialization in PMIx_Get and the UCX serialization on the worker for the endpoint creation, this patch is unnecessary.