That could be somehow specific to UCX. What if you `mpirun --mca pml ^ucx ...`?
FWIW, here is a C+OpenMP version of the test program
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int requested = MPI_THREAD_MULTIPLE, provided;
    MPI_Init_thread(&argc, &argv, requested, &provided);
    if (provided != requested)
    {
        fprintf(stderr, "Failed to initialize MPI with full thread support!\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int mr, nr;
    MPI_Comm_rank(MPI_COMM_WORLD, &mr);
    MPI_Comm_size(MPI_COMM_WORLD, &nr);

    const size_t dim = 1024 * 1024;
    const size_t chunk_size = 16;
    const size_t chunk_count = dim / chunk_size;

    /* One buffer and one duplicated communicator per broadcast root. */
    double *buffers[nr];
    MPI_Comm comms[nr];
    for (int r = 0; r < nr; ++r)
    {
        buffers[r] = (double *)calloc(dim, sizeof(double));
        MPI_Comm_dup(MPI_COMM_WORLD, comms + r);
    }

    MPI_Datatype chunk_type;
    MPI_Type_contiguous(chunk_size, MPI_DOUBLE, &chunk_type);
    MPI_Type_commit(&chunk_type);

    const int repeat = 100;
    for (int i = 0; i < repeat; ++i)
    {
        if (mr == 0) printf("Pass = %d\n", i);
        /* One concurrent broadcast per root, each on its own communicator. */
        #pragma omp parallel for schedule(static)
        for (int r = 0; r < nr; r++)
        {
            if (r == mr) for (size_t k = 0; k < dim; k++) buffers[r][k] = (double)r;
            printf("%d: %d/%d\n", mr, r, nr);
#if 1
            MPI_Bcast(
                buffers[r], chunk_count,
                chunk_type,
                r, comms[r]
            );
#else
            /* No issue if this is used instead of the above: */
            MPI_Bcast(
                buffers[r], dim,
                MPI_DOUBLE,
                r, comms[r]
            );
#endif
        }
    }
    printf("x\n");

    for (int r = 0; r < nr; r++) {
        MPI_Comm_free(comms + r);
        free(buffers[r]);
    }
    MPI_Type_free(&chunk_type);
    MPI_Finalize();
}
From pml_ucx_datatype.h:
#ifdef HAVE_UCP_REQUEST_PARAM_T
__opal_attribute_always_inline__
static inline pml_ucx_datatype_t*
mca_pml_ucx_get_op_data(ompi_datatype_t *datatype)
{
    pml_ucx_datatype_t *ucp_type = (pml_ucx_datatype_t*)datatype->pml_data;
    if (OPAL_LIKELY(ucp_type != PML_UCX_DATATYPE_INVALID)) {
        return ucp_type;
    }
    mca_pml_ucx_init_datatype(datatype);
    return (pml_ucx_datatype_t*)datatype->pml_data;
}
This is not thread safe: mca_pml_ucx_init_datatype() should not be called on the same datatype by two concurrent threads. @yosefe can you please have a look?
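For illustration only, here is a minimal standalone sketch (plain C11 threads, not Open MPI code; every name is a stand-in) of the check-then-act race described above: several threads read the same lazily initialized slot, can all observe it as uninitialized, and then all run the initializer, overwriting each other's result.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

static void      *lazy_slot  = NULL; /* plays the role of datatype->pml_data     */
static atomic_int init_calls = 0;    /* counts how often the initializer ran     */

static void *expensive_init(void)
{
    atomic_fetch_add(&init_calls, 1); /* should happen exactly once per datatype */
    return malloc(64);                /* stand-in for the cached datatype object */
}

/* Unsynchronized check-then-act, mirroring the pattern quoted above:
 * the NULL check and the assignment are not one atomic step. */
static void *get_op_data(void *arg)
{
    (void)arg;
    if (lazy_slot == NULL) {
        lazy_slot = expensive_init(); /* two threads can both reach this line    */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[8];
    for (int i = 0; i < 8; ++i) pthread_create(&t[i], NULL, get_op_data, NULL);
    for (int i = 0; i < 8; ++i) pthread_join(t[i], NULL);
    /* If the race fires, this prints a value greater than 1. */
    printf("initializer ran %d time(s)\n", atomic_load(&init_calls));
    return 0;
}

Compiled with -std=c11 -pthread and run a few times, the printed count will occasionally exceed 1; that is the kind of duplicated initialization that can ultimately surface as the double free reported in this issue.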
Following the suggestion, I tried with export OMPI_MCA_pml=^ucx, because I need to use SLURM (srun) instead of mpirun. Unfortunately I cannot really test this, because now I get "Open MPI failed to TCP connect to a peer MPI process". Not sure if there's a way around UCX on the cluster I am using.
You can try to restrict to a TCP network known to work, for example
export OMPI_MCA_btl_tcp_if_include=192.168.0.0/24
Thanks for the suggestion and for investigating the issue! I think it's best I open a ticket with the cluster support team to help me test this. I will do that tomorrow and report the result here.
With the help of the HPC support team I was able to run the test code now without UCX. The specific settings I used are
export OMPI_MCA_pml='^ucx'
export OMPI_MCA_btl='^uct'
Interestingly, using the contiguous type still causes a problem, but now it's different:
Pass = 0
Pass = 1
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
=================================================================
[jwb0149:19603:0:19632] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa0)
==19603==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000010 (pc 0x14cfe9a8f933 bp 0x14cf8fb13ff0 sp 0x14cf8fb13f00 T11)
==19603==The signal is caused by a READ memory access.
==19603==Hint: address points to the zero page.
#0 0x14cfe9a8f933 in mca_rcache_grdma_register (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_rcache_grdma.so+0x2933)
#1 0x14cfe9a17bd0 in mca_btl_openib_register_mem (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_btl_openib.so+0xcbd0)
#2 0x14cfa466adc0 in mca_pml_ob1_rdma_btls (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_pml_ob1.so+0x10dc0)
#3 0x14cfa4668019 in mca_pml_ob1_send_request_start_seq (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_pml_ob1.so+0xe019)
#4 0x14cfa466703e in mca_pml_ob1_send (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_pml_ob1.so+0xd03e)
#5 0x14cff1ae967b in ompi_coll_base_sendrecv_actual (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40+0xb267b)
#6 0x14cff1ae5cc2 in ompi_coll_base_bcast_intra_scatter_allgather (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40+0xaecc2)
#7 0x14cfe91101af in ompi_coll_tuned_bcast_intra_dec_fixed (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_coll_tuned.so+0x71af)
#8 0x14cff1aa6089 in PMPI_Bcast (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40+0x6f089)
In particular, this only appears after one successful pass. (That could be coincidental though if we are dealing with some sort of race condition.)
On the other hand, running the comparison with MPI_DOUBLE, I get this:
Pass = 0
Pass = 1
Pass = 2
[[14795,14081],9][btl_openib_component.c:3689:handle_wc] from jwb0097.juwels to: jwb0097i error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 61e000002c98 opcode 128 vendor error 136 qp_idx 3
srun: error: jwb0097: tasks 8,10-11: Terminated
So it's a bit murky, and perhaps the non-UCX test just crashed because I am using a mode that is not officially supported on the cluster. I recommend focusing for the time being on the issue I originally reported.
@bosilca my point was that it could be better to avoid overwriting datatype->pml_data with (a pointer to) similar data in the first place, and hence avoid setting the datatype concurrently at all.
FWIW, here is a quick and dirty proof of concept that seems to fix the issue in my environment
diff --git a/ompi/mca/pml/ucx/pml_ucx_datatype.h b/ompi/mca/pml/ucx/pml_ucx_datatype.h
index 8e1fbba..97653d1 100644
--- a/ompi/mca/pml/ucx/pml_ucx_datatype.h
+++ b/ompi/mca/pml/ucx/pml_ucx_datatype.h
@@ -14,6 +14,7 @@
 #define PML_UCX_DATATYPE_INVALID 0
+#define PML_UCX_DATATYPE_PENDING 1
 #ifdef HAVE_UCP_REQUEST_PARAM_T
 typedef struct {
@@ -49,9 +50,17 @@ static inline ucp_datatype_t mca_pml_ucx_get_datatype(ompi_datatype_t *datatype)
 #ifdef HAVE_UCP_REQUEST_PARAM_T
     pml_ucx_datatype_t *ucp_type = (pml_ucx_datatype_t*)datatype->pml_data;
-    if (OPAL_LIKELY(ucp_type != PML_UCX_DATATYPE_INVALID)) {
+    if (OPAL_LIKELY(ucp_type != PML_UCX_DATATYPE_INVALID && (int64_t)ucp_type != PML_UCX_DATATYPE_PENDING)) {
         return ucp_type->datatype;
     }
+    int64_t oldval = PML_UCX_DATATYPE_INVALID;
+    if (opal_atomic_compare_exchange_strong_64((int64_t *)&datatype->pml_data, &oldval, PML_UCX_DATATYPE_PENDING)) {
+        ucp_datatype_t res = mca_pml_ucx_init_datatype(datatype);
+        return res;
+    } else {
+        while (PML_UCX_DATATYPE_PENDING == datatype->pml_data);
+        return (ucp_datatype_t)datatype->pml_data;
+    }
 #else
     ucp_datatype_t ucp_type = datatype->pml_data;
@@ -70,11 +79,16 @@ mca_pml_ucx_get_op_data(ompi_datatype_t *datatype)
 {
     pml_ucx_datatype_t *ucp_type = (pml_ucx_datatype_t*)datatype->pml_data;
-    if (OPAL_LIKELY(ucp_type != PML_UCX_DATATYPE_INVALID)) {
+    if (OPAL_LIKELY(ucp_type != PML_UCX_DATATYPE_INVALID && (int64_t)ucp_type != PML_UCX_DATATYPE_PENDING)) {
         return ucp_type;
     }
+    int64_t oldval = PML_UCX_DATATYPE_INVALID;
+    if (opal_atomic_compare_exchange_strong_64((int64_t *)&datatype->pml_data, &oldval, PML_UCX_DATATYPE_PENDING)) {
+        mca_pml_ucx_init_datatype(datatype);
+    } else {
+        while (PML_UCX_DATATYPE_PENDING == datatype->pml_data);
+    }
-    mca_pml_ucx_init_datatype(datatype);
     return (pml_ucx_datatype_t*)datatype->pml_data;
 }
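Stripped of the Open MPI specifics, the protocol this proof of concept relies on looks roughly like the following sketch in plain C11 atomics (the slot, the sentinel values, and do_real_init() are made-up stand-ins; in the real patch mca_pml_ucx_init_datatype() itself publishes the result into datatype->pml_data):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SLOT_INVALID 0                       /* analogous to PML_UCX_DATATYPE_INVALID */
#define SLOT_PENDING 1                       /* analogous to PML_UCX_DATATYPE_PENDING */

static _Atomic int64_t slot = SLOT_INVALID;  /* analogous to datatype->pml_data */

static int64_t do_real_init(void)
{
    return 42;                               /* stand-in for the one-time initialization */
}

static int64_t get_or_init(void)
{
    int64_t v = atomic_load(&slot);
    if (v != SLOT_INVALID && v != SLOT_PENDING) {
        return v;                            /* fast path: already initialized */
    }
    int64_t expected = SLOT_INVALID;
    if (atomic_compare_exchange_strong(&slot, &expected, SLOT_PENDING)) {
        /* This thread won the race: perform the init, then publish the value. */
        int64_t result = do_real_init();
        atomic_store(&slot, result);
        return result;
    }
    /* Another thread is initializing: busy-wait until the sentinel is replaced. */
    while ((v = atomic_load(&slot)) == SLOT_PENDING)
        ;
    return v;
}

int main(void)
{
    printf("%lld\n", (long long)get_or_init());
    return 0;
}

The busy-wait on the PENDING sentinel keeps the fast path a single load, at the price of spinning briefly during the rare, one-time initialization.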
You're right, preventing concurrent access to mca_pml_ucx_init_datatype can be done with an atomic operation in mca_pml_ucx_get_datatype. Nice patch!
Naïve question based on my superficial understanding of things: would it not make sense to do this kind of initialization within the implementation of MPI_Type_commit?
You could indeed do it during MPI_Type_commit. If not all committed types are used for point-to-point communications you would waste some memory, but in exchange you are getting rid of the atomic construct to protect the datatype creation.
Sounds to me like creating a large number of unused types is something that could/should be addressed at the application level, while getting rid of this initialization within the communication routines could even bring a (presumably small) performance improvement.
But admittedly I do not know the history of/reasoning for the current design, and ultimately I am happy to see it fixed in any way :)
Currently, there is no mechanism to invoke a (pml-specific) callback in MPI_Type_commit(). This is something we should at least (carefully) consider, but since that would likely break the internal ABI, it is unlikely to happen anytime soon.
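To make the tradeoff discussed above concrete, here is a hypothetical sketch (plain C, not Open MPI code; all names are invented) contrasting eager per-type initialization at commit time with lazy initialization on first use:

#include <stdlib.h>

typedef struct {
    void *transport_handle;      /* per-datatype state a pml would cache */
} type_cache_t;

typedef struct {
    int           committed;
    type_cache_t *cache;         /* analogous to datatype->pml_data */
} my_type_t;

static type_cache_t *build_cache(my_type_t *t)
{
    (void)t;
    return calloc(1, sizeof(type_cache_t));
}

/* Eager: pay the memory cost for every committed type, but the
 * communication path then needs no atomics at all. */
static void type_commit_eager(my_type_t *t)
{
    t->committed = 1;
    t->cache = build_cache(t);
}

/* Lazy: only types actually used for communication get a cache, but the
 * first use must be protected against concurrent initialization
 * (see the CAS-based sketch earlier in this thread). */
static type_cache_t *get_cache_lazy(my_type_t *t)
{
    if (t->cache == NULL) {      /* in a threaded pml this check needs the atomic protocol */
        t->cache = build_cache(t);
    }
    return t->cache;
}

int main(void)
{
    my_type_t t1 = {0, NULL}, t2 = {0, NULL};
    type_commit_eager(&t1);                /* cache exists even if t1 is never used */
    type_cache_t *c = get_cache_lazy(&t2); /* cache built on first use only */
    free(t1.cache);
    free(c);
    return 0;
}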
You can use IPoIB to restrict OMPI to only use the TCP over IP feature. Once IPoIB is configured you can mpirun with `--mca pml ob1 --mca btl self,tcp,sm --mca btl_tcp_if_include x.y.x.t/s`.
I finally managed to try this with more help from the HPC support team. I had to replace sm with vader, because otherwise I get the message As of version 3.0.0, the "sm" BTL is no longer available in Open MPI. But then, with the proper TCP network info set, my test program completes fine, both with the contiguous derived datatype and with plain MPI_DOUBLE.
The Address Sanitizer reports a couple of (supposed) memory leaks that may be worth following up on some time, but for now I'll leave it.
Hi @sekoenig, thank you for the bug report.
Sorry for the delay in replying - we are in the middle of a release process.
Could you test this PR: https://github.com/open-mpi/ompi/pull/10298 ? It adds a lock around the UCX datatype manipulation, which may help to resolve the issue.
Thank you again, and sorry for the delay.
Thanks a lot for following up and fixing this! I cannot test easily because I am working with the OpenMPI installation deployed on the HPC cluster I am using. I will check with the support team there if I might possibly compile a custom OpenMPI version within my home folder.
@sekoenig the simplest way to test is to build UCX with this PR and use LD_PRELOAD=
Okay, that might be doable without too much trouble. I will give it a shot.
Question after a conversation with the HPC support team: is it really UCX (libucp.so) that I should be building with the PR merged? They pointed out that it seems I rather need a new mca_pml_ucx.so, which would mean building a custom OpenMPI. It may well be possible to get that set up.
Oops, sorry... of course you need PML UCX updated, not UCX. My fault, sorry for that. UCX is not updated here.
I have been trying together with the HPC support staff, but unfortunately we cannot get it to work. The problem is that we need to manually compile OpenMPI 5.0, which does not work with the PMIX installation deployed on the cluster (there are compiler errors; presumably that version is too old). Using the PMIX that comes with the ompi source tree does not seem to be an option, because the cluster has PMIX integrated with their SLURM installation.
Would you perhaps be able to backport the fix to OpenMPI 4.1.x? That would be much easier to test, and presumably it would generally be good to have the bug fixed in older versions as well.
Sure, I will make a PR tomorrow & let you know.
Great, thank you!
@sekoenig you are welcome to test: https://github.com/open-mpi/ompi/pull/10340
Sorry for the delay, it was more difficult than anticipated to set up a custom OpenMPI build on the cluster. Many thanks to Sebastian L. from the JSC team for getting it done!
I am happy to report now that with #10340 applied I have run a series of tests (based on my originally posted sample code), without encountering the issue anymore.
Thanks for fixing this!
Concurrent bcast with derived datatype corrupts memory
Background information
In my application, I am running multiple broadcasts in parallel (on separate communicators and buffers, of course). Because in principle the buffers can become very large, I am using a derived datatype (MPI_Type_contiguous) for this operation. After experiencing spurious segmentation faults, I turned on the address sanitizer and noticed that the issue is actually an attempted double free within OpenMPI. Below I provide a stack trace and a small self-contained code sample reproducing the issue.
When I instead don't use the derived datatype and just go with MPI_DOUBLE, the problem does not occur. I therefore believe this is likely a bug.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
OpenMPI 4.1.2, UCX 1.11.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Deployed as module on HPC cluster.
Please describe the system on which you are running
JSC Booster, see https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html.
Details of the problem
A stack trace of the problem looks like this:
This was generated by the following sample program (running on 16 ranks):
Compile with: