open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Concurrent bcast with derived datatype corrupts memory #10111

Closed sekoenig closed 2 years ago

sekoenig commented 2 years ago

Concurrent bcast with derived datatype corrupts memory

Background information

In my application, I am running multiple broadcasts in parallel (on separate communicators and buffers, of course). Because in principle the buffers can become very large, I am using a derived datatype (MPI_Type_contiguous) for this operation. After experiencing spurious segmentation faults, I turned on AddressSanitizer and noticed that the issue is actually an attempted double free within Open MPI. Below I provide a stack trace and a small self-contained code sample reproducing the issue.

When I don't use the derived datatype and instead just pass MPI_DOUBLE, the problem does not occur. I therefore believe this is likely a bug.
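
For illustration, here is a minimal sketch of that idiom (the names bcast_large and chunk_size are just placeholders; the full reproducer is further below). The point of the chunk type is to keep the int count argument of MPI_Bcast small even for very large buffers:

// bcast_large.cpp -- minimal sketch of the chunked-broadcast idiom
#include <mpi.h>
#include <vector>

void bcast_large(std::vector<double> &buf, int root, MPI_Comm comm)
{
  const int chunk_size = 16;                 // doubles per chunk
  MPI_Datatype chunk_type;

  MPI_Type_contiguous(chunk_size, MPI_DOUBLE, &chunk_type);
  MPI_Type_commit(&chunk_type);

  // the count now fits in an int even when buf.size() exceeds INT_MAX
  const int chunk_count = static_cast<int>(buf.size() / chunk_size);

  MPI_Bcast(buf.data(), chunk_count, chunk_type, root, comm);

  MPI_Type_free(&chunk_type);
}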

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 4.1.2 with UCX 1.11.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Deployed as a module on an HPC cluster.

Please describe the system on which you are running

JSC Booster, see https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html.


Details of the problem

A stack trace of the problem looks like this:

==25990==ERROR: AddressSanitizer: attempting double-free on 0x61200005bfc0 in thread T12:
    #0 0x492187 in __interceptor_free ../../../../libsanitizer/asan/asan_malloc_linux.cpp:127
    #1 0x151f8280cae6 in mca_pml_ucx_datatype_attr_del_fn (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_pml_ucx.so+0xaae6)
    #2 0x151f8a80099d in set_value (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40+0x3399d)
    #3 0x151f8a7ff445 in ompi_attr_set_c (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40+0x32445)
    #4 0x151f8280cc7c in mca_pml_ucx_init_datatype (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_pml_ucx.so+0xac7c)
    #5 0x151f828071b5 in mca_pml_ucx_recv (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_pml_ucx.so+0x51b5)
    #6 0x151f8a87bf13 in ompi_coll_base_bcast_intra_scatter_allgather (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40+0xaef13)
    #7 0x151f747dd1af in ompi_coll_tuned_bcast_intra_dec_fixed (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_coll_tuned.so+0x71af)
    #8 0x151f8a83c089 in PMPI_Bcast (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40+0x6f089)

This was generated by the following sample program (running on 16 ranks):

// concurrent_bcast.cpp

#include <cstdlib>
#include <iostream>
#include <vector>
#include <thread>

#include <mpi.h>

#include <tbb/parallel_for.h>

int main(void)
{
  int requested = MPI_THREAD_MULTIPLE, provided;

  MPI_Init_thread(nullptr, nullptr, requested, &provided);

  if (provided != requested)
  {
    std::cerr << "Failed to initialize MPI with full thread support!"
              << std::endl;
    std::exit(1);
  }

  int mr, nr;

  MPI_Comm_rank(MPI_COMM_WORLD, &mr);
  MPI_Comm_size(MPI_COMM_WORLD, &nr);

  const size_t dim = 1024 * 1024;

  const size_t chunk_size = 16;
  const size_t chunk_count = dim / chunk_size;

  std::vector<std::vector<double>> buffers;
  std::vector<MPI_Comm> comms;

  buffers.reserve(nr);
  comms.reserve(nr);

  for (int r = 0; r < nr; ++r)
  {
    buffers.emplace_back(dim, 0.0);

    auto &comm = comms.emplace_back();

    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
  }

  MPI_Datatype chunk_type;

  MPI_Type_contiguous(chunk_size, MPI_DOUBLE, &chunk_type);
  MPI_Type_commit(&chunk_type);

  const int repeat = 100;

  for (int i = 0; i < repeat; ++i)
  {
    if (mr == 0) std::cout << "Pass = " << i << std::endl;

    tbb::parallel_for(0, nr, [&](int r)
    {
      if (r == mr) std::fill(
        buffers[r].begin(), buffers[r].end(), double(r)
      );

      MPI_Bcast(
        buffers[r].data(), chunk_count,
        chunk_type,
        r, comms[r]
      );

      /* No issue if this is used instead of the above:
      MPI_Bcast(
        buffers[r].data(), dim,
        MPI_DOUBLE,
        r, comms[r]
      );
      */
    });
  }

  for (auto &comm : comms) MPI_Comm_free(&comm);

  MPI_Type_free(&chunk_type);
  MPI_Finalize();
}

Compile with:

mpicxx -fsanitize=address -fno-omit-frame-pointer -g -c concurrent_bcast.cpp -o concurrent_bcast.o
mpicxx -fsanitize=address -static-libasan concurrent_bcast.o -l tbb -l pthread -o concurrent_bcast.run
ggouaillardet commented 2 years ago

That could be somehow specific to UCX. What if you mpirun --mca pml ^ucx ... ?

ggouaillardet commented 2 years ago

FWIW, here is a C+OpenMP version of the test program

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int requested = MPI_THREAD_MULTIPLE, provided;

  MPI_Init_thread(&argc, &argv, requested, &provided);

  if (provided != requested)
  {
    fprintf(stderr, "Failed to initialize MPI with full thread support!\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
  }

  int mr, nr;

  MPI_Comm_rank(MPI_COMM_WORLD, &mr);
  MPI_Comm_size(MPI_COMM_WORLD, &nr);

  const size_t dim = 1024 * 1024;

  const size_t chunk_size = 16;
  const size_t chunk_count = dim / chunk_size;

  double *buffers[nr];
  MPI_Comm comms[nr];

  for (int r = 0; r < nr; ++r)
  {
    buffers[r] = (double *)calloc (dim, sizeof(double));
    MPI_Comm_dup(MPI_COMM_WORLD, comms+r);
  }

  MPI_Datatype chunk_type;

  MPI_Type_contiguous(chunk_size, MPI_DOUBLE, &chunk_type);
  MPI_Type_commit(&chunk_type);

  const int repeat = 100;

  for (int i = 0; i < repeat; ++i)
  {
    if (mr == 0) printf( "Pass = %d\n", i);

    #pragma omp parallel for schedule(static)
    for (int r=0; r<nr; r++) 
    {
      if (r == mr) for(int i=0; i<dim; i++) buffers[r][i] = (double)r;
      printf("%d: %d/%d\n", mr, r, nr);

#if 1
      MPI_Bcast(
        buffers[r], chunk_count,
        chunk_type,
        r, comms[r]
      );

#else
      /* No issue if this is used instead of the above: */
      MPI_Bcast(
        buffers[r], dim,
        MPI_DOUBLE,
        r, comms[r]
      );
#endif
    }
  }
printf("x\n");

  for (int r=0; r<nr; r++) {
     MPI_Comm_free(comms+r);
  }

  MPI_Type_free(&chunk_type);
  MPI_Finalize();
}
ggouaillardet commented 2 years ago

from pml_ucx_datatype.h:

#ifdef HAVE_UCP_REQUEST_PARAM_T
__opal_attribute_always_inline__
static inline pml_ucx_datatype_t*
mca_pml_ucx_get_op_data(ompi_datatype_t *datatype)
{
    pml_ucx_datatype_t *ucp_type = (pml_ucx_datatype_t*)datatype->pml_data;

    if (OPAL_LIKELY(ucp_type != PML_UCX_DATATYPE_INVALID)) {
        return ucp_type;
    }

    mca_pml_ucx_init_datatype(datatype);
    return (pml_ucx_datatype_t*)datatype->pml_data;
}

this is not thread safe: mca_pml_ucx_init_datatype() should not be called on the same datatype by two concurrent threads.
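
To make the hazard concrete, here is a standalone sketch of the same check-then-initialize pattern outside of Open MPI (plain C++ threads standing in for the application's TBB threads; this is an illustration, not OMPI code):

// lazy_init_race.cpp -- why "check pointer, then initialize" is racy
#include <cstdio>
#include <thread>
#include <vector>

struct LazyState { int value; };   // plays the role of pml_ucx_datatype_t

LazyState *cached = nullptr;       // plays the role of datatype->pml_data

LazyState *get_op_data()
{
    if (cached != nullptr) {       // two threads can both read nullptr here...
        return cached;
    }
    // ...and then both run the "one-time" initialization. In pml_ucx the second
    // initialization overwrites the datatype attribute, and the delete callback
    // frees the first UCX datatype while the other thread may still be using it.
    cached = new LazyState{42};
    return cached;
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i)
        threads.emplace_back([] { std::printf("%p\n", (void *)get_op_data()); });
    for (auto &t : threads) t.join();
    delete cached;                 // note: any instances "lost" in the race leak
}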

ggouaillardet commented 2 years ago

@yosefe can you please have a look?

sekoenig commented 2 years ago

That could be somehow specific to UCX. What if you mpirun --mca pml ^ucx ... ?

I tried with export OMPI_MCA_pml=^ucx because I need to use SLURM (srun) instead of mpirun. Unfortunately I cannot really test, because now I get "Open MPI failed to TCP connect to a peer MPI process." Not sure if there's a way around UCX on the cluster I am using.

ggouaillardet commented 2 years ago

You can try to restrict to a TCP network known to work, for example

export OMPI_MCA_btl_tcp_if_include=192.168.0.0/24
sekoenig commented 2 years ago

Thanks for the suggestion and for investigating the issue! I think it's best I open a ticket with the cluster support team to help me test this. I will do that tomorrow and report the result here.

sekoenig commented 2 years ago

With the help of the HPC support team I was able to run the test code now without UCX. The specific settings I used are

export OMPI_MCA_pml='^ucx'
export OMPI_MCA_btl='^uct'

Interestingly, using the contiguous type still causes a problem, but now it's different:

Pass = 0
Pass = 1
AddressSanitizerAddressSanitizer:DEADLYSIGNAL
:DEADLYSIGNAL
=================================================================
[jwb0149:19603:0:19632] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa0)
==19603==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000010 (pc 0x14cfe9a8f933 bp 0x14cf8fb13ff0 sp 0x14cf8fb13f00 T11)
==19603==The signal is caused by a READ memory access.
==19603==Hint: address points to the zero page.
    #0 0x14cfe9a8f933 in mca_rcache_grdma_register (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_rcache_grdma.so+0x2933)
    #1 0x14cfe9a17bd0 in mca_btl_openib_register_mem (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_btl_openib.so+0xcbd0)
    #2 0x14cfa466adc0 in mca_pml_ob1_rdma_btls (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_pml_ob1.so+0x10dc0)
    #3 0x14cfa4668019 in mca_pml_ob1_send_request_start_seq (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_pml_ob1.so+0xe019)
    #4 0x14cfa466703e in mca_pml_ob1_send (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_pml_ob1.so+0xd03e)
    #5 0x14cff1ae967b in ompi_coll_base_sendrecv_actual (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40+0xb267b)
    #6 0x14cff1ae5cc2 in ompi_coll_base_bcast_intra_scatter_allgather (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40+0xaecc2)
    #7 0x14cfe91101af in ompi_coll_tuned_bcast_intra_dec_fixed (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_coll_tuned.so+0x71af)
    #8 0x14cff1aa6089 in PMPI_Bcast (/p/software/juwelsbooster/stages/2022/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40+0x6f089)

In particular, this only appears after one successful pass. (That could be coincidental though if we are dealing with some sort of race condition.)

On the other hand, running the comparison with MPI_DOUBLE, I get this:

Pass = 0
Pass = 1
Pass = 2
[[14795,14081],9][btl_openib_component.c:3689:handle_wc] from jwb0097.juwels to: jwb0097i error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 61e000002c98 opcode 128  vendor error 136 qp_idx 3
srun: error: jwb0097: tasks 8,10-11: Terminated

So it's a bit murky, and perhaps the non-UCX test just crashed because I am running in a mode that is not officially supported on the cluster. For the time being, I recommend focusing on the issue I originally reported.

bosilca commented 2 years ago

  1. The attribute setting is thread safe, but MPI allows overwriting an attribute: rewriting the same attribute forces the deletion of the previous value, which translates into a UCX datatype being deleted by mca_pml_ucx_datatype_attr_del_fn while another thread is still using it (see the sketch after this list). The @open-mpi/ucx folks need to refcount their use of the internal datatype.
  2. You can use IPoIB to restrict OMPI to only use the TCP over IP feature. Once IPoIB is configured you can mpirun with --mca pml ob1 --mca btl self,tcp,sm --mca btl_tcp_if_include x.y.x.t/s.
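
For illustration of point 1, here is a standalone sketch of the attribute overwrite semantics using the public MPI API (not the pml_ucx internals): setting a datatype attribute that already has a value invokes the delete callback on the previous value, and in pml_ucx that callback (mca_pml_ucx_datatype_attr_del_fn) is where the cached UCX datatype is released.

// attr_overwrite.cpp -- standalone sketch of MPI attribute overwrite semantics
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

// Delete callback: runs on the previous value whenever the attribute is
// overwritten, and again when the datatype is freed.
static int delete_fn(MPI_Datatype, int, void *attribute_val, void *)
{
    std::printf("delete callback called on %p\n", attribute_val);
    std::free(attribute_val);  // pml_ucx releases its cached UCX datatype here
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int keyval;
    MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN, delete_fn, &keyval, nullptr);

    MPI_Datatype type;
    MPI_Type_contiguous(16, MPI_DOUBLE, &type);
    MPI_Type_commit(&type);

    MPI_Type_set_attr(type, keyval, std::malloc(64));  // first set: no callback
    MPI_Type_set_attr(type, keyval, std::malloc(64));  // overwrite: callback frees the
                                                       // first value; if another thread
                                                       // still uses it, that is the bug

    MPI_Type_free(&type);                              // remaining value freed via callback
    MPI_Type_free_keyval(&keyval);
    MPI_Finalize();
    return 0;
}
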
ggouaillardet commented 2 years ago

@bosilca my point was that it could be better to avoid overwriting datatype->pml_data with (a pointer to) similar data in the first place, and hence avoid setting the datatype attribute concurrently.

FWIW, here is a quick and dirty proof of concept that seems to fix the issue in my environment

diff --git a/ompi/mca/pml/ucx/pml_ucx_datatype.h b/ompi/mca/pml/ucx/pml_ucx_datatype.h
index 8e1fbba..97653d1 100644
--- a/ompi/mca/pml/ucx/pml_ucx_datatype.h
+++ b/ompi/mca/pml/ucx/pml_ucx_datatype.h
@@ -14,6 +14,7 @@

 #define PML_UCX_DATATYPE_INVALID   0
+#define PML_UCX_DATATYPE_PENDING   1

 #ifdef HAVE_UCP_REQUEST_PARAM_T
 typedef struct {
@@ -49,9 +50,17 @@ static inline ucp_datatype_t mca_pml_ucx_get_datatype(ompi_datatype_t *datatype)
 #ifdef HAVE_UCP_REQUEST_PARAM_T
     pml_ucx_datatype_t *ucp_type = (pml_ucx_datatype_t*)datatype->pml_data;

-    if (OPAL_LIKELY(ucp_type != PML_UCX_DATATYPE_INVALID)) {
+    if (OPAL_LIKELY(ucp_type != PML_UCX_DATATYPE_INVALID && (int64_t)ucp_type != PML_UCX_DATATYPE_PENDING)) {
         return ucp_type->datatype;
     }
+    int64_t oldval = PML_UCX_DATATYPE_INVALID;
+    if (opal_atomic_compare_exchange_strong_64((int64_t *)&datatype->pml_data, &oldval, PML_UCX_DATATYPE_PENDING)) {
+        ucp_datatype_t res =  mca_pml_ucx_init_datatype(datatype);
+        return res;
+    } else {
+        while(PML_UCX_DATATYPE_PENDING == datatype->pml_data);
+        return (ucp_datatype_t)datatype->pml_data;
+    }
 #else
     ucp_datatype_t ucp_type = datatype->pml_data;

@@ -70,11 +79,16 @@ mca_pml_ucx_get_op_data(ompi_datatype_t *datatype)
 {
     pml_ucx_datatype_t *ucp_type = (pml_ucx_datatype_t*)datatype->pml_data;

-    if (OPAL_LIKELY(ucp_type != PML_UCX_DATATYPE_INVALID)) {
+    if (OPAL_LIKELY(ucp_type != PML_UCX_DATATYPE_INVALID && (int64_t)ucp_type != PML_UCX_DATATYPE_PENDING)) {
         return ucp_type;
     }
+    int64_t oldval = PML_UCX_DATATYPE_INVALID;
+    if (opal_atomic_compare_exchange_strong_64((int64_t *)&datatype->pml_data, &oldval, PML_UCX_DATATYPE_PENDING)) {
+        mca_pml_ucx_init_datatype(datatype);
+    } else {
+        while(PML_UCX_DATATYPE_PENDING == datatype->pml_data);
+    }

-    mca_pml_ucx_init_datatype(datatype);
     return (pml_ucx_datatype_t*)datatype->pml_data;
 }
bosilca commented 2 years ago

You're right, preventing concurrent access to mca_pml_ucx_init_datatype() can be done with an atomic operation in mca_pml_ucx_get_datatype(). Nice patch!
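
For readers outside the OMPI code base, the pattern the patch implements can be sketched with standard C++ atomics (a generic illustration, not the actual patch): one thread wins a compare-and-swap, performs the one-time initialization and publishes the result, while the others spin until it is published.

// cas_lazy_init.cpp -- generic claim-with-CAS / spin-until-published pattern
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

constexpr std::uintptr_t STATE_INVALID = 0;  // plays PML_UCX_DATATYPE_INVALID
constexpr std::uintptr_t STATE_PENDING = 1;  // plays PML_UCX_DATATYPE_PENDING

std::atomic<std::uintptr_t> cached{STATE_INVALID};  // plays datatype->pml_data

// Stand-in for mca_pml_ucx_init_datatype(): must run exactly once.
std::uintptr_t expensive_init()
{
    return 0x1000;  // some valid published value
}

std::uintptr_t get_or_init()
{
    std::uintptr_t val = cached.load();
    if (val != STATE_INVALID && val != STATE_PENDING) {
        return val;                                   // fast path: already initialized
    }

    std::uintptr_t expected = STATE_INVALID;
    if (cached.compare_exchange_strong(expected, STATE_PENDING)) {
        val = expensive_init();                       // we won the race: initialize...
        cached.store(val);                            // ...and publish
        return val;
    }

    // another thread is initializing: busy-wait until it publishes, as in the patch
    while ((val = cached.load()) == STATE_PENDING) { }
    return val;
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i)
        threads.emplace_back([] { std::printf("%lx\n", (unsigned long)get_or_init()); });
    for (auto &t : threads) t.join();
}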

sekoenig commented 2 years ago

Naïve question based on my superficial understanding of things: would it not make sense to do this kind of initialization within the implementation of MPI_Type_commit?

bosilca commented 2 years ago

You could indeed do it during MPI_Type_commit. If not all committed types are used for point-to-point communications you would waste some memory, but in exchange you are getting rid of the atomic construct to protect the datatype creation.

sekoenig commented 2 years ago

You could indeed do it during MPI_Type_commit. If not all committed types are used for point-to-point communications you would waste some memory, but in exchange you are getting rid of the atomic construct to protect the datatype creation.

Sounds to me like creating a large number of unused types is something that could/should be addressed at the application level, while getting rid of this initialization within the communication routines could even bring a (presumably small) performance improvement.

But admittedly I do not know the history of/reasoning for the current design, and ultimately I am happy to see it fixed in any way :)

ggouaillardet commented 2 years ago

Currently, there is no mechanism to invoke a (pml specific) callback in MPI_Type_commit().

This is something we should at least (carefully) consider, but since that would likely break internal ABI, that is unlikely to happen anytime soon.

sekoenig commented 2 years ago
2. You can use IPoIB to restrict OMPI to only use the TCP over IP feature. Once IPoIB is configured you can mpirun with `--mca pml ob1 --mca btl self,tcp,sm --mca btl_tcp_if_include x.y.x.t/s`.

I finally managed to try this with more help from the HPC support team. I had to replace sm with vader, because otherwise I get "As of version 3.0.0, the "sm" BTL is no longer available in Open MPI." With the proper TCP network info set, my test program then completes fine, both with the contiguous derived datatype and with plain MPI_DOUBLE.

AddressSanitizer reports a couple of (supposed) memory leaks that may be worth following up on at some point, but for now I'll leave it.

hoopoepg commented 2 years ago

hi @sekoenig thank you for the bug report.

sorry for the delay in replying - we are in the middle of a release process

could you test this PR: https://github.com/open-mpi/ompi/pull/10298? It adds a lock around the UCX datatype manipulation, which may resolve the issue.

thank you again and sorry for the delay

sekoenig commented 2 years ago

Thanks a lot for following up and fixing this! I cannot test easily because I am working with the OpenMPI installation deployed on the HPC cluster I am using. I will check with the support team there if I might possibly compile a custom OpenMPI version within my home folder.

hoopoepg commented 2 years ago

@sekoenig the simplest way to test is to build UCX with this PR and use the LD_PRELOAD=libucp.so variable to replace the system UCX

sekoenig commented 2 years ago

Okay, that might be doable without too much trouble. I will give it a shot.

sekoenig commented 2 years ago

Question after a conversation with the HPC support team: is it really UCX (libucp.so) that I should be building with the PR merged? They pointed out that it seems I rather need a new mca_pml_ucx.so, which would mean building a custom Open MPI. It may well be possible to get that set up.

hoopoepg commented 2 years ago

oops, sorry... of course you need the PML UCX component updated, not UCX itself. my fault, sorry for that. UCX is not changed here

sekoenig commented 2 years ago

I have been trying together with the HPC support staff, but unfortunately we cannot get it to work. The problem is that we need to manually compile Open MPI 5.0, which does not work with the PMIx installation deployed on the cluster (there are compiler errors, presumably because that version is too old). Using the PMIx that comes with the ompi source tree does not seem to be an option, because the cluster has PMIx integrated with their SLURM installation.

Would you perhaps be able to backport the fix to Open MPI 4.1.x? That would be much easier to test, and presumably it would generally be good to have the bug fixed in older versions as well.

hoopoepg commented 2 years ago

sure, I will make a PR tomorrow & let you know

sekoenig commented 2 years ago

Great, thank you!

hoopoepg commented 2 years ago

@sekoenig you're welcome to test: https://github.com/open-mpi/ompi/pull/10340

sekoenig commented 2 years ago

Sorry for the delay; it was more difficult than anticipated to set up a custom Open MPI build on the cluster. Many thanks to Sebastian L. from the JSC team for getting it done!

I am happy to report that with #10340 applied, I have run a series of tests (based on my originally posted sample code) without encountering the issue anymore.

Thanks for fixing this!