open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Issue with atomics on arm64 #12011

Closed · devreal closed this issue 4 months ago

devreal commented 9 months ago

Background information

This is an issue based on the discussion in https://github.com/open-mpi/ompi/pull/11999.

From https://github.com/open-mpi/ompi/pull/11999 it looks like we are missing a release memory barrier somewhere in the code. The problem goes away when release semantics are added to the store in the CAS. However, we generally use relaxed memory ordering in atomic operations, so the proposed fix is not the right one.
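
For reference, a minimal C11 sketch of the distinction at issue (illustrative only; cas_relaxed and cas_release are hypothetical helpers, not Open MPI's opal_atomic_* wrappers): a relaxed CAS only guarantees atomicity of the exchange, while a release CAS additionally orders all prior writes before the successful store.

#include <stdatomic.h>
#include <stdbool.h>

/* Relaxed CAS: atomic, but imposes no ordering on surrounding loads/stores. */
static inline bool cas_relaxed(_Atomic int *addr, int *expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(addr, expected, desired,
                                                   memory_order_relaxed,
                                                   memory_order_relaxed);
}

/* Release CAS: writes program-ordered before this call become visible to a
 * thread that subsequently acquires the same location. */
static inline bool cas_release(_Atomic int *addr, int *expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(addr, expected, desired,
                                                   memory_order_release,
                                                   memory_order_relaxed);
}

On AArch64 the release flavor is what maps to STLXR in an LL/SC loop, which is why strengthening the CAS store, as in #11999, can mask a barrier that belongs elsewhere.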

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Based on https://github.com/open-mpi/ompi/pull/12005 (port to 4.1.x) this issue seems to be present in 4.1.x and we should assume that it is present in master and 5.0.x as well.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

Reproducer:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include "mpi.h"

#define MAX_THREADS (20)

int g_rankSize = 0;
int g_rank = 0;
MPI_Comm g_comm[MAX_THREADS];

void *mpi_thread(void* p)
{
    int id = *(int*)p;
    free(p);
    int i;
    int count = 0;
    for (i = 0; i < 1000000; ++i) {
        int s = 1;
        int r = 0;
        MPI_Allreduce(&s, &r, 1, MPI_INT, MPI_SUM, g_comm[id]);
        /* every rank contributes 1, so a correct reduction equals the communicator size */
        if (r != g_rankSize) {
            count++;
        }
    }
    printf("rank %d id %d error count = %d\n", g_rank, id, count);
    return NULL;
}

int main(int argc, char** argv)
{
    int mpi_threads_provided;
    int req = MPI_THREAD_MULTIPLE;
    pthread_t threads[MAX_THREADS];
    const int threadNum = 10;
    int64_t i;

    MPI_Init_thread(&argc, &argv, req, &mpi_threads_provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &g_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &g_rankSize);

    MPI_Group worldGroup;
    MPI_Comm_group(MPI_COMM_WORLD, &worldGroup);
    /* one communicator per thread, so the concurrent Allreduces in mpi_thread() never share one */
    for (i = 0; i < threadNum; ++i) {
        MPI_Comm_create(MPI_COMM_WORLD, worldGroup, &g_comm[i]);
    }

    for (i = 0; i < threadNum; ++i) {
        int *p = (int*)malloc(sizeof(int));
        *p = (int)i;
        pthread_create(&threads[i], NULL, mpi_thread, (void*)p);
    }

    for (i = 0; i < threadNum; ++i) {
        pthread_join(threads[i], NULL);
    }
    MPI_Finalize();
    return 0;
}

It either yields wrong results or crashes.

lrbison commented 9 months ago

@yuncliu I am attempting to reproduce. I am using AWS hpc7g.16xlarge instances.

BogoMIPS        : 2100.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd40
CPU revision    : 1

Using 4 hosts, each running 4 tasks, with each task using 16 threads. In order to use these (assembly) atomics I've compiled with an external PMIx and without the C11 or GCC built-in atomics:

./configure --disable-builtin-atomics --disable-c11-atomics

I'll let it run in a loop and monitor for failure. How many hosts did you use to find this issue?

lrbison commented 9 months ago

@yuncliu After 500 executions I still did not observe the original crash. Any other specifics about your setup you can share?

lrbison commented 8 months ago

From #11999:

My hardware is a server with 192 arm64 cores and 4 NUMA nodes.

This is one difference. I only have access to single-socket arm cores.

yuncliu commented 7 months ago

Maybe the problem is in opal_atomic_compare_exchange_strong_32 and opal_atomic_compare_exchange_strong_64. I disassembled atomic_compare_exchange_strong as generated by both gcc and clang, and the assembly looks like:

ldaxr 
cmp
b.ne
stlxr

Right now, in "opal/include/opal/sys/arm64/atomic.h", the function opal_atomic_compare_exchange_strong_32 is exactly the same as opal_atomic_compare_exchange_strong_acq_32, and the same goes for opal_atomic_compare_exchange_strong_64 and opal_atomic_compare_exchange_strong_acq_64. I think opal_atomic_compare_exchange_strong_32/64 and opal_atomic_compare_exchange_strong_acq_32/64 should have different semantics.

So, according to the assembly of atomic_compare_exchange_strong, opal_atomic_compare_exchange_strong_32/64 should use stlxr instead of stxr.
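
For context, here is a minimal snippet that can be fed to gcc or clang (e.g. on godbolt) to reproduce the listing above; the name cas_seq_cst is just for illustration. Note that plain atomic_compare_exchange_strong defaults to memory_order_seq_cst, which is why both LDAXR and STLXR appear:

#include <stdatomic.h>
#include <stdbool.h>

/* Default (seq_cst) CAS: without LSE, AArch64 compilers emit an
 * LDAXR/STLXR retry loop like the one shown above. */
bool cas_seq_cst(_Atomic int *addr, int expected, int desired)
{
    return atomic_compare_exchange_strong(addr, &expected, desired);
}

Whether Open MPI's relaxed opal_atomic_compare_exchange_strong_32/64 wrappers should match that seq_cst default is exactly the point debated below.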

lrbison commented 7 months ago

opal_atomic_compare_exchange_strong_32/64 should use stlxr instead of stxr.

I disagree. That would be opal_atomic_compare_exchange_strong_rel_32/64.

Let's compare the opal_atomic_compare_exchange_strong_32 implementations:

stdc implementation:

atomic_compare_exchange_strong_explicit(addr, compare, value, memory_order_relaxed, memory_order_relaxed)

gcc-builtins:

__atomic_compare_exchange_n(addr, oldval, newval, false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);

arm64 asm:

ldaxr    %w0, [%2]
cmp     %w0, %w3
bne     2f
stxr    %w1, %w4, [%2]
cbnz    %w1, 1b

I've plugged those into godbolt and you can see none of them have STLXR, since none of them have release semantics.

However, it does point out that the gcc builtins and the stdc atomics implementation differ in their acquire behavior.

If adding release semantics does fix the issue, then we probably are missing a write barrier somewhere, and that's the real bug.
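
A sketch of what that alternative fix might look like at a hypothetical call site (illustrative C11 only; publish, payload, and flag are invented names, not opal code): keep the CAS relaxed and add an explicit release fence where the data is published.

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical publisher: write the payload, then flip a flag with a relaxed
 * CAS.  Without the release fence, another core may observe the flag update
 * before the payload write -- the kind of reordering suspected here.  The
 * fence restores ordering without strengthening the CAS itself. */
void publish(int *payload, _Atomic int *flag)
{
    *payload = 42;                             /* data being published         */
    atomic_thread_fence(memory_order_release); /* the possibly-missing barrier */
    int expected = 0;
    (void) atomic_compare_exchange_strong_explicit(flag, &expected, 1,
                                                   memory_order_relaxed,
                                                   memory_order_relaxed);
}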

yuncliu commented 7 months ago

UCX is OK. The problem only happens with the ob1 PML.

jsquyres commented 6 months ago

It's a new year! Any progress on this, perchance?

lrbison commented 6 months ago

@yuncliu I am planning to try another reproducer here, but I was not able to reproduce the failure last time.

Can you provide some detail on how you configured Open MPI? Thanks!

lrbison commented 5 months ago

@yuncliu Are you using the smcuda BTL? If so, I have a reproducer in https://github.com/open-mpi/ompi/issues/12270. Can you re-run your reproducer with --mca btl ^smcuda to disable that component and see whether the problem is still present?

lrbison commented 5 months ago

@yuncliu Can you test your workload now that https://github.com/open-mpi/ompi/pull/12344 has been merged, to see if that fix addresses the issue you reported?

github-actions[bot] commented 4 months ago

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] commented 4 months ago

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!