pmodels / mpich

Official MPICH Repository
http://www.mpich.org

Hang in one-sided communications across multiple nodes #7118

Open victor-anisimov opened 2 months ago

victor-anisimov commented 2 months ago

reproducer.tgz

One-sided communications between GPU pointers intermittently hang on Aurora. The attached reproducer runs on 96 nodes using six 16-node subcommunicators. One needs to run about 10-50 jobs in order to catch the hang. A successful job prints "All Done" in the output file; if that string is absent after the time runs out, the job hung. A job needs 2-3 minutes of walltime to complete, so a job that does not finish within 5 minutes has most likely hung and will not make progress.
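
The attached reproducer is Fortran and uses GPU device pointers; as orientation only, here is a minimal host-memory C sketch of the pattern being described, repeated MPI_Get calls between MPI_Win_fence epochs (sizes, counts, and the neighbor target are illustrative assumptions, not taken from the attachment):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;
    double *winbuf = malloc(n * sizeof(double));   /* a device pointer in the real test */
    double *local  = malloc(n * sizeof(double));
    MPI_Win win;
    MPI_Win_create(winbuf, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    for (int iter = 0; iter < 1000; iter++) {
        MPI_Win_fence(0, win);
        MPI_Get(local, n, MPI_DOUBLE, (rank + 1) % size, 0, n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);   /* the backtraces later in this thread show ranks stuck in this fence */
    }

    if (rank == 0)
        printf("All Done\n");   /* the success marker the job output is checked for */

    MPI_Win_free(&win);
    free(winbuf);
    free(local);
    MPI_Finalize();
    return 0;
}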

hzhou commented 2 months ago

@victor-anisimov I built a debug version of mpich on Aurora at /home/hzhou/pull_requests/mpich-debug/_inst. Could you try linking with that and running the reproducer with MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1 set? The goal is to generate a backtrace as shown in https://github.com/pmodels/mpich/pull/7084

I am not sure of the easiest way to switch libmpi.so. Try

export LD_LIBRARY_PATH=/home/hzhou/pull_requests/mpich-debug/_inst/lib:$LD_LIBRARY_PATH

and verify with ldd -- [EDIT: that likely won't work due to so version mismatch]. If that doesn't work, try

LD_PRELOAD=/home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpi.so.0
victor-anisimov commented 2 months ago

Thank you for the suggestion, @hzhou ! I linked the reproducer code against your version of MPICH by doing

module unload mpich
export PATH=${PATH}:/home/hzhou/pull_requests/mpich-debug/_inst/bin
mpif90 -fc=ifx -O0 -g -i4 -r8 -what -fiopenmp -fopenmp-targets=spir64 test.F90

ldd ./a.out
libmpifort.so.0 => /home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpifort.so.0 (0x00007f017ddf7000)
libmpi.so.0 => /home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpi.so.0 (0x00007f017b486000)

./bulk.sh

The test job crashed at initialization with the error message:

forrtl: severe (71): integer divide by zero
Image              PC                Routine            Line     Source
libpthread-2.31.s  00001460507998C0  Unknown            Unknown  Unknown
a.out              000000000040C7E2  Unknown            Unknown  Unknown
a.out              000000000040C47D  Unknown            Unknown  Unknown
libc-2.31.so       00001460505C324D  __libc_start_main  Unknown  Unknown
a.out              000000000040C3AA  Unknown            Unknown  Unknown

I guess the MPICH build configuration causes this crash. Could you take the reproducer, which is very simple, and try to run the test?

reproducer6.tgz

hzhou commented 2 months ago

module unload mpich

Make sure to set PALS_PMI=pmix if you unload the mpich module.

while setting MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1

Make it MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=10. The 1-second timeout was too short; 10 seconds seems sufficient for normal progress without flagging a false hang.

Could you take the reproducer,

Running 100 loops now (or until my 2-hour walltime kicks me out); will update.

hzhou commented 2 months ago

I think I got some timeouts; I need to confirm with a larger MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT -

...
Rank 0 conducted      1148200 one-sided messages
Rank 0 conducted      1148300 one-sided messages
3 pending requests in pool 0
    ac000004: MPIDI_OFI_do_irecv: source=4, tag=14, data_sz=0
/home/hzhou/MPI/lib/libmpi.so.0(+0xc1bad4) [0x14ab09776ad4]
/home/hzhou/MPI/lib/libmpi.so.0(+0x954a0b) [0x14ab094afa0b]
/home/hzhou/MPI/lib/libmpi.so.0(+0x9557cd) [0x14ab094b07cd]
/home/hzhou/MPI/lib/libmpi.so.0(+0x955167) [0x14ab094b0167]
/home/hzhou/MPI/lib/libmpi.so.0(+0x802a58) [0x14ab0935da58]
/home/hzhou/MPI/lib/libmpi.so.0(+0x6bf95b) [0x14ab0921a95b]
/home/hzhou/MPI/lib/libmpi.so.0(+0x6d1866) [0x14ab0922c866]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7977d8) [0x14ab092f27d8]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797aa3) [0x14ab092f2aa3]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7caee0) [0x14ab09325ee0]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7ca90e) [0x14ab0932590e]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7cac96) [0x14ab09325c96]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797f97) [0x14ab092f2f97]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797c72) [0x14ab092f2c72]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4db4dc) [0x14ab090364dc]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da580) [0x14ab09035580]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da3f9) [0x14ab090353f9]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da1cc) [0x14ab090351cc]
/home/hzhou/MPI/lib/libmpi.so.0(MPI_Win_fence+0x19) [0x14ab09034e69]
/home/hzhou/MPI/lib/libmpifort.so.0(mpi_win_fence_f08_+0x2d) [0x14ab0b5898ad]
./a.out() [0x412013]
./a.out() [0x40c4fd]
/lib64/libc.so.6(__libc_start_main+0xef) [0x14ab0782d24d]
./a.out() [0x40c42a]
...
Abort(546498063) on node 864: Fatal error in internal_Win_fence: Other MPI error, error stack:
internal_Win_fence(83).....................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPID_Win_fence(124)........................:
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_Barrier_intra_composition_alpha(184).:
MPIDI_NM_mpi_barrier(41)...................:
MPIR_Barrier_impl(105).....................:
MPIR_Barrier_allcomm_auto(53)..............:
MPIR_Barrier_intra_recexch(19).............:
MPIR_Allreduce_intra_recexch(238)..........:
MPIC_Waitall(570)..........................:
MPIR_Waitall(907)..........................:
MPIR_Waitall_state(809)....................: Operation timed out
Abort(546498063) on node 876: Fatal error in internal_Win_fence: Other MPI error, error stack:
internal_Win_fence(83).....................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPID_Win_fence(124)........................:
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_Barrier_intra_composition_alpha(184).:
MPIDI_NM_mpi_barrier(41)...................:
MPIR_Barrier_impl(105).....................:
MPIR_Barrier_allcomm_auto(53)..............:
MPIR_Barrier_intra_recexch(19).............:
MPIR_Allreduce_intra_recexch(238)..........:
MPIC_Waitall(570)..........................:
MPIR_Waitall(907)..........................:
MPIR_Waitall_state(809)....................: Operation timed out
...

It is stuck in an internal Barrier.

hzhou commented 2 months ago
 grep 'All Done' job*.out
job10.out:All Done
job11.out:All Done
job12.out:All Done
job13.out:All Done
job14.out:All Done
job16.out:All Done
job17.out:All Done
job18.out:All Done
job19.out:All Done
job1.out:All Done
job21.out:All Done
job22.out:All Done
job25.out:All Done
job26.out:All Done
job27.out:All Done
job28.out:All Done
job29.out:All Done
job2.out:All Done
job30.out:All Done
job31.out:All Done
job32.out:All Done
job33.out:All Done
job34.out:All Done
job35.out:All Done
job36.out:All Done
job37.out:All Done
job38.out:All Done
job39.out:All Done
job3.out:All Done
job40.out:All Done
job41.out:All Done
job42.out:All Done
job4.out:All Done
job6.out:All Done
job7.out:All Done
job8.out:All Done
job9.out:All Done
job.out:All Done

Jobs 5, 23, and 24 timed out.

victor-anisimov commented 2 months ago

I can run the test and am able to reproduce the backtrace after setting

module unload mpich
export PALS_PMI=pmix
export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120

Any idea what might have caused the hang in internal_Win_fence?

raffenet commented 2 months ago

I can run the test and am able to reproduce the backtrace after setting

module unload mpich
export PALS_PMI=pmix
export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120

Any idea what might have caused the hang in internal_Win_fence?

Our previous investigation found that one of the target processes somehow gets overwhelmed by MPI_Get requests and basically stops being able to send responses to the origin. We suspected a libfabric/cxi issue, but it was not clear why it would hang rather than report an error and crash. It's possible that #7117 may improve the error reporting in this case.
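
As an application-side sketch of that congestion hypothesis (not a proposed fix, and it uses passive-target lock/flush, which the reproducer does not), bounding the number of Gets outstanding toward any single target limits how far a slow responder can fall behind:

#include <mpi.h>

#define BATCH 64   /* arbitrary illustrative batch size */

/* Issue n_msgs Gets of 'count' doubles each toward one target, completing
 * them in batches so that at most BATCH operations are outstanding at a time. */
static void batched_gets(double *local, int n_msgs, int count,
                         int target, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    for (int i = 0; i < n_msgs; i++) {
        MPI_Get(&local[(size_t) i * count], count, MPI_DOUBLE,
                target, (MPI_Aint) i * count, count, MPI_DOUBLE, win);
        if ((i + 1) % BATCH == 0)
            MPI_Win_flush(target, win);   /* complete this batch before issuing more */
    }
    MPI_Win_unlock(target, win);
}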

hzhou commented 2 months ago

Thanks @raffenet for the potential clue. I can see how that (a single process getting overwhelmed due to a lower-layer failure or congestion) may cause the issue.

victor-anisimov commented 2 months ago

I tried the latest commit b3480ddfec1d9e98b06783aec97c082eadeca1a7, which includes #7117, built by @colleeneb on Aurora. The behavior of the test is unchanged: I still get 8 hangs per 50 runs, all 8 hanging jobs got killed by the scheduler for exceeding the walltime limit, and none of them printed any error or warning messages from MPICH. My conclusion is that either I do not know how to run MPICH properly to get those forward-progress errors printed out, or the fixes do not work for the hang in one-sided communications. Could @hzhou or @raffenet try the latest MPICH with the one-sided bug reproducer to see whether the forward-progress fixes work for the present case?

hzhou commented 2 months ago

The one built by Colleen doesn't have --enable-g=progress turned on, I believe. We suspect your case is due to the underlying network being unable to handle certain congestion situations while not raising an error; at this point it is just a guess. I need to think about how to further narrow down the reproducer. Currently, a 1/20 failure rate, and thus hours of runtime for a single reproduction, is a bit difficult to work with.

colleeneb commented 2 months ago

Ah, thanks, it does not! I just used

./configure --prefix=$PWD/../install_main --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx

I will add --enable-g=progress to the configure line to try! Is there anything else to change in the build to test?

hzhou commented 2 months ago

Ah, thanks, it does not! I just used

./configure --prefix=$PWD/../install_main --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx

I will add --enable-g=progress to the configure line to try! Is there anything else to change in the build to test?

@colleeneb Only do that for a debug build (vs. a performance build). For a debug build, you may also want to use --enable-g=dbg,log,progress

iziemba commented 1 week ago

I ran the ANL-116 reproducer on Aurora with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug in which MPICH provides an invalid remote key, which can result in hangs and failed RMA operations.

raffenet commented 1 week ago

I ran the ANL-116 reproducer on Aurora with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug in which MPICH provides an invalid remote key, which can result in hangs and failed RMA operations.

@iziemba Invalid in what way? I see that user-provided keys are limited to 4 bytes in the fi_cxi manpage. It does look like MPICH could generate a key larger than that the way the code is set up today. But looking at the error checks, I would expect MPICH to get -FI_EKEYREJECTED from fi_mr_reg and return it to the user rather than proceed and hang.

raffenet commented 1 week ago

I tried adding a dummy registration to my MPICH build just to see what happens. Unless I am missing something, it looks like libfabric will not complain about the invalid key, so we need to implement some verification inside MPICH itself. Here is the code for the registration I tried:

    int ret;
    uint64_t key = UINT64_MAX;
    void *buf = MPL_malloc(16, MPL_MEM_OTHER);
    struct fid_mr *mr;
    ret = fi_mr_reg(MPIDI_OFI_global.ctx[0].domain,
                   buf, 16,
                   FI_REMOTE_READ, 0,
                   key, 0, &mr, NULL);
    printf("ret = %d, %s\n", ret, fi_strerror(ret));

and here is the output 😕

ret = 0, Success
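
Given that fi_mr_reg silently accepts a key it cannot represent, one shape the MPICH-side verification could take is comparing the requested key against the key width the provider advertises in domain_attr->mr_key_size. A rough sketch; where such a hook would live inside MPICH is an assumption:

#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

/* Sketch: reject a requested MR key that is wider than the provider's
 * advertised key size (e.g. 4 bytes for cxi user keys) before calling
 * fi_mr_reg, instead of letting an RMA operation hang later. */
static int check_requested_mr_key(const struct fi_info *prov_info, uint64_t requested_key)
{
    size_t key_bytes = prov_info->domain_attr->mr_key_size;

    if (key_bytes < sizeof(uint64_t) &&
        (requested_key >> (8 * key_bytes)) != 0)
        return -FI_EKEYREJECTED;

    return 0;
}
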
raffenet commented 1 week ago

More info after building my own libfabric and adding debug printfs: the provider thinks the requested key is a provider key for some reason and therefore considers it valid. If I enable MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1, the provider actually skips the validation step altogether because the domain is set up to use prov keys exclusively, I believe.

raffenet commented 1 week ago

So when the user passes in an invalid requested key, it may accidentally cause cxi to think it is a provider key and take the cxip_key.is_prov branch and somehow pass the validation there...

bool cxip_generic_is_valid_mr_key(uint64_t key)
{
        struct cxip_mr_key cxip_key = {
                .raw = key,
        };

        if (cxip_key.is_prov)
                return cxip_is_valid_prov_mr_key(key);

        return cxip_is_valid_mr_key(key);
}
raffenet commented 1 week ago

After looking closer, I no longer think MPICH will generate a key larger than 4 bytes in this scenario. It must be something else.

iziemba commented 1 week ago

For user keys, I can update the CXI provider to check the RKEY size and ensure it is <= 4 bytes.

The FI_MR_PROV_KEY key size is 8 bytes. When running with MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1, I verified that the libfabric CXI provider was returning the expected generated RKEYs. Then I see MPICH use an invalid key. Example:

libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x4f40f70 (key: 0xA000DD743707002F)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x4c37c10 (key: 0xA000EF2705D8004D)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894a10 (key: 0xA000A22111480016)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x58947b0 (key: 0xA000A2211149001B)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x367ec40 (key: 0xA000EF2705D9003C)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x367e870 (key: 0xA000EF2705DA0028)
libfabric:125040:1730402622::cxi:core:cxip_rma_common():609<warn> x4414c7s5b0n0: TXC (0x35f41:0):: Invalid remote key: 0x455346464f5f4441
libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x4e62670 (key: 0xA000DD7437080029)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894c70 (key: 0xA000A221114A0024)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894ed0 (key: 0xA000A221114B0037)
libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x50c8b30 (key: 0xA000DD7437090027)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x4c8df10 (key: 0xA000EF2705DB0040)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x436bdf0 (key: 0xA000EF2705DC0036)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5895390 (key: 0xA000A221114C0009)
raffenet commented 1 week ago
libfabric:125040:1730402622::cxi:core:cxip_rma_common():609<warn> x4414c7s5b0n0: TXC (0x35f41:0):: Invalid remote key: 0x455346464f5f4441

Hm, OK. The warning is from the RMA origin side. So we need to trace keys from creation at the target -> usage at the origin and see if there's corruption or something.
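
One low-tech way to do that tracing is to log the key on both sides and diff the logs: print fi_mr_key() right after the target registers the window memory, and print the remote key the origin is about to hand to the RMA call. A sketch of such hooks; their placement inside MPICH is an assumption:

#include <inttypes.h>
#include <stdio.h>
#include <rdma/fi_domain.h>

/* Target side: log the key the provider actually assigned to the MR. */
static void trace_key_at_target(struct fid_mr *mr)
{
    fprintf(stderr, "[target] MR key 0x%016" PRIx64 "\n", fi_mr_key(mr));
}

/* Origin side: log the remote key about to be used for fi_read/fi_write. */
static void trace_key_at_origin(int target_rank, uint64_t rkey)
{
    fprintf(stderr, "[origin] rank %d remote key 0x%016" PRIx64 "\n",
            target_rank, rkey);
}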

hzhou commented 3 days ago

Potential workaround in #7202