pmodels / mpich

Official MPICH Repository
http://www.mpich.org

Using malloc_shared with MPI_File_write_at_all on Intel GPUs #7030

Closed: colleeneb closed this issue 1 month ago

colleeneb commented 3 months ago

Hello,

This is to report an issue we are seeing with MPICH on Intel GPUs (related to an IOR issue from @pkcoff). A small reproducer is below. The code passes a buffer allocated with Intel SYCL's malloc_shared to MPI_File_write_at_all. The code works fine with regular malloc, and it also works fine on one node, but on 2 nodes it crashes with "Abort(15) on node 1 (rank 1 in comm 496): Fatal error in internal_Issend: Other MPI error". Is it expected that we can't pass memory allocated with SYCL's malloc_shared as buffers to MPI I/O functions like MPI_File_write_at_all in multi-node jobs?

Reproducer

> cat t.cpp
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <sycl/sycl.hpp>

int main(){
    MPI_Init(NULL, NULL);

    sycl::queue syclQ{sycl::gpu_selector_v };

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    MPI_File outFile;
    MPI_File_open(
        MPI_COMM_WORLD, "test", MPI_MODE_CREATE | MPI_MODE_WRONLY,
        MPI_INFO_NULL, &outFile);

    // regular malloc like below works, malloc_shared fails   
    //    char *bufToWrite = (char*)malloc(sizeof(char)*4);  
    char *bufToWrite = (char*)sycl::malloc_shared<char>(4, syclQ);
    snprintf(bufToWrite, 4, "%3d", rank);
    printf("%s\n", bufToWrite);
    MPI_File_write_at_all(
                          outFile, rank * 3,
                          bufToWrite, 3, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&outFile);
    MPI_Finalize();
}
> mpicc -fsycl t.cpp
# run on two nodes, one rank per node
> mpirun -n 2 -ppn 1 ./a.out 

Expected output

It should run like:

> mpirun -n 2 -ppn 1 ./a.out
  1
  0

We expect it to run, since malloc_shared is accessible on the host. This works fine with 2 MPI ranks on 1 node as well.

Actual output

> mpirun -n 2 -ppn 1 ./a.out
  1
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
Abort(15) on node 1 (rank 1 in comm 496): Fatal error in internal_Issend: Other MPI error
  0
x1921c6s1b0n0.hostmgmt2000.cm.americas.sgi.com: rank 1 exited with code 15

Note that the above was with the default of ZE_FLAT_DEVICE_HIERARCHY=FLAT. It also fails with ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE:

>  mpirun -n 2 -ppn 1 ./a.out
free(): invalid pointer
x1921c6s1b0n0.hostmgmt2000.cm.americas.sgi.com: rank 1 died from signal 6
x1921c5s5b0n0.hostmgmt2000.cm.americas.sgi.com: rank 0 died from signal 15
raffenet commented 3 months ago

Is it expected that we can't pass memory allocated with SYCL's malloc_shared as buffers to MPI I/O functions like MPI_File_write_at_all for multi-node jobs?

Since this pointer is accessible from the host, I would expect MPICH to work with it no problem. We'll take a look...

raffenet commented 3 months ago

I'm unable to reproduce on 2 nodes of Sunspot so far with the default MPICH or with an upstream build. Is the bad output from Aurora?

(base) raffenet@x1922c6s1b0n0:~/proj/mpich-ze> cat foo.cpp
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <sycl/sycl.hpp>

int main(){
    MPI_Init(NULL, NULL);

    sycl::queue syclQ{sycl::gpu_selector_v };

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    MPI_File outFile;
    MPI_File_open(
        MPI_COMM_WORLD, "test", MPI_MODE_CREATE | MPI_MODE_WRONLY,
        MPI_INFO_NULL, &outFile);

    // regular malloc like below works, malloc_shared fails
    //    char *bufToWrite = (char*)malloc(sizeof(char)*4);
    char *bufToWrite = (char*)sycl::malloc_shared<char>(4, syclQ);
    snprintf(bufToWrite, 4, "%3d", rank);
    printf("%s\n", bufToWrite);
    MPI_File_write_at_all(
                          outFile, rank * 3,
                          bufToWrite, 3, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&outFile);
    MPI_Finalize();
    return 0;
}
(base) raffenet@x1922c6s1b0n0:~/proj/mpich-ze> which mpicc
/opt/aurora/24.086.0/CNDA/mpich/20231026/mpich-ofi-all-icc-default-pmix-gpu-drop20231026/bin/mpicc
(base) raffenet@x1922c6s1b0n0:~/proj/mpich-ze> mpicc -fsycl foo.cpp
(base) raffenet@x1922c6s1b0n0:~/proj/mpich-ze> mpiexec -n 2 -ppn 1 ./a.out
  1
  0
(base) raffenet@x1922c6s1b0n0:~/proj/mpich-ze> mpiexec -n 2 -ppn 1 hostname
x1922c6s1b0n0
x1922c6s2b0n0
colleeneb commented 3 months ago

Thanks a lot for testing!

Yes, the bad output is from Aurora. I just checked on Sunspot as well, and I realized that it does seem to work ok from my /home/! But from /lus/gila/projects/ it was failing for me on Sunspot like on Aurora. I double checked on Aurora, and it fails for me in /home/ as well as /lus/flare/projects/. Odd that it's filesystem dependent -- could you try on Aurora or somewhere in /lus/gila/projects/ on Sunspot to check if what I'm seeing is reproducible (and that I'm not crazy :) )?

raffenet commented 3 months ago

OK, I see the issue when running in /lus/gila/proj. Will update when I know more.

raffenet commented 3 months ago

I captured this backtrace using the Intel debug library. Something is going wrong down in the libfabric layer related to memory registration. So far I have been unable to reproduce it with upstream MPICH, but that might be due to different FI_HMEM settings. Will keep looking...

#0  0x00001481b5888c6b in raise () from /lib64/libc.so.6
#1  0x00001481b588a305 in abort () from /lib64/libc.so.6
#2  0x00001481b58cea97 in __libc_message () from /lib64/libc.so.6
#3  0x00001481b58d6b1a in malloc_printerr () from /lib64/libc.so.6
#4  0x00001481b58d85c4 in _int_free () from /lib64/libc.so.6
#5  0x00001481b40d49bd in ze_dev_unregister (handle=70029312) at src/hmem_ze.c:1111
#6  0x00001481b412d158 in cxip_do_unmap (cache=<optimized out>, entry=0x424fba0) at prov/cxi/src/cxip_iomm.c:165
#7  0x00001481b40f56d8 in util_mr_free_entry (entry=0x424fba0, cache=0x41b8230) at prov/util/src/util_mr_cache.c:111
#8  util_mr_cache_create (entry=0x7ffee1aa9d10, info=0x7ffee1aa9c80, cache=0x41b8230) at prov/util/src/util_mr_cache.c:325
#9  ofi_mr_cache_search (cache=cache@entry=0x41b8230, attr=attr@entry=0x7ffee1aa9d40, entry=entry@entry=0x7ffee1aa9d10) at prov/util/src/util_mr_cache.c:385
#10 0x00001481b412df76 in cxip_map_cache (md=0x7ffee1aa9de8, attr=0x7ffee1aa9d40, dom=0x41b4090) at prov/cxi/src/cxip_iomm.c:351
#11 cxip_map (dom=dom@entry=0x41b4090, buf=buf@entry=0x148196520000, len=len@entry=3, flags=flags@entry=0, md=md@entry=0x7ffee1aa9de8)
    at prov/cxi/src/cxip_iomm.c:560
#12 0x00001481b41189ff in cxip_txc_copy_from_hmem (txc=txc@entry=0x41e7ee8, hmem_md=<optimized out>, hmem_md@entry=0x0, dest=0x4920d00,
    hmem_src=0x148196520000, size=3) at ./prov/cxi/include/cxip.h:3050
#13 0x00001481b4124ed7 in cxip_send_buf_init (req=0x421a550) at prov/cxi/src/cxip_msg.c:5403
#14 cxip_send_common (txc=0x41e7ee8, tclass=<optimized out>, buf=0x148196520000, len=3, desc=desc@entry=0x0, data=1, dest_addr=0, tag=8796613115904,
    context=0x0, flags=33685504, tagged=true, triggered=false, trig_thresh=0, trig_cntr=0x0, comp_cntr=<optimized out>) at prov/cxi/src/cxip_msg.c:5551
#15 0x00001481b4125251 in cxip_tinjectdata (fid_ep=<optimized out>, buf=<optimized out>, len=<optimized out>, data=<optimized out>,
    dest_addr=<optimized out>, tag=<optimized out>) at prov/cxi/src/cxip_msg.c:5763
#16 0x00001481b6750e96 in fi_tinjectdata (ep=0x41b0640, buf=0x148196520000, len=3, data=1, dest_addr=0, tag=8796613115904)
    at /opt/intel/csr/ofi/sockets-dynamic/include/rdma/fi_tagged.h:149
#17 0x00001481b6769113 in MPIDI_OFI_send_normal (buf=0x148196520000, count=3, datatype=1275068685, cq_data=1, dst_rank=0, tag=0,
    comm=0x1481d26f3240 <MPIR_Comm_direct>, context_offset=0, addr=0x4166648, vci_src=0, vci_dst=0, request=0x7ffee1aaacd0, dt_contig=1, data_sz=3,
    dt_ptr=0x0, dt_true_lb=0, type=8796093022208, attr=...) at ./src/mpid/ch4/netmod/ofi/ofi_send.h:362
#18 0x00001481b6763640 in MPIDI_OFI_send (buf=0x148196520000, count=3, datatype=1275068685, dst_rank=0, tag=0, comm=0x1481d26f3240 <MPIR_Comm_direct>,
    context_offset=0, addr=0x4166648, vci_src=0, vci_dst=0, request=0x7ffee1aaacd0, noreq=0, syncflag=8796093022208, err_flag=MPIR_ERR_NONE)
    at ./src/mpid/ch4/netmod/ofi/ofi_send.h:635
#19 0x00001481b67410f9 in MPIDI_NM_mpi_isend (buf=0x148196520000, count=3, datatype=1275068685, rank=0, tag=0, comm=0x1481d26f3240 <MPIR_Comm_direct>,
    attr=8, addr=0x4166648, request=0x7ffee1aaacd0) at ./src/mpid/ch4/netmod/ofi/ofi_send.h:689
#20 0x00001481b63c6075 in MPIDI_NM_mpi_isend (buf=0x148196520000, count=3, datatype=1275068685, rank=0, tag=0, comm=0x1481d26f3240 <MPIR_Comm_direct>,
    attr=8, addr=0x4166648, req_p=0x7ffee1aaacd0) at ./src/mpid/ch4/netmod/include/netmod_impl.h:239
#21 0x00001481b63c5c70 in MPIDI_isend (buf=0x148196520000, count=3, datatype=1275068685, rank=0, tag=0, comm=0x1481d26f3240 <MPIR_Comm_direct>, attr=8,
    av=0x4166648, req=0x7ffee1aaacd0) at ./src/mpid/ch4/src/ch4_send.h:31
#22 0x00001481b63ca38e in MPID_Isend (buf=0x148196520000, count=3, datatype=1275068685, rank=0, tag=0, comm=0x1481d26f3240 <MPIR_Comm_direct>, attr=8,
    request=0x7ffee1aaacd0) at ./src/mpid/ch4/src/ch4_send.h:60
#23 0x00001481b63ca52a in MPID_Issend (buf=0x148196520000, count=3, datatype=1275068685, rank=0, tag=0, comm=0x1481d26f3240 <MPIR_Comm_direct>, attr=8,
    request=0x7ffee1aaacd0) at ./src/mpid/ch4/src/ch4_send.h:150
#24 0x00001481b634a937 in internal_Issend (buf=0x148196520000, count=3, datatype=1275068685, dest=0, tag=0, comm=-2080374784, request=0x4948a30)
    at src/binding/c/c_binding.c:59758
#25 0x00001481b6349c23 in PMPI_Issend (buf=0x148196520000, count=3, datatype=1275068685, dest=0, tag=0, comm=-2080374784, request=0x4948a30)
    at src/binding/c/c_binding.c:59815
#26 0x00001481b9cc89ef in ADIOI_LUSTRE_W_Exchange_data (fd=<optimized out>, buf=<optimized out>, write_buf=<optimized out>, flat_buf=<optimized out>,
    offset_list=<optimized out>, len_list=<optimized out>, send_size=<optimized out>, recv_size=<optimized out>, off=<optimized out>, size=<optimized out>,
    count=<optimized out>, start_pos=<optimized out>, buftype_is_contig=<optimized out>, striping_info=<optimized out>, others_req=<optimized out>,
    send_buf_idx=<optimized out>, curr_to_proc=<optimized out>, done_to_proc=<optimized out>, buftype_extent=<optimized out>, buf_idx=<optimized out>,
    sent_to_proc=<optimized out>, nprocs=<optimized out>, myrank=<optimized out>, contig_access_count=<optimized out>, hole=<optimized out>,
    iter=<optimized out>, srt_off=<optimized out>, srt_len=<optimized out>, srt_num=<optimized out>, error_code=<optimized out>)
    at adio/ad_lustre/ad_lustre_wrcoll.c:826
#27 ADIOI_LUSTRE_Exch_and_write (fd=0x4941d30, buf=<optimized out>, datatype=<optimized out>, nprocs=2, others_req=<optimized out>, my_req=<optimized out>,
    offset_list=<optimized out>, len_list=<optimized out>, striping_info=<optimized out>, buf_idx=<optimized out>, myrank=<optimized out>,
    contig_access_count=<optimized out>, error_code=<optimized out>) at adio/ad_lustre/ad_lustre_wrcoll.c:590
#28 ADIOI_LUSTRE_WriteStridedColl (fd=0x4941d30, buf=<optimized out>, count=<optimized out>, datatype=<optimized out>, file_ptr_type=<optimized out>,
    offset=<optimized out>, status=<optimized out>, error_code=<optimized out>) at adio/ad_lustre/ad_lustre_wrcoll.c:303
#29 0x00001481b9cc070d in MPIOI_File_write_all (fh=<optimized out>, offset=3, file_ptr_type=file_ptr_type@entry=100, buf=<optimized out>, count=3,
    datatype=1275068673, myname=0x1481d26f15d0 <PMPI_File_write_at_all.myname> "MPI_FILE_WRITE_AT_ALL", status=0x1) at mpi-io/write_all.c:166
#30 0x00001481b9cc0c25 in PMPI_File_write_at_all (fh=0x2, offset=140732684474368, buf=0x0, count=<optimized out>, datatype=-508913664, status=0x8)
    at mpi-io/write_atall.c:71
#31 0x0000000000402998 in main () at foo.cpp:27
raffenet commented 3 months ago

with libfabric logging enabled:

libfabric:114288:1718981160::cxi:mr:cxip_do_map():59<warn> x1921c0s6b0n0: ZE device memory not supported. Try disabling implicit scaling (EnableImplicitScaling=0 NEOReadDebugKeys=1).
libfabric:114288:1718981160::cxi:mr:cxip_do_map():59<warn> x1921c0s6b0n0: ZE device memory not supported. Try disabling implicit scaling (EnableImplicitScaling=0 NEOReadDebugKeys=1).
free(): invalid pointer
x1921c0s6b0n0.hostmgmt2000.cm.americas.sgi.com: rank 1 died from signal 6 and dumped core
x1921c0s6b0n0.hostmgmt2000.cm.americas.sgi.com: rank 0 died from signal 15
raffenet commented 3 months ago

The I/O portion of this is not related to the failure. You can trigger it simply by attempting to send the malloc_shared buffer. IMO, there are 2 issues:

  1. MPICH should maybe be packing this data before passing it to libfabric, since it is known that device allocations, which include malloc_shared, are not supported by the provider (a minimal application-side sketch of this packing is below).
  2. libfabric is not properly initializing the memory registration handle when registration fails for the malloc_shared buffer. The crash comes during libfabric MR cache cleanup, which looks like it's trying to unmap an invalid handle.
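
As a stopgap on the application side, the same packing can be done by hand: copy the malloc_shared data into a plain malloc'd host buffer and pass that to MPI instead, which the original report shows works. An untested sketch (the helper name is just illustrative):

#include <cstdlib>
#include <cstring>
#include <mpi.h>

/* Untested host-staging sketch: stage the (host-accessible) malloc_shared data
 * through a plain malloc'd bounce buffer before handing it to MPI-IO, instead
 * of passing the shared pointer directly. */
void write_at_all_via_host(MPI_File fh, MPI_Offset offset,
                           const char *sharedBuf, int count)
{
    char *hostBuf = (char *) malloc(count);
    memcpy(hostBuf, sharedBuf, count);   /* malloc_shared is readable on the host */
    MPI_File_write_at_all(fh, offset, hostBuf, count, MPI_CHAR, MPI_STATUS_IGNORE);
    free(hostBuf);
}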
raffenet commented 3 months ago

Here is a reproducer without the I/O:

#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <sycl/sycl.hpp>

int main(void){
    MPI_Init(NULL, NULL);

    sycl::queue syclQ{sycl::gpu_selector_v };

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
      char *buf = (char*)sycl::malloc_shared<char>(3, syclQ);
      MPI_Send(buf, 3, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      char foo[3];
      MPI_Recv(foo, 3, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
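
For context on why this buffer ends up on the provider's ZE registration path even though it is host-accessible: SYCL still reports malloc_shared as a USM allocation (usm::alloc::shared on the Level Zero backend) rather than an ordinary host pointer. A small standalone check, using only standard SYCL 2020 calls (not MPICH internals):

#include <cstdio>
#include <cstdlib>
#include <sycl/sycl.hpp>

// Standalone illustration (not MPICH code): malloc_shared is reported by SYCL
// as usm::alloc::shared, i.e. a backend-managed USM allocation, even though
// the pointer is perfectly usable on the host.
int main() {
    sycl::queue q{sycl::gpu_selector_v};
    char *shared = sycl::malloc_shared<char>(4, q);
    char *plain  = (char *) malloc(4);

    auto kind = [&](const void *p) {
        switch (sycl::get_pointer_type(p, q.get_context())) {
        case sycl::usm::alloc::host:   return "usm host";
        case sycl::usm::alloc::device: return "usm device";
        case sycl::usm::alloc::shared: return "usm shared";
        default:                       return "not a USM allocation";
        }
    };
    printf("malloc_shared: %s\n", kind(shared));  // usm shared
    printf("plain malloc:  %s\n", kind(plain));   // not a USM allocation

    sycl::free(shared, q);
    free(plain);
    return 0;
}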
raffenet commented 1 month ago

After discussing with @colleeneb I took another look at this, and I can indeed reproduce the crash with MPICH main and 4.2.x. I tried the patch in #7042 and it didn't make the crash go away. I must have messed up my environment when testing it initially. Will update with more info as I collect it.

raffenet commented 1 month ago

This patch fixes the issue in the example program. I'll push a commit onto Alex's branch.

diff --git a/src/mpid/ch4/netmod/ofi/ofi_send.h b/src/mpid/ch4/netmod/ofi/ofi_send.h
index 7aa14efce..61316c364 100644
--- a/src/mpid/ch4/netmod/ofi/ofi_send.h
+++ b/src/mpid/ch4/netmod/ofi/ofi_send.h
@@ -600,7 +600,7 @@ MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_send(const void *buf, MPI_Aint count, MPI
     if (MPIR_CVAR_CH4_OFI_ENABLE_INJECT && !syncflag && dt_contig &&
         (data_sz <= MPIDI_OFI_global.max_buffered_send)) {
         MPI_Aint actual_pack_bytes = 0;
-        if (attr.type == MPL_GPU_POINTER_DEV && data_sz) {
+        if (MPL_gpu_query_pointer_is_dev(send_buf, &attr) && data_sz) {
             MPIDI_OFI_register_am_bufs();
             if (!MPIDI_OFI_ENABLE_HMEM) {
                 /* Force pack for GPU buffer. */
colleeneb commented 1 month ago

Oh, great! Does it need to be on top of #7042 (Alex's branch?) to work, or is it ok on top of main too?

raffenet commented 1 month ago

Oh, great! Does it need to be on top of #7042 (Alex's branch?) to work, or is it ok on top of main too?

On top of main is fine, but you should pick both patches in any case. The check is needed in multiple places depending on message size. I'll rebase #7042 and run it through our Jenkins tests in hopes of merging later today.