pmodels / mpich

Official MPICH Repository
http://www.mpich.org

MPI_File_write_at with a GPU device buffer on Intel GPUs #7044

Open colleeneb opened 1 week ago

colleeneb commented 1 week ago

Hello,

This is to report an issue we are seeing with MPICH on Intel GPUs (related to an IOR issue from @pkcoff).

If we run a code (reproducer below) that calls MPI_File_write_at with a GPU device buffer, nothing is written to the file. It works fine if we pass the host buffer instead.

Thanks! Let us know if this is expected or we're doing something wrong.

Reproducer

> cat t.cpp
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <sycl/sycl.hpp>
#include <filesystem>
#include <iostream>
#define MESSAGE_SIZE 4

int main(){
    MPI_Init(NULL, NULL);

    sycl::queue syclQ{sycl::gpu_selector_v };

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    MPI_File outFile;
    MPI_File_open(
        MPI_COMM_WORLD, "test", MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_EXCL,
        MPI_INFO_NULL, &outFile);
    MPI_Status status;
    char *bufToWrite_host = (char*)malloc(sizeof(char)*MESSAGE_SIZE);
    char *bufToWrite_device = sycl::malloc_device<char>(MESSAGE_SIZE, syclQ);
    snprintf(bufToWrite_host, MESSAGE_SIZE, "%3d", rank);
    printf("%s\n", bufToWrite_host);

    syclQ.memcpy( bufToWrite_device, bufToWrite_host, sizeof(char)*MESSAGE_SIZE).wait();
    MPI_File_write_at(
                          outFile, rank * MESSAGE_SIZE,
                          bufToWrite_device, MESSAGE_SIZE, MPI_CHAR, &status);

    if(status.MPI_ERROR != MPI_SUCCESS) {
      printf( "FAIL %d\n", status.MPI_ERROR );
      return 1;
    }

    MPI_File_close(&outFile);

    MPI_Barrier(MPI_COMM_WORLD);

    if( rank == 0 ) {

      std::filesystem::path p{"test"};

      std::cout << "The size of " << p.u8string() << " is " <<
        std::filesystem::file_size(p) << " bytes.\n";
      if( std::filesystem::file_size(p) == 0 ) {
        std::cout << "This is incorrect" << std::endl;
        return 1;
      }
    }

    MPI_Finalize();
    return 0;
}
> rm test # removing the output file if it's there
> mpicc -fsycl t.cpp
> mpirun -n 1 ./a.out

Expected Output

We expect the code to produce a file called "test" which has a size of 4 bytes. The code checks the size and prints it:

> mpirun -n 1 ./a.out
  0
The size of test is 4 bytes.

We get this output if we send the host buffer to the MPI call.
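For completeness, the working host-buffer variant differs from the reproducer only in which pointer is passed to the write call:

// Working variant: pass the host buffer instead of the device buffer;
// everything else in the reproducer stays the same.
MPI_File_write_at(
                      outFile, rank * MESSAGE_SIZE,
                      bufToWrite_host, MESSAGE_SIZE, MPI_CHAR, &status);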

Actual Output

Instead, nothing is written to the file:

> mpirun -n 1 ./a.out
  0
The size of test is 0 bytes.
This is incorrect
x4214c3s3b0n0.hostmgmt2214.cm.aurora.alcf.anl.gov: rank 0 exited with code 1
raffenet commented 5 days ago

The ROMIO component of MPICH is not currently GPU-aware, so I'm surprised this code doesn't just crash 😕. We have an issue open to add support for this kind of usage, but it is not actively being worked on. We could raise the priority if it is desirable for Aurora.

pkcoff commented 5 days ago

@raffenet I don't have any actual user requests for this, but I would imagine that if a user is doing MPI with GPU buffers to avoid the overhead of copying back to the host, they would also want to keep the data on the GPU when doing MPI-IO, so I would say it makes sense to prioritize this support.

roblatham00 commented 5 days ago

@pkcoff I was talking about this with my student earlier this week... I think we can combine MPICH's GPU-awareness with ROMIO's two-phase buffering and get GPU awareness for free in the collective I/O case. In a sense, ROMIO is already packing/unpacking into its intermediate buffer.

File I/O occurs to/from the "cb_buffer_size" buffer, but data exchange among the processes happens with MPI point-to-point messaging, which is already able to handle device memory.

Never tried it, but I'm curious what happens if your test case does write_at_all (and forces collective buffering if necessary).
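Something like this against the reproducer should exercise that path (a sketch only, untested; romio_cb_write is the standard ROMIO hint for controlling collective buffering):

// Force collective buffering via a ROMIO hint, then use the collective write.
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "romio_cb_write", "enable");

MPI_File outFile;
MPI_File_open(
    MPI_COMM_WORLD, "test", MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_EXCL,
    info, &outFile);

MPI_Status status;
// Collective version of the write in the reproducer; all ranks must call it.
MPI_File_write_at_all(
    outFile, rank * MESSAGE_SIZE,
    bufToWrite_device, MESSAGE_SIZE, MPI_CHAR, &status);

MPI_File_close(&outFile);
MPI_Info_free(&info);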

pkcoff commented 5 days ago

@roblatham00 Yes, write_at_all works with collective buffering enabled. However, if I disable it with the romio_cb_write hint it fails with a bad address for me within IOR, yet for some reason, according to @colleeneb, her reproducer still works.

pkcoff commented 5 days ago

@roblatham00 @colleeneb So write_at_all with collective buffering works because the collective buffer is CPU memory on the host. The problem is that with independent I/O the file write is handed the GPU device buffer directly, which isn't supported - I read this in the Intel oneAPI optimization guide - "File I/O is not possible from SYCL* kernels." So I don't know how this can be supported efficiently...

roblatham00 commented 4 days ago

Thanks for trying that out, Paul.

so "all" we need to do is 1: detect if memory is host or device (how?) 2: memcpy into a scratch buffer before calling the posix read/write

of course, we need to be a little careful with huge requests, so maybe we instead allocate a 16 MiB buffer and copy into it as many times as needed.
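A rough sketch of that staging loop, with a fixed 16 MiB host bounce buffer drained by POSIX pwrite (names, buffer size, and error handling are illustrative only):

#include <sycl/sycl.hpp>
#include <unistd.h>   // pwrite
#include <algorithm>
#include <cstddef>

// Sketch: write a device buffer to a file descriptor by staging through a
// fixed-size host buffer, so huge requests never allocate huge host memory.
ssize_t staged_write(int fd, const char *dev_buf, size_t count, off_t offset,
                     sycl::queue &q) {
  constexpr size_t kStage = 16 * 1024 * 1024;   // 16 MiB bounce buffer
  char *stage = sycl::malloc_host<char>(kStage, q);

  size_t done = 0;
  while (done < count) {
    size_t chunk = std::min(kStage, count - done);
    q.memcpy(stage, dev_buf + done, chunk).wait();  // device -> host copy
    ssize_t n = pwrite(fd, stage, chunk, offset + done);
    if (n < 0) { sycl::free(stage, q); return -1; }
    done += static_cast<size_t>(n);
  }
  sycl::free(stage, q);
  return static_cast<ssize_t>(done);
}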

Memcpy is stupid fast, and writing to storage, even over Slingshot, is not, so I'm not worried about performance.

In fact I just had a student of mine test out GPU direct for storage -- best case you get 25% more performance: that's not nothing but it's not worth spending a ton of engineering time on either.

pkcoff commented 3 days ago

@roblatham00 yeah, imo the safest approach is to use the collective buffer if the rank is an aggregator; if not, allocate a scratch buffer of the cb size on the CPU and then write large device buffers in chunks, memcpy'ing from the device.
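In pseudo-form, that policy might look roughly like this (illustrative only; is_aggregator and the collective-buffer path stand in for ROMIO's internal bookkeeping, and staged_write is the hypothetical helper sketched earlier in the thread):

// Illustrative decision only, not ROMIO code: reuse the collective buffer
// on aggregator ranks, otherwise stage device data through a cb-sized
// host scratch buffer before the POSIX write.
void write_device_buffer(bool is_aggregator, int fd, const char *dev_buf,
                         size_t count, off_t offset, sycl::queue &q) {
  if (is_aggregator) {
    // Aggregators already own a host-side collective buffer; data lands
    // there via MPI point-to-point, which is GPU-aware.
    // ... existing collective-buffering path ...
  } else {
    // Non-aggregators fall back to chunked staging (see staged_write above).
    staged_write(fd, dev_buf, count, offset, q);
  }
}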