Open colleeneb opened 1 week ago
The ROMIO component of MPICH is not currently GPU-aware, so I'm surprised this code doesn't just crash 😕. We have an issue open to add support for this kind of usage, but it is not actively being worked on. We could raise the priority if it is desirable for Aurora.
@raffenet I don't have any actual user requests for this, but I would imagine if a user was doing MPI with gpu buffers to avoid the overhead of copying back to the host they would also want to keep the data on the gpu when doing MPI-IO, so I would say it would make sense to prioritize this support.
@pkcoff I was talking about this with my student earlier this week... I think we can combine MPICH's GPU-aware-ness with ROMIO's two phase buffering and get GPU awareness for free in the collective i/o case. In a sense, ROMIO is packing/unpacking into its intermediate buffer.
File i/o occurs to/from the "cb_buffer_size" buffer, but data exchange among the processes happens with MPI point-to-point messaging, which is already able to handle device memory.
Never tried it but i'm curious what happens if your test case does write_at_all (and forces collective buffering if necessary)
@roblatham00 Yes, write_at_all works with collective buffering enabled. If I disable it with the romio_cb_write hint, though, it fails with a bad address for me within IOR. For some reason, according to @colleeneb, her reproducer works.
@roblatham00 @colleeneb So write_at_all with collective buffering works because the collective buffer is CPU memory on the host. The problem is with independent I/O: the file write is given the GPU device buffer, which isn't supported. I read this in the Intel oneAPI optimization guide: "File I/O is not possible from SYCL* kernels." So I don't know how this can be supported efficiently.
Thanks for trying that out, Paul.
so "all" we need to do is 1: detect if memory is host or device (how?) 2: memcpy into a scratch buffer before calling the posix read/write
of course, we need to be a little careful with huge requests so maybe we instead allocate a 16 MiB buffer and copy into that many times.
Memcpy is stupid fast, and writing to storage, even over slingshot, is not, so i'm not worried about performance.
In fact I just had a student of mine test out GPU direct for storage -- best case you get 25% more performance: that's not nothing but it's not worth spending a ton of engineering time on either.
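The scratch-buffer idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not ROMIO code: `staged_pwrite`, `copy_to_host_fn`, and `host_copy` are hypothetical names, and the host/device detection step is left out entirely. A plain `memcpy` stands in for whatever device-to-host copy the GPU runtime would provide, so the sketch runs on host memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical copy hook. For real device memory this would be a
 * device-to-host copy from the GPU runtime; that part is an assumption
 * and is not shown here. */
typedef void (*copy_to_host_fn)(void *dst, const void *src, size_t n);

/* Stand-in used for host memory so the sketch is runnable as-is. */
static void host_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Write `count` bytes from `src` to `fd` starting at `offset`, staging the
 * data through a bounded host scratch buffer of `scratch_size` bytes
 * (for example 16 MiB) instead of one huge copy.
 * Returns 0 on success, -1 on error. */
static int staged_pwrite(int fd, const void *src, size_t count, off_t offset,
                         size_t scratch_size, copy_to_host_fn copy)
{
    assert(copy != NULL);
    if (scratch_size == 0)
        return -1;

    char *scratch = malloc(scratch_size);
    if (scratch == NULL)
        return -1;

    const char *p = src;
    size_t done = 0;
    while (done < count) {
        size_t remaining = count - done;
        size_t chunk = remaining < scratch_size ? remaining : scratch_size;

        /* Step 1: bring one chunk into host memory. */
        copy(scratch, p + done, chunk);

        /* Step 2: ordinary POSIX write from the host scratch buffer,
         * looping because pwrite may write fewer bytes than requested. */
        size_t written = 0;
        while (written < chunk) {
            ssize_t n = pwrite(fd, scratch + written, chunk - written,
                               offset + (off_t)(done + written));
            if (n <= 0) {
                free(scratch);
                return -1;
            }
            written += (size_t)n;
        }
        done += chunk;
    }

    free(scratch);
    return 0;
}
```

A call such as `staged_pwrite(fd, buf, nbytes, 0, 16u << 20, host_copy)` would write `nbytes` bytes in pieces of at most 16 MiB. Keeping the scratch buffer bounded means a very large request never needs a matching host allocation, at the cost of one extra copy per chunk.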
@roblatham00 Yeah, IMO it's safest to use the collective buffer if the rank is an aggregator. If not, allocate a scratch buffer of the cb size on the CPU and write in chunks for large device buffers, memcpy'ing from the device.
Hello,
This is to report an issue we are seeing with MPICH on Intel GPUs (related to an IOR issue from @pkcoff).
If we run code (reproducer below) that calls MPI_File_write_at with a GPU device buffer, nothing is written to the file. It works fine if we use a host buffer.
Thanks! Let us know if this is expected or we're doing something wrong.
Reproducer
Expected Output
We expect the code to produce a file called "test" which has a size of 4 bytes. The code checks the size and prints it:
We get this output if we send the host buffer to the MPI call.
Actual Output
It does not put results in the file: