pmodels/mpich

Official MPICH Repository
http://www.mpich.org

romio: Add basic GPU-awareness #7108

Closed: raffenet closed this pull request 3 weeks ago

raffenet commented 1 month ago

Pull Request Description

Allocate and use host buffers to perform I/O, if device buffers are detected. Fixes pmodels/mpich#7044.

For read APIs, the pattern is:

1. allocate a temporary host buffer
2. read from the file into the host buffer
3. copy from the host buffer to the device buffer
4. free the host buffer

For write APIs, the pattern is the mirror image:

1. allocate a temporary host buffer
2. copy from the device buffer to the host buffer
3. write from the host buffer to the file
4. free the host buffer
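To make the read-side pattern concrete, here is a minimal sketch against a plain POSIX file descriptor; `is_device_ptr()` and `copy_to_device()` are hypothetical stand-ins for the MPL/MPIR GPU query and copy that the actual PR reaches through `MPIR_gpu_host_alloc` and friends:

```c
/* Sketch only: bounce-buffer pattern for reading into a possibly-device
 * buffer. The two extern helpers are hypothetical, not MPICH API. */
#include <stdlib.h>
#include <unistd.h>

extern int is_device_ptr(const void *buf);                          /* hypothetical */
extern void copy_to_device(void *dst, const void *src, size_t n);   /* hypothetical */

ssize_t read_gpu_aware(int fd, void *buf, size_t count)
{
    if (!is_device_ptr(buf))
        return read(fd, buf, count);        /* host memory: read directly */

    void *host_buf = malloc(count);         /* 1. allocate temporary host buffer */
    if (host_buf == NULL)
        return -1;
    ssize_t n = read(fd, host_buf, count);  /* 2. read into the host buffer */
    if (n > 0)
        copy_to_device(buf, host_buf, (size_t) n);  /* 3. copy host -> device */
    free(host_buf);                         /* 4. free the host buffer */
    return n;
}
```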


raffenet commented 1 month ago

test:mpich/ch4/most test:mpich/ch3/most

raffenet commented 1 month ago

test:mpich/ch4/most test:mpich/ch3/most

wkliao commented 1 month ago

May I suggest making this a configurable option, if it is not already? This would allow future development in ROMIO to incorporate GPU-to-disk direct I/O.

raffenet commented 1 month ago

> May I suggest making this a configurable option, if it is not already? This would allow future development in ROMIO to incorporate GPU-to-disk direct I/O.

At the moment, this can be enabled/disabled at runtime with the MPIR_CVAR_ENABLE_GPU environment variable from MPICH. We can extend the configurability with a ROMIO-specific setting to facilitate GPUDirect Storage or other GPU-aware development strategies, e.g. pipelined copying.
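For illustration, the runtime gate could look roughly like this; `MPIR_CVAR_ENABLE_GPU` is the real MPICH CVAR named above, while `romio_gpu_bounce_enabled` and `is_device_ptr()` are invented placeholders for a possible ROMIO-specific setting and the device-memory query:

```c
/* Hypothetical gating sketch: take the host bounce-buffer path only when
 * MPICH's GPU support is on and the user buffer actually lives on a device. */
extern int MPIR_CVAR_ENABLE_GPU;            /* real MPICH CVAR (declared by MPICH) */
static int romio_gpu_bounce_enabled = 1;    /* invented ROMIO-level override */
extern int is_device_ptr(const void *buf);  /* hypothetical device-memory query */

static int use_host_bounce_buffer(const void *buf)
{
    if (!MPIR_CVAR_ENABLE_GPU || !romio_gpu_bounce_enabled)
        return 0;
    return is_device_ptr(buf);
}
```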

roblatham00 commented 1 month ago

Will this code always allocate a host region? I was expecting to see something call MPL's "is this device memory?" routine.

raffenet commented 1 month ago

> Will this code always allocate a host region? I was expecting to see something call MPL's "is this device memory?" routine.

The pointer query happens inside MPIR_gpu_host_alloc, which is exposed through the MPIR_Ext interface: https://github.com/pmodels/mpich/blob/deb8fa9f5790475657da697b43c36a8a58ed5d7d/src/include/mpir_gpu_util.h#L36-L49
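Paraphrasing the linked helper: when the incoming pointer is device memory, a host staging buffer is allocated and returned; otherwise the original pointer passes through unchanged. A rough approximation (not the verbatim MPICH source; the real `MPIR_gpu_host_alloc` may differ in signature and error handling):

```c
#include <stdlib.h>

extern int is_device_ptr(const void *buf);  /* hypothetical query; the real
                                               helper asks MPL about the pointer */

/* Approximate shape of the allocate-if-device helper. */
void *gpu_host_alloc_sketch(const void *user_buf, size_t size)
{
    if (!is_device_ptr(user_buf))
        return (void *) user_buf;   /* host memory: no bounce buffer needed */
    return malloc(size);            /* device memory: allocate host staging buffer */
}

/* Matching release: only free if a bounce buffer was actually allocated. */
void gpu_host_free_sketch(void *host_buf, const void *user_buf)
{
    if (host_buf != user_buf)
        free(host_buf);
}
```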

raffenet commented 1 month ago

I did it that way since MPICH uses the MPIR_CVAR_ENABLE_GPU CVAR to control GPU-awareness. Reimplementing the same logic in ROMIO using MPL directly would be extra work, but it would help enable standalone builds in the long run.
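For a standalone build, the device-memory query could call MPL directly, along these lines (a hedged sketch; the exact type and enum names should be checked against MPICH's mpl_gpu.h):

```c
/* Hedged sketch of a direct MPL device-memory query for standalone ROMIO;
 * names follow MPICH's MPL headers but should be verified before use. */
#include "mpl.h"

static int romio_is_device_ptr(const void *buf)
{
    MPL_pointer_attr_t attr;
    if (MPL_gpu_query_pointer_attr(buf, &attr) != MPL_SUCCESS)
        return 0;                   /* on query failure, assume host memory */
    return attr.type == MPL_GPU_POINTER_DEV;
}
```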

raffenet commented 3 weeks ago

@roblatham00 thanks! I will merge this as-is. We can create an issue to track work on potential optimizations. The ones that come to mind are:

  1. skip the host buffer swap in the collective buffering case
  2. pipelined copy-to-host/write-to-file and read-from-file/copy-to-device for large buffers (a rough sketch of this one follows)
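For the second item, a pipelined write could stage a large device buffer through a fixed-size host buffer chunk by chunk, roughly as below. This is a sketch with a hypothetical `copy_from_device()` helper, not the planned implementation; a real pipeline would also overlap the copy of chunk i+1 with the write of chunk i, e.g. via asynchronous copies:

```c
/* Sketch of optimization 2 on the write path: stage a device buffer through
 * a fixed-size host buffer. copy_from_device() is a hypothetical stand-in. */
#include <stdlib.h>
#include <unistd.h>

extern void copy_from_device(void *dst, const void *src, size_t n);  /* hypothetical */

#define CHUNK_SIZE (4 * 1024 * 1024)    /* 4 MiB staging buffer (illustrative) */

ssize_t write_pipelined(int fd, const void *dev_buf, size_t count)
{
    char *host_buf = malloc(CHUNK_SIZE);
    if (host_buf == NULL)
        return -1;
    size_t done = 0;
    while (done < count) {
        size_t n = count - done < CHUNK_SIZE ? count - done : CHUNK_SIZE;
        copy_from_device(host_buf, (const char *) dev_buf + done, n);
        ssize_t w = write(fd, host_buf, n);
        if (w < 0) {
            free(host_buf);
            return -1;
        }
        done += (size_t) w;
    }
    free(host_buf);
    return (ssize_t) done;
}
```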