icl-utk-edu / heffte

BSD 3-Clause "New" or "Revised" License

Fuse packing/unpacking kernels for reshape3d_alltoall #24

Open mabraham opened 1 year ago

mabraham commented 1 year ago

Currently, reshape3d_alltoall for N ranks launches N packing kernels before the MPI_Alltoall and N unpacking kernels after it. As the rank count grows, the overhead of launching and waiting on those kernels grows linearly with N. In sufficiently regular cases, the loop over ranks in heffte::reshape3d_alltoall::apply_base() can be lowered into the device kernel. I have working SYCL code that does this and shows a clear performance improvement even for small N. Is this an optimization you'd consider incorporating if I contribute it?

mkstoyanov commented 1 year ago

As a general rule, anything that improves performance is worth considering and should probably be included. If you want, point me to the prototype code before you go to the trouble of making a formal PR; you could even just give me the kernel and I can handle the integration with the rest of the library and the other backends.

On the other hand, I don't recommend running on so many nodes with so little data per node that kernel launch overhead becomes an issue, but then again, it is a valid use case.

If you merge the for-loop into the kernel, each iteration of the loop will manage a different amount of data, which can itself lead to performance issues. This is precisely why I didn't do it in CUDA; calling one packing kernel at a time also makes it easier to pipeline packing and sending. I can certainly see how the SYCL logic would be easier to generalize (hopefully without loss of performance), and since all-to-all doesn't pipeline, we could see a performance boost here.