icl-utk-edu / heffte

BSD 3-Clause "New" or "Revised" License

Fuse packing/unpacking kernels for reshape3d_alltoall #24

Open mabraham opened 1 year ago

mabraham commented 1 year ago

Currently, reshape3d_alltoall for N ranks launches N packing kernels before the MPI_Alltoall and N unpacking kernels after it. As the rank count grows, the overhead of launching and waiting on those kernels grows linearly with N. In sufficiently regular cases, the loop over ranks in heffte::reshape3d_alltoall::apply_base() can be lowered into the device kernel. I have working SYCL code that does this and shows a clear performance improvement even for small N. Is this an optimization you'd consider incorporating if I contribute it?

mkstoyanov commented 1 year ago

As a general rule, anything that improves performance is worth considering and should probably be included. If you want, point me to the prototype code before you go to the trouble of making a formal PR; you could even just give me the kernel and I can handle the integration with the rest of the library and the other backends.

On the other hand, I don't recommend running on so many nodes with so little data per node that kernel launch overhead becomes an issue, but then again, it is a valid use case.

If you merge the for-loop into the kernel, each iteration of the loop will manage a different amount of data, which can itself lead to performance issues. This is precisely why I didn't do it in CUDA; calling one packing kernel at a time also makes it easier to pipeline packing and sending. I can certainly see how the SYCL logic would be easier to generalize (hopefully without loss of performance), and since all-to-all doesn't pipeline, we could see a performance boost here.