leofang opened this issue 2 years ago
Hi @leofang
I think it's more correct to consider DistArray a convenience class to create distributed NumPy arrays of the right shape, whereas the PFFT class is the workhorse. So I don't really think DistArray is the correct place to look for a CUDA drop-in replacement. Note that the NumPy work arrays of PFFT are created in the FFT class, and perhaps that is a better place to start, with a serial CUDA transform class?
Thanks, @mikaem! I am not sure I fully follow, maybe I still miss something...
Yes, DistArray is a convenient class to create an array with the right shape locally. But my understanding of the workflow is:
1. Create a PFFT object with the desired transform specified.
2. Use the PFFT object to create a local array using newDistArray.
In terms of code, I am looking at the tutorial:
import numpy as np
from mpi4py import MPI
from mpi4py_fft import PFFT, newDistArray
N = np.array([128, 128, 128], dtype=int)
fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=float, grid=(-1,))
u = newDistArray(fft, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = fft.forward(u, normalize=True)
uj = np.zeros_like(u)
uj = fft.backward(u_hat, uj)
assert np.allclose(uj, u)
print(MPI.COMM_WORLD.Get_rank(), u.shape)
From a CuPy developer/user's perspective, I expect that this code should just work after replacing numpy by cupy, but it'd require newDistArray to return a CuPy array.
My point is that there are NumPy arrays within the PFFT class, or more precisely the FFT class. The arrays you set up with DistArray are copied into these before the transforms start. So just changing DistArray does not change the underlying routines. Transforms are not in-place, which is why I think you need to add a serial FFT class for CuPy just to get started. Is there a serial CuPy FFT library?
What's the status of this? Support for subclassing has since been added to cupy (see the issue mentioned above) and cupy has FFTs as well.
I am just starting to look into this, but I suppose it may be possible to replace np with cp in both DistArray and PFFT and be done with it now? Of course I have no idea about the details...
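For what it's worth, the serial transforms are already there: cupy.fft mirrors numpy.fft and is backed by cuFFT. A quick single-GPU sketch (not tested at scale):

import cupy as cp

a = cp.random.random((64, 64, 64)).astype(cp.complex64)
a_hat = cp.fft.fftn(a, axes=(0, 1, 2))        # serial 3D c2c transform on one GPU
a_back = cp.fft.ifftn(a_hat, axes=(0, 1, 2))  # default normalisation: backward scaled by 1/N
assert cp.allclose(a, a_back, atol=1e-3)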
Are there any plans to add cupy support from your side?
There are no plans as far as I know. But contributions are welcome😃
I didn't have bandwidth for this, so feel free to give it a shot and let us know how it goes!
I started out with CuPyfying the DistArray class. There were a few issues, but I think so far nothing catastrophic.
It is not as simple as replacing np with cp after all, as the signature for __new__ is different and the get function needs to stay on CPU, I believe.
I will continue with the PFFT class next week.
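To make those two differences concrete, a rough sketch (the constructor signatures in the comments are the documented ones; everything else is illustrative):

import numpy as np
import cupy as cp

# numpy.ndarray.__new__: (shape, dtype=float, buffer=None, offset=0, strides=None, order=None)
# cupy.ndarray.__new__:  (shape, dtype=float, memptr=None, strides=None, order='C')
a = cp.ndarray((4, 4), dtype=cp.float32)  # raw constructor, uninitialized device memory
a[...] = 1.0

host = a.get()                  # .get() copies device -> host and returns a numpy.ndarray
assert isinstance(host, np.ndarray)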
I would like to merge this back here eventually. I guess you would reasonably ask for the changes to be tested; however, testing code on GPUs is not straightforward on GitHub, at least without paying. So is there any chance of merging this here? Do you have a way of testing on GPUs? If not, I guess I wouldn't have to pay as much attention to cleaning up the code, which is why I am asking already now. We are currently working on testing CuPy code in pySDC by mirroring to a local GitLab and running the tests on the Jülich supercomputers, but it's unclear when this will be running and how the accounting for compute budget will work.
If you are interested, you can track my progress here.
I won't have time before the holidays, unfortunately, but after that I could help test it locally. I agree too that it should be merged if everything works locally.
For accessing GPUs in GitHub Actions, that's a longer discussion that won't happen overnight; what I can share now is that interested parties are working on this.
I made some progress, but there is still plenty of room for improvement and I was hoping you have some suggestions. First of all, I made some scaling tests with the current implementation on the JUWELS booster cluster in Jülich which has 96 CPU threads per node and 4 A100 40GB GPUs per node. The following is 3D c2c single precision back and forth FFTs with slab decomposition:
You can see up to three lines in each plot, among them the default Alltoallw. [Scaling plots not reproduced here.]
The x-axis is per node to judge if the compute budget is better spent on CPUs or GPUs. Clearly, the GPUs perform pretty well and mpi4py-fft works on them pretty well with only minor changes. The weak scaling plot includes one data point with one GPU (1/4 node), which is basically a single 3D transform with no communication.
However, cuFFTMp is faster by about a factor of two in weak scaling. I see two big issues with the GPU version of mpi4py-fft. First, the default Alltoallw from mpi4py is really slow for some reason. I did not check what is going on under the hood, but the custom Alltoallw is based on a series of ring sends and is faster by about a factor of 10. To conserve memory, I block between each ring send. This can definitely be improved. Matching the performance of NVSHMEM with MPI is, however, not possible afaik, especially for strong scaling. Weak scaling is a higher priority for me, though.
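Roughly, the ring-send idea looks like this (a simplified sketch with equal-sized contiguous blocks rather than the general Alltoallw case, and assuming a CUDA-aware MPI build; not the actual code in the branch):

import cupy as cp
from mpi4py import MPI

def ring_alltoall(comm, send_blocks, recv_blocks):
    # send_blocks / recv_blocks: one contiguous CuPy array per rank
    size, rank = comm.Get_size(), comm.Get_rank()
    cp.cuda.get_current_stream().synchronize()  # data must be ready before MPI touches it
    for shift in range(size):
        dst = (rank + shift) % size
        src = (rank - shift) % size
        if shift == 0:
            recv_blocks[rank][...] = send_blocks[rank]  # local block, no communication
        else:
            # blocking pairwise exchange: only one message in flight per rank at a time
            comm.Sendrecv(send_blocks[dst], dest=dst,
                          recvbuf=recv_blocks[src], source=src)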
But a second big issue is memory operations, which consume about 25% of compute time. First a 2D transform is performed on the slabs, then the data is transposed and a second 1D transform is applied. In the second transform, I can see that a kernel cupy_copy__complex64_complex64 is launched before and after the actual transform and the data is returned in 'F' ordering. I cannot see this kind of memory operation in cuFFTMp, but maybe I missed it. Do you have suggestions as to how I could avoid these memory operations?
Thanks for working on this and sharing your findings, @brownbaerchen, this is very encouraging!
First, the default Alltoallw from mpi4py is really slow for some reason.
I'd let @dalcinl or @mikaem comment on this, but my hunch is it's related to the CUDA aware MPI implementation -- which MPI did you use, and how did you build it?
In the second transform, I can see that a kernel cupy_copy__complex64_complex64 is launched before and after the actual transform and the data is returned in 'F' ordering.
I think this roughly explains the perf difference between CuPy + mpi4py-fft and cuFFTMp + NVSHMEM. If you share your source code of the mpi4py-fft integration, I could take a look next week and try to see if there's any improvement I can suggest.
btw how did you use cuFFTMp in Python?
Regarding the slow Alltoallw, I used OpenMPI and ParaStationMPI (MPICH) modules as installed on the Jülich supercomputers. I trust they installed it well. Maybe the data is copied to host and back with this due to non-contiguous buffers? I should check that at some point.
I didn't actually run cuFFTMp in Python. I may try to build bindings for that at some point. I just ran this file from the NVIDIA samples repository.
My code is available on this branch. I guess the main thing is the FFT backend, which has the following code:
def _Xfftn_plan_cupy(shape, axes, dtype, transforms, options):
    import cupy as cp

    transforms = {} if transforms is None else transforms
    if tuple(axes) in transforms:
        plan_fwd, plan_bck = transforms[tuple(axes)]
    else:
        if cp.issubdtype(dtype, cp.floating):
            plan_fwd = cp.fft.rfftn
            plan_bck = cp.fft.irfftn
        else:
            plan_fwd = cp.fft.fftn
            plan_bck = cp.fft.ifftn
    s = tuple(np.take(shape, axes))
    U = cp.array(fftw.aligned(shape, dtype=dtype))  # TODO: avoid going via CPU
    _V = plan_fwd(U, s=s, axes=axes)
    V = cp.array(fftw.aligned_like(_V.get()))  # TODO: avoid going via CPU
    M = np.prod(s)
    # CuPy has forward transform unscaled and backward scaled with 1/N
    return (
        _Yfftn_wrap(plan_fwd, U, V, 1, {'s': s, 'axes': axes}),
        _Yfftn_wrap(plan_bck, V, U, M, {'s': s, 'axes': axes}),
    )
It's basically copy-paste from the numpy backend. I also tried cupyx.scipy.fftpack.fftn with overwrite_x=True, but the result was the same.
I don't construct plans myself because they are cached during the first transform and that is fine by me. But maybe there is a plan that avoids going to 'F' ordering?
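For reference, a cuFFT plan can also be created explicitly through cupyx and reused (a sketch; as far as I understand this only controls planning and caching and does not by itself change the memory layout of the output):

import cupy as cp
from cupyx.scipy.fftpack import get_fft_plan

a = cp.random.random((64, 64, 64)).astype(cp.complex64)
plan = get_fft_plan(a, axes=(0, 1, 2), value_type='C2C')  # reusable cuFFT plan

with plan:  # cupy.fft picks up the plan from the context manager
    a_hat = cp.fft.fftn(a, axes=(0, 1, 2))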
@brownbaerchen would this https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/JAX_FFT help running cuFFTMp in Python?
Thanks for mentioning this! Since I haven't done any work with pybind11 or JAX before, this is a bit opaque to me. I am wondering why this was done with JAX and not with CuPy. I want to do distributed FFTs in a library that relies on subclassing numpy arrays. Subclassing cupy arrays works just fine and transitioning to JAX would probably be feasible but I haven't tried. At this point, I don't know what the best course of action is.
I guess I have three options:
Any recommendations?
Sorry for the long delay. @brownbaerchen Your snippet above caught my attention. You shouldn't need the lines calling fftw.aligned() and then converting to CuPy arrays. It's a waste of memory bandwidth. fftw.aligned() is just a convenient helper function allocating aligned CPU buffers. Could you try removing it and redo the benchmarks?
Sorry for the long delay. @brownbaerchen Your snippet above caught my attention. You shouldn't need the lines calling fftw.aligned() and then converting to CuPy arrays. It's a waste of memory bandwidth. fftw.aligned() is just a convenient helper function allocating aligned CPU buffers. Could you try removing it and redo the benchmarks?
You're right, thanks for your input! I ignored this so far because it is only done once during setup. Indeed, just removing the detour via fftw.aligned() significantly speeds up instantiation of the PFFT object. But once it is instantiated, it doesn't make any difference. Since I didn't include the setup in the plots, the results don't change. In particular, the expensive memory operations remain. Still, your comment is very useful in practice!
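Concretely, the work-array lines in the snippet above then become something like this (a sketch using the same names as in _Xfftn_plan_cupy):

U = cp.zeros(shape, dtype=dtype)   # allocate directly on the device, no fftw.aligned()
_V = plan_fwd(U, s=s, axes=axes)   # dry run just to determine the output shape and dtype
V = cp.zeros_like(_V)              # likewise, no .get() round trip through the host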
Cool, if you send a draft PR to this repo, it might be easier for me and the maintainers (Mikael/Lisandro) here to review/suggest changes, and Mikael/Lisandro might have thoughts on the PR scope and your custom alltoallw. I did take another quick glance at your branch but didn't see other major issues.
Regarding the slow Alltoallw, I used OpenMPI and ParaStationMPI (MPICH) modules as installed on the Jülich supercomputers. I trust they installed it well. Maybe the data is copied to host and back with this due to non-contiguous buffers? I should check that at some point.
Totally forgot to follow up, sorry. Since you tried two flavors of MPI implementations, do you happen to know if they are both CUDA-aware (that is, you can just pass a GPU array to mpi4py and the underlying MPI impl knows how to handle GPU pointers)? If you already have mpi4py/CuPy in your local env, you can easily test them following Step 4 in https://github.com/conda-forge/openmpi-feedstock/pull/128#pullrequestreview-1789169410.
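For instance, a minimal smoke test along those lines (run with two or more ranks):

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
sendbuf = cp.arange(8, dtype=cp.float32)
recvbuf = cp.empty_like(sendbuf)
cp.cuda.get_current_stream().synchronize()  # make sure the data is ready before MPI reads it
comm.Allreduce(sendbuf, recvbuf)            # fails here if the MPI build is not CUDA-aware
assert cp.allclose(recvbuf, sendbuf * comm.Get_size())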
Hi
Sorry for not following up. This looks very promising, but unfortunately I have almost no experience with CUDA or GPUs in general, so I don't think I can be of much help. Not sure about @dalcinl? If this works, I believe it would be a great addition to mpi4py-fft though. A PR is welcome :-)
Regarding Alltoallw, I remember we had some success experimenting with MPI environment variables on the supercomputer at KAUST, getting performance close to Alltoall. But it has been a while since I looked at this and I'm getting old and forgetful.
@leofang, yes, both MPI implementations are built with CUDA support. I talked to the support here in Jülich and they also don't know why this is happening, but assured me that I loaded the correct modules. It would be nice if someone could test this on a different machine maybe. Since it seems that NCCL is preferable over MPI anyway, maybe this doesn't matter too much, though.
@brownbaerchen It would be nice to gather more info. For Open MPI, please share the output of ompi_info. For MPICH, please share the output of mpichversion. Also, in both MPI implementations, is CUDA-aware UCX used?
@brownbaerchen It would be nice to gather more info. For Open MPI, please share the output of ompi_info. For MPICH, please share the output of mpichversion. Also, in both MPI implementations, is CUDA-aware UCX used?
Since the output is quite lengthy, I uploaded the output of ompi_info here and the output of mpichversion here.
I do explicitly load a module enabling CUDA support for UCX and MPI. However, this is the default behaviour on this machine anyway, as it is very GPU-centric.
After @mhrywniak gave me some tips on how to use Nsight, I could finally see where the expensive memory operations occur that I have been talking about. To recap: the launch of cupy_copy__complex64_complex64 kernels has held back performance of CuPy in mpi4py-fft significantly. It turns out that this kernel is launched when doing array[...] = data and can be replaced by cupy.copyto(array, data) to launch a D2D memcpy operation, which is much faster. For low node counts the overhead compared to cuFFTMp is now about 1.5x instead of 2x as before.
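In code, the change is roughly this (the first form launches the elementwise copy kernel, the second was observed to issue a plain device-to-device copy):

import cupy as cp

data = cp.random.random((128, 128, 128)).astype(cp.complex64)
array = cp.empty_like(data)

# array[...] = data        # elementwise cupy_copy__complex64_complex64 kernel
cp.copyto(array, data)     # device-to-device copy for matching shape and dtype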
However, there is still a kernel cupy_multiply__complex64_complex_complex64, which is very expensive. It takes at least as long as the serial FFTs and is launched when doing the normalisation for the FFT. In the code this looks like array *= float. I tried replacing this with cupy.multiply(array, float), but it was not any faster. It seems odd to me that this multiplication is relatively so expensive, and I was wondering if you have any idea whether it can be done faster?
Also, I experimented with streams in the NCCL Alltoallw implementation, but I was not able to gain significant speedup with this. Suggestions on how to improve this are welcome!
Any updates on cuFFTMp python bindings?
However, there is still a kernel cupy_multiply__complex64_complex_complex64, which is very expensive.
Weird name, because ...
It takes at least as long as the serial FFTs and is launched when doing the normalisation for the FFT. In the code this looks like array *= float.
... normalization is about scaling with a real value, not a complex value.
Is that kernel promoting the float scaling factor to complex, and then doing complex * complex operations?
What if you try to create a view of the array as real floating, then multiply in-place by the scaling factor?
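A sketch of that suggestion, assuming a contiguous complex64 array and a CuPy version where view() with a different itemsize is supported:

import cupy as cp

u_hat = cp.random.random((64, 64, 64)).astype(cp.complex64)
scale = cp.float32(0.5)

real_view = u_hat.view(cp.float32)  # reinterpret as interleaved real/imag values, no copy
real_view *= scale                  # in-place real*real multiply instead of complex*complex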
Good point @dalcinl! Indeed, multiplying a real view in-place results in a kernel called cupy_multiply__float32_float_float32. Unfortunately, it is not any faster...
I noticed something else, though, but I am not sure if it is 100% correct. So please confirm the following:
In the _Xfftn_plan_... functions, e.g. here, the factor M is used to revert any scaling done by the FFT library. So since NumPy has an unscaled forward transform and a backward transform scaled by $1/n$, M is 1 for the forward and $n$ for the backward transform, such that the data comes out entirely unscaled. Only in the FFT class, e.g. here, is the scaling applied that is visible to the user of mpi4py-fft. But NumPy allows changing the normalisation like so. So we can call the forward transform for NumPy and tell it not to scale it by giving norm='backward' and then call the backward transform and also tell NumPy not to scale it by passing norm='forward'. Then we choose M=1 for both in the _Xfftn_plan_numpy wrapper and save many operations.
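A minimal sketch of that unscaled round trip (NumPy shown; the CuPy calls take the same norm argument):

import numpy as np

u = np.random.random((8, 8, 8)) + 1j * np.random.random((8, 8, 8))
u_hat = np.fft.fftn(u, norm='backward')       # forward transform, unscaled
u_back = np.fft.ifftn(u_hat, norm='forward')  # backward transform, also unscaled
# the round trip comes back scaled by the transform size, as with the old M = prod(s)
assert np.allclose(u_back, u * u.size)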
CuPy has the same normalisation as NumPy and the result seems to be the same if I prevent rescaling and choose M=1, but it's faster because it launches two fewer cupy_multiply_... kernels.
Does this have any side effects that I didn't consider?
So we can call the forward transform for NumPy and tell it not to scale it by giving norm='backward' and then call the backward transform and also tell NumPy not to scale it by passing norm='forward'. Then we choose M=1 for both in the _Xfftn_plan_numpy wrapper and save many operations.
Definitely. Good catch! The NumPy backend was kind of a proof of concept; I think we did not really think about the proper, performant way to do it.
CuPy has the same normalisation as NumPy and the result seems to be the same if I prevent rescaling and choose M=1, but it's faster because it launches two fewer cupy_multiply_... kernels.
Of course.
Does this have any side effects that I didn't consider?
None that I can think of.
@dalcinl I had a tiny bit of time over the holidays to pick up this old task that has been on my mind for a couple of years. I noticed DistArray is the workhorse behind mpi4py-fft; however, it requires subclassing numpy.ndarray (https://github.com/mpi4py/mpi4py-fft/blob/ac510f8398f138bb860ed9f2580343e8b98cf799/mpi4py_fft/distarray.py#L10), which makes it not possible to use cupy.ndarray as a drop-in replacement, as CuPy currently does not support subclassing (https://github.com/cupy/cupy/issues/3972).
Is it absolutely necessary to make DistArray a subclass? Can it be tweaked (hopefully mildly) to become an array container supporting both NumPy and CuPy arrays (and potentially other array/tensor libraries)? Could this change break backward compatibility? From my first glance it seems possible, but I'd like to discuss it further and see if I'm missing anything.
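A rough sketch of what a container-style DistArray could look like (hypothetical names, not the mpi4py-fft API), just to make the idea concrete:

import numpy as np

class DistArrayContainer:
    """Hypothetical container holding the local block of a distributed array.

    Instead of subclassing numpy.ndarray, the local data lives in an attribute,
    so it can be a numpy.ndarray or a cupy.ndarray interchangeably.
    """

    def __init__(self, global_shape, local_slice, comm, xp=np, dtype=float):
        self.global_shape = tuple(global_shape)
        self.local_slice = tuple(local_slice)  # slices of this rank's block in the global array
        self.comm = comm                       # MPI communicator owning the decomposition
        self.xp = xp                           # numpy or cupy module
        local_shape = tuple(s.stop - s.start for s in self.local_slice)
        self.local = xp.zeros(local_shape, dtype=dtype)

    # forward the parts of the ndarray interface that callers actually use
    @property
    def shape(self):
        return self.local.shape

    def __getitem__(self, key):
        return self.local[key]

    def __setitem__(self, key, value):
        self.local[key] = value

Whether forwarding enough of the interface this way is "mild" enough, and whether downstream code relies on DistArray actually being an ndarray, is exactly the backward-compatibility question raised above.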