mpi4py / mpi4py-fft

mpi4py-fft is a Python package for computing Fast Fourier Transforms (FFTs).

Use mpi4py-fft with CuPy arrays? #14

Open leofang opened 2 years ago

leofang commented 2 years ago

@dalcinl I had a tiny bit of time over the holidays to pick up this old task that has been on my mind for a couple of years. I noticed that DistArray is the workhorse behind mpi4py-fft; however, it requires subclassing numpy.ndarray (https://github.com/mpi4py/mpi4py-fft/blob/ac510f8398f138bb860ed9f2580343e8b98cf799/mpi4py_fft/distarray.py#L10), which makes it impossible to use cupy.ndarray as a drop-in replacement, as CuPy currently does not support subclassing (https://github.com/cupy/cupy/issues/3972).

Is it absolutely necessary to make DistArray a subclass? Can it be tweaked (hopefully mildly) to become an array container that supports both NumPy and CuPy arrays (and potentially other array/tensor libraries)? Would this change break backward compatibility? From a first glimpse it seems possible, but I'd like to discuss further and see if I'm missing anything.

mikaem commented 2 years ago

Hi @leofang, I think it's more correct to consider DistArray a convenience class for creating distributed NumPy arrays of the right shape, whereas the PFFT class is the workhorse. So I don't really think DistArray is the correct place to look for a CUDA drop-in replacement. Note that the NumPy work arrays of PFFT are created in the FFT class, and perhaps that is a better place to start, with a serial CUDA transform class?

leofang commented 2 years ago

Thanks, @mikaem! I am not sure I fully follow, maybe I am still missing something...

Yes, DistArray is a convenience class to create an array with the right shape locally. But my understanding of the workflow is

In terms of code, I am looking at the tutorial:

import numpy as np
from mpi4py import MPI
from mpi4py_fft import PFFT, newDistArray

N = np.array([128, 128, 128], dtype=int)
fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=float, grid=(-1,))
u = newDistArray(fft, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = fft.forward(u, normalize=True)
uj = np.zeros_like(u)
uj = fft.backward(u_hat, uj)
assert np.allclose(uj, u)
print(MPI.COMM_WORLD.Get_rank(), u.shape)

From a CuPy developer/user's perspective, I would expect this code to just work after replacing numpy with cupy, but that would require newDistArray to return a CuPy array.

mikaem commented 2 years ago

My point is that there are NumPy arrays within the PFFT class, or more precisely within the FFT class. The arrays you set up with DistArray are copied into these before the transforms start, so just changing DistArray does not change the underlying routines. Transforms are not in-place. Which is why I think you need to add a serial FFT class for CuPy just to get started. Is there a serial CuPy FFT library?
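
For reference, CuPy does provide serial FFTs via cupy.fft, mirroring numpy.fft. A minimal single-GPU check, assuming CuPy is installed, could look like this sketch:

import cupy as cp

a = cp.random.random((128, 128, 128)).astype(cp.complex64)
a_hat = cp.fft.fftn(a, axes=(0, 1, 2))        # forward transform on the GPU, unscaled by default
a_back = cp.fft.ifftn(a_hat, axes=(0, 1, 2))  # backward transform, scaled by 1/N by default
assert cp.allclose(a, a_back, atol=1e-4)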

brownbaerchen commented 11 months ago

What's the status of this? Support for subclassing has since been added to CuPy (see the issue mentioned above), and CuPy has FFTs as well. I am just starting to look into this, but I suppose it may be possible to replace np with cp in both DistArray and PFFT and be done with it now? Of course, I have no idea about the details... Are there any plans to add CuPy support on your side?

mikaem commented 11 months ago

There are no plans as far as I know. But contributions are welcome😃

leofang commented 11 months ago

I didn't have bandwidth for this, so feel free to give it a shot and let us know how it goes!

brownbaerchen commented 11 months ago

I started out with CuPy-fying the DistArray class. There were a few issues, but so far nothing catastrophic. It is not as simple as replacing np with cp after all: the signature of __new__ is different, and the get function needs to keep returning a host (CPU) array, I believe. I will continue with the PFFT class next week.
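
Purely as an untested illustration of those two points (the class name and the extra rank attribute below are made up, not the actual branch code):

import cupy as cp

class DistArrayCuPy(cp.ndarray):  # hypothetical name for illustration only
    def __new__(cls, shape, dtype=float, rank=0):
        # cupy.ndarray's constructor takes (shape, dtype, memptr, strides, order),
        # so the numpy-style buffer/offset arguments do not carry over
        obj = super().__new__(cls, shape, dtype=dtype)
        obj.rank = rank  # made-up extra attribute for illustration
        return obj

    def get(self, *args, **kwargs):
        # whatever get() computes, the result has to stay a NumPy (host) array,
        # e.g. by converting with cp.asnumpy at the end
        return cp.asnumpy(self)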

I would like to merge this back here eventually. I guess you would reasonably ask for the changes to be tested. However, testing code on GPUs is not straightforward on GitHub, at least without paying. So is there any chance of merging this here? Do you have a way of testing on GPUs? Otherwise, I guess I would not have to pay as much attention to cleaning up the code. That's why I am asking already now. We are currently working on testing CuPy code in pySDC by mirroring to a local GitLab and running the tests on the Jülich supercomputers, but it's unclear when this will be running and how the accounting for the compute budget will work.

If you are interested, you can track my progress here.

leofang commented 11 months ago

I won't have time before the holidays, unfortunately, but after that I could help test it locally. I also agree that it should be merged if everything works locally.

As for accessing GPUs in GitHub Actions, that's a longer discussion that won't happen overnight; what I can share now is that interested parties are working on it.

brownbaerchen commented 10 months ago

I made some progress, but there is still plenty of room for improvement, and I was hoping you have some suggestions. First of all, I ran some scaling tests with the current implementation on the JUWELS Booster cluster in Jülich, which has 96 CPU threads and 4 A100 40GB GPUs per node. The following is 3D c2c single-precision back-and-forth FFTs with slab decomposition (see the attached weak_scaling and strong_scaling plots).

You can see up to three lines:

The x-axis is per node, to judge whether the compute budget is better spent on CPUs or GPUs. Clearly, the GPUs perform well, and mpi4py-fft works on them with only minor changes. The weak scaling plot includes one data point with a single GPU (1/4 node), which is basically a single 3D transform with no communication.

However, cuFFTMp is faster by about a factor of two in weak scaling. I see two big issues with the GPU version of mpi4py-fft.

First, the default Alltoallw from mpi4py is really slow for some reason. I did not check what is going on under the hood, but the custom Alltoallw is based on a series of ring sends and is faster by about a factor of 10. To conserve memory, I block between each ring send; this can definitely be improved. Matching the performance of NVSHMEM with MPI is, however, not possible afaik, especially for strong scaling. Weak scaling is a higher priority for me, though.

The second big issue is memory operations, which consume about 25% of compute time. First a 2D transform is performed on the slabs, then the data is redistributed and a second 1D transform is applied. In the second transform, I can see that a kernel cupy_copy__complex64_complex64 is launched before and after the actual transform and the data is returned in 'F' ordering. I cannot see this kind of memory operation in cuFFTMp, but maybe I missed it. Do you have suggestions as to how I could avoid these memory operations?

leofang commented 10 months ago

Thanks for working on this and sharing your findings, @brownbaerchen, this is very encouraging!

First, the default Alltoallw from mpi4py is really slow for some reason.

I'd let @dalcinl or @mikaem comment on this, but my hunch is that it's related to the CUDA-aware MPI implementation -- which MPI did you use, and how did you build it?

In the second transform, I can see that a kernel cupy_copy__complex64_complex64 is launched before and after the actual transform and the data is returned in 'F' ordering.

I think this roughly explains the perf difference between CuPy + mpi4py-fft and cuFFTMp + NVSHMEM. If you share your source code of the mpi4py-fft integration, I could take a look next week and try to see if there's any improvement I can suggest.

btw how did you use cuFFTMp in Python?

brownbaerchen commented 10 months ago

Regarding the slow Alltoallw, I used the Open MPI and ParaStationMPI (MPICH) modules as installed on the Jülich supercomputers. I trust they installed them well. Maybe the data is copied to the host and back due to non-contiguous buffers? I should check that at some point.

I didn't actually run cuFFTMp in Python. I may try to build bindings for that at some point. I just ran this file from the NVIDIA samples repository.

My code is available on this branch. I guess the main thing is the FFT backend, which has the following code:

def _Xfftn_plan_cupy(shape, axes, dtype, transforms, options):
    # np and fftw refer to the module-level imports in mpi4py_fft/libfft.py
    import cupy as cp

    transforms = {} if transforms is None else transforms
    if tuple(axes) in transforms:
        plan_fwd, plan_bck = transforms[tuple(axes)]
    else:
        if cp.issubdtype(dtype, cp.floating):
            plan_fwd = cp.fft.rfftn
            plan_bck = cp.fft.irfftn
        else:
            plan_fwd = cp.fft.fftn
            plan_bck = cp.fft.ifftn

    s = tuple(np.take(shape, axes))
    # work arrays on the GPU, currently filled via aligned host buffers
    U = cp.array(fftw.aligned(shape, dtype=dtype))  # TODO: avoid going via CPU
    _V = plan_fwd(U, s=s, axes=axes)
    V = cp.array(fftw.aligned_like(_V.get()))  # TODO: avoid going via CPU
    M = np.prod(s)

    # CuPy has the forward transform unscaled and the backward transform scaled with 1/N
    return (
        _Yfftn_wrap(plan_fwd, U, V, 1, {'s': s, 'axes': axes}),
        _Yfftn_wrap(plan_bck, V, U, M, {'s': s, 'axes': axes}),
    )

It's basically copy-paste from the numpy backend. I also tried cupyx.scipy.fftpack.fftn with overwrite_x=True, but the result was the same.

I don't construct plans myself because they are cached during the first transform, and that is fine by me. But maybe there is a plan that avoids going to 'F' ordering?
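
For what it's worth, CuPy also lets you create a cuFFT plan explicitly and use it as a context manager. I'm not sure it changes the ordering behaviour, but a minimal sketch (assuming a 3D complex64 array) would be:

import cupy as cp
from cupyx.scipy.fftpack import get_fft_plan

a = cp.random.random((128, 128, 128)).astype(cp.complex64)
plan = get_fft_plan(a, axes=(0, 1, 2), value_type='C2C')  # explicit cuFFT plan
with plan:
    a_hat = cp.fft.fftn(a, axes=(0, 1, 2))  # reuses the plan from the context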

llukas commented 10 months ago

@brownbaerchen would this https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/JAX_FFT help running cuFFTMp in Python?

brownbaerchen commented 10 months ago

Thanks for mentioning this! Since I haven't done any work with pybind11 or JAX before, this is a bit opaque to me. I am wondering why this was done with JAX and not with CuPy. I want to do distributed FFTs in a library that relies on subclassing NumPy arrays. Subclassing CuPy arrays works just fine, and transitioning to JAX would probably be feasible, but I haven't tried. At this point, I don't know what the best course of action is.

I guess I have three options:

Any recommendations?

leofang commented 9 months ago

Sorry for the long delay. @brownbaerchen Your snippet above caught my attention. You shouldn't need the lines calling fftw.aligned() and then converting to CuPy arrays; it's a waste of memory bandwidth. fftw.aligned() is just a convenience helper for allocating aligned CPU buffers. Could you try removing it and redoing the benchmarks?
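
A minimal sketch of that change inside _Xfftn_plan_cupy above (same variable names as in the snippet; untested):

s = tuple(np.take(shape, axes))
U = cp.zeros(shape, dtype=dtype)   # allocate the input work array directly on the GPU
_V = plan_fwd(U, s=s, axes=axes)   # run one transform just to get the output shape and dtype
V = cp.zeros_like(_V)              # allocate the output work array on the GPU, no host detour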

brownbaerchen commented 9 months ago

You're right, thanks for your input! I had ignored this so far because it is only done once during setup. Indeed, just removing the detour via fftw.aligned() significantly speeds up instantiation of the PFFT object, but once it is instantiated it doesn't make any difference. Since I didn't include the setup in the plots, the results don't change; in particular, the expensive memory operations remain. Still, your comment is very useful in practice!

leofang commented 9 months ago

Cool. If you send a draft PR to this repo, it might be easier for me and the maintainers (Mikael/Lisandro) to review and suggest changes, and Mikael/Lisandro might have thoughts on the PR scope and your custom Alltoallw. I did take another quick glance at your branch but didn't see any other major issues.

Regarding the slow Alltoallw, I used the Open MPI and ParaStationMPI (MPICH) modules as installed on the Jülich supercomputers. I trust they installed them well. Maybe the data is copied to the host and back due to non-contiguous buffers? I should check that at some point.

Totally forgot to follow up, sorry. Since you tried two flavors of MPI implementation, do you happen to know if they are both CUDA-aware (that is, you can just pass a GPU array to mpi4py and the underlying MPI implementation knows how to handle GPU pointers)? If you already have mpi4py/CuPy in your local env, you can easily test them following Step 4 in https://github.com/conda-forge/openmpi-feedstock/pull/128#pullrequestreview-1789169410.
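
A quick self-contained check along those lines (run with, e.g., mpiexec -n 2 python check_cuda_aware.py, a hypothetical script name; it will fail or crash if the MPI build is not CUDA-aware):

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
# pass device arrays straight to MPI; this only works with a CUDA-aware build
sendbuf = cp.arange(10, dtype=cp.float32)
recvbuf = cp.empty_like(sendbuf)
cp.cuda.get_current_stream().synchronize()  # make sure the data is ready before MPI touches it
comm.Allreduce(sendbuf, recvbuf)
assert cp.allclose(recvbuf, sendbuf * comm.Get_size())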

mikaem commented 9 months ago

Hi, sorry for not following up. This looks very promising, but unfortunately I have almost no experience with CUDA or GPUs in general, and I don't think I can be of much help. Not sure about @dalcinl? If this works, I believe it would be a great addition to mpi4py-fft, though. A PR is welcome :-) Regarding Alltoallw, I remember we had some success experimenting with MPI environment variables on the supercomputer at KAUST, getting performance close to Alltoall. But it has been a while since I looked at this, and I'm getting old and forgetful.

brownbaerchen commented 9 months ago

@leofang, yes, both MPI implementations are built with CUDA support. I talked to the support team here in Jülich and they also don't know why this is happening, but they assured me that I loaded the correct modules. It would be nice if someone could test this on a different machine. Since it seems that NCCL is preferable over MPI anyway, maybe this doesn't matter too much, though.

leofang commented 9 months ago

@brownbaerchen It would be nice to gather more info. For Open MPI, please share the output of ompi_info. For MPICH, please share the output of mpichversion. Also, in both MPI implementations, is CUDA-aware UCX used?

brownbaerchen commented 9 months ago

Since the output is quite lengthy, I uploaded the output of ompi_info here and the output of mpichversion here. I do explicitly load a module enabling CUDA support for UCX and MPI. However, this is the default behaviour on this machine anyway, as it is very GPU-centric.

brownbaerchen commented 8 months ago

After @mhrywniak gave me some tips on how to use Nsight, I could finally see where the expensive memory operations I have been talking about occur. To recap: the launch of cupy_copy__complex64_complex64 kernels has held back the performance of CuPy in mpi4py-fft significantly. It turns out that this kernel is launched when doing array[...] = data and can be replaced by cupy.copyto(array, data), which launches a D2D memcpy operation and is much faster. For low node counts the overhead compared to cuFFTMp is now about 1.5x instead of 2x as before.

However, there is still a kernel cupy_multiply__complex64_complex_complex64, which is very expensive. It takes at least as long as the serial FFTs and is launched when doing the normalisation for the FFT. In the code this looks like array *= float. I tried replacing this with cupy.multiply(array, float), but it was not any faster. It seems odd to me that this multiplication is relatively so expensive, and I was wondering if you have any idea whether it can be done faster?
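
For reference, the copy replacement described above boils down to something like this (a toy sketch; arr and data just stand for two equally shaped complex64 CuPy arrays):

import cupy as cp

arr = cp.empty((128, 128, 128), dtype=cp.complex64)
data = cp.ones_like(arr)

arr[...] = data       # launches an elementwise copy kernel (cupy_copy__complex64_complex64)
cp.copyto(arr, data)  # for matching shape/dtype/contiguity this can run as a plain D2D copy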

Also, I experimented with streams in the NCCL Alltoallw implementation, but I was not able to gain significant speedup with this. Suggestions on how to improve this are welcome!

Any updates on cuFFTMp python bindings?

dalcinl commented 8 months ago

However, there is still a kernel cupy_multiply__complex64_complex_complex64, which is very expensive.

Weird name, because ...

It takes at least as long as the serial FFTs and is launched when doing the normalisation for the FFT. In the code this looks like array *= float

... normalization is about scaling with a real value, not a complex value. Is that kernel promoting the float scaling factor to complex and then doing complex * complex operations? What if you try to create a view of the array as real floating point, then multiply in place by the scaling factor?
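
A sketch of that suggestion in CuPy terms (u_hat and scale are placeholders; the float32 view reinterprets the complex64 buffer without copying):

import cupy as cp

u_hat = cp.ones((128, 128, 128), dtype=cp.complex64)
scale = 1.0 / u_hat.size

u_hat_real = u_hat.view(cp.float32)  # real view of the complex data, twice as many elements
u_hat_real *= scale                  # in-place real * real scaling instead of complex * complex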

brownbaerchen commented 8 months ago

Good point @dalcinl! Indeed, multiplying a real view in-place results in a kernel called cupy_multiply__float32_float_float32. Unfortunately, it is not any faster...

I noticed something else, though, but I am not sure if it is 100% correct, so please confirm the following: in the _Xfftn_plan_... functions, e.g. here, the factor M is used to revert any scaling done by the FFT library. So if NumPy has the forward transform unscaled and the backward transform scaled by $1/n$, M is 1 for forward and $n$ for backward, such that the data comes out entirely unscaled. Only in the FFT class, e.g. here, is the scaling applied that is visible to the user of mpi4py-fft. But NumPy allows changing the normalisation like so. So we can call the forward transform for NumPy and tell it not to scale it by giving norm='backward' and then call the backward transform and also tell NumPy not to scale it by passing norm='forward'. Then we choose M=1 for both in the _Xfftn_plan_numpy wrapper and save many operations. CuPy has the same normalisation as NumPy and the result seems to be the same if I prevent rescaling and choose M=1, but it's faster because it does two fewer cupy_multiply_... kernels. Does this have any side effects that I didn't consider?
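
In code, the idea is roughly the following sketch (using functools.partial to pin the norm arguments; I haven't checked it against the actual wrapper):

import functools
import cupy as cp

# forward with norm='backward' applies no scaling, and backward with norm='forward'
# applies no scaling either, so M can stay 1 in both wrapper entries
plan_fwd = functools.partial(cp.fft.fftn, norm='backward')
plan_bck = functools.partial(cp.fft.ifftn, norm='forward')

u = cp.random.random((64, 64, 64)).astype(cp.complex64)
# both transforms are now unscaled, so the round trip returns u multiplied by N
assert cp.allclose(plan_bck(plan_fwd(u)) / u.size, u, atol=1e-3)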

dalcinl commented 8 months ago

So we can call the forward transform for NumPy and tell it not to scale it by giving norm='backward' and then call the backward transform and also tell NumPy not to scale it by passing norm='forward'. Then we choose M=1 for both in the _Xfftn_plan_numpy wrapper and save many operations.

Definitely, good catch! The NumPy backend was kind of a proof of concept; I think we did not really think about the proper, performant way to do it.

CuPy has the same normalisation as NumPy and the result seems to be the same if I prevent rescaling and choose M=1, but it's faster because it does two fewer cupy_multiply_... kernels.

Of course.

Does this have any side effects that I didn't consider?

None that I can think of.