jipolanco / PencilFFTs.jl

Fast Fourier transforms of MPI-distributed Julia arrays
https://jipolanco.github.io/PencilFFTs.jl/dev/
MIT License

PencilFFTs on GPUs? #3

Closed: ali-ramadhan closed this issue 2 years ago

ali-ramadhan commented 4 years ago

Hi @jipolanco this package looks really great, thank you for working on it! Documentation is great for such a new package. It's what I've been looking for to add distributed parallelism to Oceananigans.jl.

We run on both CPUs and GPUs, so I was wondering if you knew whether PencilFFTs.jl would easily generalize to CuArrays? From skimming through the source code I feel like not much has to change, as MPI functions should dispatch on the array type, but maybe the FFT plans would have to be done a little differently? I think cuFFT has a pretty similar interface to FFTW so it shouldn't be a big change, but cuFFT doesn't do REDFT and RODFT so some plans would not be supported I guess.

I will try to get a parallel version working with PencilFFTs.jl on CPUs first, though.

More than happy to help with adding GPU support.

jipolanco commented 4 years ago

Hi Ali, thanks again for your interest! I would be happy to assist with making PencilFFTs work with Oceananigans, so let me know if you have any questions or suggestions.

The CPU version should be relatively easy to implement. Just note that for now, PencilFFTs doesn't support in-place FFTs (which I noticed you're using for the Poisson solver), so in principle you would need separate storage for inputs and outputs. In this case I would suggest using real-valued inputs and real-to-complex (r2c) transforms. As an alternative, I'm planning to add support for in-place c2c and r2r transforms in the near future.
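
To make the separate-storage point concrete, here is a minimal out-of-place r2c sketch along the lines of the PencilFFTs README; the grid size, process grid and variable names are purely illustrative and not taken from Oceananigans.

```julia
using MPI
using PencilFFTs
using LinearAlgebra: mul!, ldiv!

MPI.Init()
comm = MPI.COMM_WORLD

dims = (64, 64, 64)           # global grid size (illustrative)
proc_dims = (2, 2)            # 2D "pencil" decomposition over 4 MPI processes
transform = Transforms.RFFT() # real-to-complex transform along each dimension

plan = PencilFFTPlan(dims, transform, proc_dims, comm)

u  = allocate_input(plan)   # real-valued, distributed input
uF = allocate_output(plan)  # complex-valued, distributed output (separate storage)

fill!(u, 1.0)
mul!(uF, plan, u)   # forward transform
ldiv!(u, plan, uF)  # backward (inverse) transform
```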

I'm also very interested in supporting GPUs. It should be simple to add an interface for CuArrays, but I'm not sure if the data transposition functions will work without modification. If they don't work, I think a first step would be to make things work for a single GPU (avoiding all the MPI data transposition machinery), which can already be useful by itself.

For the REDFT and RODFT transforms on the GPU, maybe we can provide a similar implementation as in Oceananigans, namely compute a full c2c FFT and then extract the required DCT or DST coefficients (if I understand your implementation correctly).

ali-ramadhan commented 4 years ago

I have to work on some other stuff over the next week or so, but hoping to dig into PencilFFTs.jl next week!

Just note that for now, PencilFFTs doesn't support in-place FFTs

Ah that's unfortunate but definitely not a barrier for now. I think I'll try to get something working first (ignoring performance), but that's a good point: I'll make sure to allocate an input and an output, thanks for the heads up.

I'm also very interested in supporting GPUs. It should be simple to add an interface for CuArrays, but I'm not sure if the data transposition functions will work without modification. If they don't work, I think a first step would be to make things work for a single GPU (avoiding all the MPI data transposition machinery), which can already be useful by itself.

Ah yeah that's a good point. I'll try to have a closer look at transpose.jl to better understand how PencilFFTs.jl might work with CuArrays.

@leios has done lots of work on multi-GPU transposes and might know how things go on a GPU.

For the REDFT and RODFT transforms on the GPU, maybe we can provide a similar implementation as in Oceananigans, namely compute a full c2c FFT and then extract the required DCT or DST coefficients (if I understand your implementation correctly).

Yeah, that's essentially what we do following Makhoul (1980), so we can do without padding if we permute indices, but only for REDFT01 and REDFT10. Computing the 2D DCT this way, however, is not just the composition of two 1D transforms, and the 2D version is kinda painful to implement. I wonder if it's worth supporting these transforms, since cuFFT doesn't.
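
For reference, the 1D case of that trick can be sketched as follows (an illustration of Makhoul's REDFT10 identity, not the actual Oceananigans code): permute the input so the even-indexed samples come first and the odd-indexed ones follow in reverse, take a same-length complex FFT, then apply a twiddle factor and keep the real part.

```julia
using FFTW  # provides fft and, for the check below, r2r / REDFT10

# DCT-II (FFTW's REDFT10) of a real vector via one complex FFT of the same
# length, following Makhoul (1980): no zero-padding, just an index permutation.
function dct_via_fft(x::AbstractVector{<:Real})
    N = length(x)
    v = similar(x)
    v[1:cld(N, 2)] .= x[1:2:N]       # even-indexed samples (0-based), in order
    v[N:-1:cld(N, 2)+1] .= x[2:2:N]  # odd-indexed samples, reversed
    V = fft(v)
    # Twiddle and take the real part: C[k] = 2 Re{ exp(-iπk / 2N) V[k] }
    return [2 * real(exp(-im * π * k / (2N)) * V[k + 1]) for k in 0:N-1]
end

x = rand(8)
dct_via_fft(x) ≈ FFTW.r2r(x, FFTW.REDFT10)  # true up to round-off
```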

jipolanco commented 4 years ago

Ah that's unfortunate but definitely not a barrier for now. I think I'll try to get something working first (ignoring performance), but that's a good point: I'll make sure to allocate an input and an output, thanks for the heads up.

Actually, forget what I said. I just started working on in-place transforms, and it seems like it's going to be easy to implement. They'll probably be ready by the time you start working on this.

Ah yeah that's a good point. I'll try to have a closer look at transpose.jl to better understand how PencilFFTs.jl might work with CuArrays.

@leios has done lots of work on multi-GPU transposes and might know how things go on a GPU.

It would be great if you guys can help with the transposes on the GPU!

Yeah, that's essentially what we do following Makhoul (1980), so we can do without padding if we permute indices, but only for REDFT01 and REDFT10. Computing the 2D DCT this way, however, is not just the composition of two 1D transforms, and the 2D version is kinda painful to implement. I wonder if it's worth supporting these transforms, since cuFFT doesn't.

Right, in that case I agree with you and I'd say it's not worth it to support those kinds of transforms.

ali-ramadhan commented 4 years ago

Actually, forget what I said. I just started working on in-place transforms, and it seems like it's going to be easy to implement. They'll probably be ready by the time you start working on this.

That's awesome! I wonder how much that will improve the benchmark vs. P3DFFT (or if the benchmark is supposed to allocate).

jipolanco commented 4 years ago

I'm guessing it won't change much, since the allocated buffers are persistent (they're a field of PencilFFTPlan). In other words, allocations happen the first time the plan is executed, and then the temporary buffers are never garbage collected until the plan itself is destroyed. So I expect the cost of allocations to be negligible, even for out-of-place transforms. On the other hand, memory usage will be reduced, that's for sure.

Specifically for the P3DFFT comparisons, there's actually another problem, which is that P3DFFT v2 (i.e. the Fortran version) only does real-to-complex (r2c) transforms. For now I'm not planning on supporting in-place r2c transforms (as opposed to c2c or r2r), since they are much more complicated because both the size and the type of the data change from input to output. FFTW.jl itself doesn't support in-place r2c for the same reasons (even though there's an open PR to do this...).

ali-ramadhan commented 3 years ago

Sorry for going silent for over a year, finally started adding MPI support for Oceananigans.jl starting with just the CPU and PencilFFTs.jl worked great!

For GPU support, it seems that PencilArrays.jl might readily support CuArrays which would provide distributed transposes on GPUs.

Then CUDA.CUFFT can provide the necessary transforms. It's missing some transforms that FFTW provides (e.g. r2r DCTs and DSTs), and CUFFT only acts on batched dimensions (https://github.com/JuliaGPU/CUDA.jl/issues/119), but even adding GPU support for PencilFFTs.Transforms.FFT! would be super useful.

Would it make sense to first add GPU/CuArray transpose tests to PencilArrays.jl?
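
A rough, untested sketch of what such a test might look like, assuming PencilArrays accepts the array type as the first argument to Pencil (as its docs suggest) and that transpose! works between CuArray-backed configurations; the grid size and names are illustrative:

```julia
using MPI
using PencilArrays
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD

dims = (32, 32, 32)  # global grid size (illustrative)

# Two pencil configurations backed by CuArray: x-pencils (default decomposition
# along dimensions 2 and 3) and y-pencils (decomposition along 1 and 3).
pen_x = Pencil(CuArray, dims, comm)
pen_y = Pencil(pen_x; decomp_dims = (1, 3))

u = PencilArray{Float64}(undef, pen_x)
v = PencilArray{Float64}(undef, pen_y)
w = PencilArray{Float64}(undef, pen_x)

CUDA.rand!(parent(u))  # fill the local CuArray with random data

transpose!(v, u)       # distributed transposition: x-pencils -> y-pencils
transpose!(w, v)       # and back

@assert parent(w) == parent(u)  # the round trip should preserve the local data
```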

jipolanco commented 3 years ago

It would be great if we could add support for GPU arrays!

Yes, I think the first step would be to make sure that PencilArrays wrapping CuArrays work as expected. As you suggest, we could start by adding GPU transpose tests to PencilArrays.

From the PencilFFTs side, I think there's not much to do to support GPU arrays, other than choosing the right FFT implementation based on the array types. Some parts of the plan creation code may need to be adapted for the kind of array as well.

Lightup1 commented 2 years ago

Does it now support CuArray?

jipolanco commented 2 years ago

Hi, support for CuArray is not completely done but it shouldn't be too much work. I'll try to look at that next week.

jipolanco commented 2 years ago

@Lightup1 It would be great if you could give it a try now. It needs version 0.14 (and PencilArrays v0.17.5). See #48 for a small example.

Lightup1 commented 2 years ago

Glad to know that! I'll give it a try on multi-GPU nodes.

Lightup1 commented 2 years ago

Dear @jipolanco, I have another question about benchmarks. If I use BenchmarkTools, what is the typical and proper setup for benchmarking PencilFFTs?

jipolanco commented 2 years ago

Using BenchmarkTools with MPI is a bit tricky since processes need to be synchronised. But it's possible.

You can look at this thread, where I proposed a solution that, if I remember correctly, used to work for me.
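
For what it's worth, a simple alternative to BenchmarkTools (this is a sketch, not the solution from that thread) is to synchronise the ranks with a barrier, time the transform locally, and report the maximum over ranks. It assumes `plan`, `u` and `uF` were created as in the README example above.

```julia
using MPI
using LinearAlgebra: mul!

# Time an existing PencilFFTPlan `plan` applied to distributed arrays u -> uF,
# averaging over `nrepeat` executions. The maximum over ranks is reported,
# since the slowest process determines the overall time.
function time_forward_transform(plan, uF, u, comm; nrepeat = 100)
    mul!(uF, plan, u)  # warm-up run
    MPI.Barrier(comm)  # synchronise all ranks before starting the clock
    t = MPI.Wtime()
    for _ in 1:nrepeat
        mul!(uF, plan, u)
    end
    t = (MPI.Wtime() - t) / nrepeat          # average local time per transform
    MPI.Reduce(t, MPI.MAX, comm; root = 0)   # max time on rank 0, `nothing` elsewhere
end
```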