Algorithm for combining FFTs and tridiagonal solves

glwagner commented 2 years ago

I'd like to implement an algorithm that combines distributed in-place FFTs in (x, y) or dimensions (1, 2) with a tridiagonal solve in z (dimension 3) in Oceananigans.

Due to the presence of integrals (in addition to the tridiagonal solve) in our algorithm, the 3rd dimension must be local.

For 3D FFTs, we thus create PencilArrays which have the permutation (z, y, x) with respect to Oceananigans data structures (we do this perturbation manually / externally from PencilFFTs): https://github.com/CliMA/Oceananigans.jl/pull/2513. This seems to work, allowing us to decompose our data in (x, y) and use the algorithm 1) compute the RHS of our Poisson system, simultaneously permuting the data to (3, 2, 1) and filling the PencilArray; 2) solve the Poisson equation with FFTs (so easy thanks to PencilFFTs!); 3) "unpermute" the data and store in the Oceananigans variable. Decomposing in (x, y) is crucial because our problems (like the ocean) are often very flat.

For an algorithm that combines 2D in-place FFTs with tridiagonal solves it seems we need a slightly different approach. Talking with @simone-silvestri I think the algorithm is:

Compute the RHS of the Poisson system and store in PencilArray. The z-direction is local, like the native Oceananigans data.
Without performing an FFT, transpose the PencilArray so y is local
Perform FFT along y.
Transpose so that x is local.
Perform FFT along x.

Now the forward transform is complete, and the data has permutation (x, y, z) --- as usual, the reverse permutation that we started with for in-place transforms. But note that we need z to be local for the tridiagonal solve, so the next steps are

Reverse the transpose in step 4 so y is local.
Reverse the transpose in step 2 to z is local.
Tridiagonal solve in z.
Repeat the transpose in step 2 so y is local.
Repeat the transpose step 4 so z is local.
Backwards transform (3 more steps) to obtain the solution in physical space, with the layout (z, y, x) where z is local.

I think the pieces to write this algorithm are all here in PencilFFTs, I'm just not sure how to achieve it, so opening this issue! I'll list a few things I tried below.

glwagner commented 2 years ago

I think that steps 1-5 can be achieved with the current interface for in-place transforms. The main barrier is that I cannot specify NoTransform() along with other in-place specifications:

using MPI
using PencilFFTs
MPI.Init()

comm = MPI.COMM_WORLD

transforms = (
    PencilFFTs.Transforms.NoTransform(),
    PencilFFTs.Transforms.FFT!(),
    PencilFFTs.Transforms.FFT!()
)

plan = PencilFFTs.PencilFFTPlan((16, 32, 64), transforms, (2, 2), comm)
input = PencilFFTs.allocate_input(plan)

throws

[1] ArgumentError: cannot combine in-place and out-of-place transforms: (NoTransform, FFT!, FFT!)

This might be a trivial fix.

I guess if that is fixed, then we can also create a "plan" for the tranpositions in steps 7-8, 10-11 with transform=NoTransform(). But maybe there's a better way / other interface for just performing transpositions?

glwagner commented 2 years ago

Maybe I'm answering my own question: it looks like there's a function transpose! provided by PencilArrays. Perhaps I can call transpose! to achieve steps 1, 7-8, and 10-11? Then I just need to create an appropriate plan for the 2D transpose.

jipolanco commented 2 years ago

I think NoTransform is exactly what you're looking for!

Well, almost... There's a NoTransform! variant (note the bang!) which is compatible with in-place transforms.

jipolanco commented 2 years ago

Maybe I'm answering my own question: it looks like there's a function transpose! provided by PencilArrays. Perhaps I can call transpose! to achieve steps 1, 7-8, and 10-11? Then I just need to create an appropriate plan for the 2D transpose.

Exactly, you want to use transpose! there.

Reverse the transpose in step 4 so y is local.

Reverse the transpose in step 2 to z is local.

If I understood correctly, for 7-8 you would do:

transpose!(input[2], input[3])  # from local-x to local-y
transpose!(input[1], input[2])  # from local-y to local-z

Tridiagonal solve in z.

You might want to work with the array u = parent(input[1]) here, if you prefer to "unwrap" the PencilArray wrapper.

Repeat the transpose in step 2 so y is local.

Repeat the transpose step 4 so z is local.

transpose!(input[2], input[1])  # from local-z to local-y
transpose!(input[3], input[2])  # from local-y to local-x

Backwards transform (3 more steps) to obtain the solution in physical space, with the layout (z, y, x) where z is local.

jipolanco commented 2 years ago

But perhaps you can avoid the extra transpositions (which can be quite expensive), and work with the following order of transforms:

transforms = (
    PencilFFTs.Transforms.FFT!(),
    PencilFFTs.Transforms.FFT!(),
    PencilFFTs.Transforms.NoTransform!(),
)

If I understood correctly, in this case you wouldn't even need to permute data from (x, y, z) to (z, y, x) before doing the transforms. After such a transform, your z direction would be local, and you could directly perform the tridiagonal solves and integrals over local (and contiguous) data.

glwagner commented 2 years ago

If I understood correctly, in this case you wouldn't even need to permute data from (x, y, z) to (z, y, x) before doing the transforms. After such a transform, your z direction would be local, and you could directly perform the tridiagonal solves and integrals over local (and contiguous) data.

Ah yes! We can use that algorithm if 1) we distribute only in y, or 2) if PencilFFTs allows us to distribute data along the first dimension, so we can distribute in x, y -- I think?

It's a topic for another time, but I don't think we've found evidence that the contiguity of the z dimension yields performance benefit (despite the theoretical expectation). @christophernhill might have more to say?

EDIT: a third possibility is that we support a configuration that explicitly disallows any component of the algorithm that requires z to be local (nitty gritty: vertical integrals for hydrostatic pressure or tridiagonal solves for vertically implicit time integration are disallowed). This would require some development to support but is certainly possible (though we're still investigating whether eliminating all z-integrals from other parts of the algorithm has undesirable numerical properties, see https://github.com/CliMA/Oceananigans.jl/pull/1910). That "z-distributed" mode would then support a faster solver algorithm without extra transposes.

glwagner commented 2 years ago

I think NoTransform is exactly what you're looking for!

Well, almost... There's a NoTransform! variant (note the bang!) which is compatible with in-place transforms.

💡 oh my

jipolanco commented 2 years ago

Ah yes! We can use that algorithm if 1) we distribute only in y, or 2) if PencilFFTs allows us to distribute data along the first dimension, so we can distribute in x, y -- I think?

Ahh of course, your data is distributed in x, y, I knew I was forgetting something :) In that case yes, I completely agree.

It's a topic for another time, but I don't think we've found evidence that the contiguity of the z dimension yields performance benefit (despite the theoretical expectation).

Interesting!

glwagner commented 2 years ago

You might want to work with the array u = parent(input[1]) here, if you prefer to "unwrap" the PencilArray wrapper.

good tip!

christophernhill commented 2 years ago

If I understood correctly, in this case you wouldn't even need to permute data from (x, y, z) to (z, y, x) before doing the transforms. After such a transform, your z direction would be local, and you could directly perform the tridiagonal solves and integrals over local (and contiguous) data.

Ah yes! We can use that algorithm if 1) we distribute only in y, or 2) if PencilFFTs allows us to distribute data along the first dimension, so we can distribute in x, y -- I think?

It's a topic for another time, but I don't think we've found evidence that the contiguity of the z dimension yields performance benefit (despite the theoretical expectation). @christophernhill might have more to say?

@jipolanco - we have implemented the tridiagonal solve option for the final dimension. even though the computational operation count is lower we haven't always found that the execution time on a GPU is better. The tridiagonal solve has more data dependencies that seem to need stalling/bubbles etc... on GPU.

@glwagner the tridiagonal option is the only option I think we have in Oceananigans currently for FFT based eigenmode solver non-uniform vertical grid? I did do some 1d tests where I somewhat convinced myself there is a way to treat the FFT problem that allows a stretched grid in a single dimension. However, I haven't tried that in the Oceananigans code to prove it works. Maybe now is a good time to resurrect that?

glwagner commented 2 years ago

However, I haven't tried that in the Oceananigans code to prove it works. Maybe now is a good time to resurrect that?

We should definitely look at that in another issue! That would be cool...

@christophernhill the performance thing I was referring to is the fact that construct our arrays so that z is the slowest dimension of the array, even though we perform a few integrals in that direction. It seemed to me there was some thought behind this (ie while putting it first makes sense in theory, it doesn't often yield performance gains). Putting z last is important for the user interface (though we could always use a wrapper / indirection to make z the fast dimension but have it "look like" the third dimension).

jipolanco commented 2 years ago

EDIT: a third possibility is that we support a configuration that explicitly disallows any component of the algorithm that requires z to be local (nitty gritty: vertical integrals for hydrostatic pressure or tridiagonal solves for vertically implicit time integration are disallowed). This would require some development to support but is certainly possible (though we're still investigating whether eliminating all z-integrals from other parts of the algorithm has undesirable numerical properties, see CliMA/Oceananigans.jl#1910). That "z-distributed" mode would then support a faster solver algorithm without extra transposes.

I understand that distributing over (x, y) makes a lot of sense for the kind of flat geometry common to geophysical applications. I'm thinking that it might be possible to transform data distributed over (x, y). For this, you need to consider the data that enters the Poisson solver as the output of a PencilFFTs transform. That is, your transform needs to be a backwards FFT.

Very roughly, you would have something like:

transforms = (
    PencilFFTs.Transforms.BFFT!(),
    PencilFFTs.Transforms.BFFT!(),
    PencilFFTs.Transforms.NoTransform!(),
)

plan = PencilFFTPlan(dims, transforms, ...)

input = PencilFFTs.allocate_input(plan)  # this can stay the same for in-place plans

input[3] = ...  # (**) copy data to transform output
plan \ input   # perform the inverse of a backward FFT (=> forward FFT!)

# transformed data is in input[1] (distributed in y, z)

I still need to think this through, but I hope it makes sense. Just note that, in the (**) step, data might need to be permuted so that the z dimension is contiguous (unless permutations are disabled when calling PencilFFTPlan). Hopefully the partition tool that we discussed can help automate this whole thing.

glwagner commented 2 years ago

@kburns pointed out that this:

Reverse the transpose in step 4 so y is local.

Reverse the transpose in step 2 to z is local.

Tridiagonal solve in z.

Repeat the transpose in step 2 so y is local.

Repeat the transpose step 4 so z is local.

Backwards transform (3 more steps) to obtain the solution in physical space, with the layout (z, y, x) where z is local.

can be done with fewer transposes. Recall the configuration is (x, y, z), so we can swap x, z:

Transpose x and z to obtain (z, y, x)
Tridiagonal solve in z
Transpose z and x to obtain (x, y, z)
Backwards transform (3 more steps).

So that reduces the number of required transposes by 2 (to 6). But it does require an extra temporary array.

jipolanco commented 2 years ago

That's right, that sounds like a better alternative.

Right now I'm not sure if this is possible due to (artificial) limitations in PencilArrays:

when written as a sorted tuple, the decomposed dimensions must be almost the same, with at most one difference. For instance, if the input of a 3D dataset is decomposed in (2, 3), then the output may be decomposed in (1, 3), but not in (1, 2). If the decomposed dimensions are the same, then no transposition is performed, and data is just copied if needed.

This means that, starting from the configuration (2, 3) (i.e. "local in x"), right now it's possible to transpose to (1, 3) (local in y) but not to (2, 1) (local in z). But it should be hopefully quite easy to lift this restriction to allow transpositions such as (2, 3) -> (2, 1), since the implementation is basically the same. I'll see what I can do.

glwagner commented 2 years ago

No rush for sure! You could always add a fifth temporary and perform a local transpose before doing the distributed transpose (into the "fourth" temporary) to work around that, probably at minimal cost...

jipolanco commented 2 years ago

It just needed some very minor changes :)

https://github.com/jipolanco/PencilArrays.jl/pull/57

I'll merge this and tag PencilArrays v0.17 once tests pass.

Basically, as shown in the example in that PR, you will need to create a new Pencil configuration with decomp_dims = (2, 1), and allocate an extra array in the new configuration to hold the result of the transposition. Let me know if things are unclear!

glwagner commented 2 years ago

Cool!

jipolanco / PencilFFTs.jl

Algorithm for combining FFTs and tridiagonal solves #44