jipolanco / PencilFFTs.jl

Fast Fourier transforms of MPI-distributed Julia arrays
https://jipolanco.github.io/PencilFFTs.jl/dev/
MIT License

Creation of 0D PencilArrays when global size cannot be factored by processor config #43

Open glwagner opened 2 years ago

glwagner commented 2 years ago

I'm not sure I've quite wrapped my head around how this is all supposed to work, but it seems that when I have a local first dimension of length 1 I obtain unusable results. I'm using the third interface in the docs:

https://jipolanco.github.io/PencilFFTs.jl/dev/PencilFFTs/#PencilFFTs.PencilFFTPlan

which (I thought) allows me to specify the number of processes in the 2nd and 3rd dimensions (but I might be confused).

This code:

using MPI
using PencilFFTs
MPI.Init()

comm = MPI.COMM_WORLD
plan = PencilFFTs.PencilFFTPlan((1, 8, 44), PencilFFTs.Transforms.FFT!(), (2, 2), comm)
input = PencilFFTs.allocate_input(plan)

msg = [MPI.Comm_rank(comm),
       size(input[1])...,
       size(input[2])...,
       size(input[3])...]

function showmsg(msg)
    rank = msg[1]
    size1 = msg[2:4]
    size2 = msg[5:7]
    size3 = msg[8:10]

    msg = string("rank: ", rank,
               " size 1: ", size1,
               " size 2: ", size2,
               " size 3: ", size3)

    @show msg
end

messages = MPI.Gather(msg, 0, comm)

if MPI.Comm_rank(MPI.COMM_WORLD) == 0
    msg0 = messages[1:10]
    msg1 = messages[11:20]
    showmsg(msg0)
    showmsg(msg1)
end

when run with

$ mpiexec -n 2 julia --project mpitest.jl

produces

msg = "rank: 0 size 1: [1, 4, 22] size 2: [0, 8, 22] size 3: [0, 4, 44]"
msg = "rank: 1 size 1: [1, 4, 22] size 2: [0, 8, 22] size 3: [0, 4, 44]"

So we end up with arrays of size 0. Are singleton dimensions not handled correctly (a bug?), or am I not using PencilFFTs correctly? I expected proc_dims = (2, 2) to use two processes for each of the 2nd and 3rd dimensions (which seems to have happened for first(input)), but input[2] and input[3] have size 0 and so can't be used for computation?

Basically, I need to write generic code that can reduce to the 2D case when one or more dimensions are singleton, but also remain valid for 3D cases.

jipolanco commented 2 years ago

Hi, are you sure you ran your example with 2 MPI processes? I say this because proc_dims = (2, 2) means that the domain should be partitioned over a total of 2×2 = 4 processes (and along 2 directions at a time). For that reason, running that example with a number of processes different from 4 should give you an error:

ArgumentError: size of communicator (2) is different from size of Cartesian topology (4)

The results you get are the same as what I obtain when running with 4 MPI processes (mpirun -n 4 ...). If, instead of printing just the first two ranks, I print all of them, I get:

msg = "rank: 0 size 1: [1, 4, 22] size 2: [0, 8, 22] size 3: [0, 4, 44]"
msg = "rank: 1 size 1: [1, 4, 22] size 2: [0, 8, 22] size 3: [0, 4, 44]"
msg = "rank: 2 size 1: [1, 4, 22] size 2: [1, 8, 22] size 3: [1, 4, 44]"
msg = "rank: 3 size 1: [1, 4, 22] size 2: [1, 8, 22] size 3: [1, 4, 44]"

which means that the data is in ranks 2 and 3.

Of course I agree this is not ideal since some processes are left with no data, and that ideally we shouldn't try to distribute data over singleton dimensions. The thing is that PencilFFTs wasn't designed with singleton dimensions in mind, and up to now there hasn't been an interest in this kind of usage. It might be possible to make some improvements, but I have to think of the proper way of doing this without affecting type stability.

That said, generically working with arbitrary numbers of dimensions is possible and is actually an important feature of PencilArrays / PencilFFTs. It's just that the way to do it is to pass a different number of dimensions. In your example, you would just do:

nproc = MPI.Comm_size(comm)
plan = PencilFFTs.PencilFFTPlan((8, 44), PencilFFTs.Transforms.FFT!(), (nproc,), comm)

and then work with 2D arrays everywhere. Note that in this case data is distributed over a single dimension at a time.
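
For reference, a minimal complete version of that 2D script could look something like this (just a sketch; first and last give the input- and output-side configurations of the in-place arrays):

using MPI
using PencilFFTs
MPI.Init()

comm = MPI.COMM_WORLD
nproc = MPI.Comm_size(comm)

# 2D global grid, decomposed along a single dimension over all processes.
plan = PencilFFTs.PencilFFTPlan((8, 44), PencilFFTs.Transforms.FFT!(), (nproc,), comm)
input = PencilFFTs.allocate_input(plan)

# Local sizes now depend only on how the single decomposed dimension splits over nproc.
@show MPI.Comm_rank(comm) size(first(input)) size(last(input))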

More concretely, the Navier-Stokes tutorial (which I'm aware may lack some explanations) is mostly written in a dimension-generic way. Apart from the initialisation steps, almost everything is dimension-independent, and it would be quite straightforward to convert the whole tutorial to 2D. This is actually how I would write a NS code that could work both in 2D and 3D.

To be fair, it should be possible to make the PencilFFTPlan interface easier / more intuitive to use. For instance, we could give a reasonable default value to the proc_dims argument, as it is already done in the Pencil constructors.

glwagner commented 2 years ago

Ah sorry, I bungled the example a bit.

I think your output still illustrates the issue, though, because there are arrays with size 0?

In your example, you would just do:

nproc = MPI.Comm_size(comm)
plan = PencilFFTs.PencilFFTPlan((8, 44), PencilFFTs.Transforms.FFT!(), (nproc,), comm)

and then work with 2D arrays everywhere.

Right, I see! If we are distributed in x, y, this also requires that we "pre-process" with a transpose from the configuration with a singleton local dimension into a configuration with the non-singleton dimension local, correct? Just to motivate (and maybe it's already clear): this is a generic situation in geophysics. What we have is a system that is extremely flat (both physically and computationally), with huge dimensions in x, y but a relatively thin z dimension (a large example is 16,000 by 8,000 by 100). In addition, the entire algorithm boils down to:

  1. Expensive quasi-local 3D tendency evaluations (best distributed in x, y, evaluated per process communicating only boundary data).
  2. A few medium expensive integrals or tridiagonal solves in z (best if z is local).
  3. One medium expensive two-dimensional solve in x, y, which requires global communication + FFTs (depending on the configuration, the FFT may just be a preconditioner in an iterative solve). This single reduced-dimensionality solve can be quite expensive compared to the rest, between 10 and 50% of the total cost of a time step. Ideally, though, it's on the lower end of that, and our computation is dominated by 3D tendency evaluation.

Another issue (which is maybe an Oceananigans design issue, but which will be very difficult to change) is that arrays are always 3D. When we do a 2D problem, we reduce the size of a dimension to 1 (and also often tag that direction with a type to indicate it has been Flattened). So a problem that is 2D in (x, z) has the dimensions (Nx, 1, Nz). For 2D FFTs embedded in a 3D problem, the 2D data has layout (Nx, Ny, 1) with both Nx and Ny distributed. Every now and then I wonder if this is the right design (there are tradeoffs...)

I think I understand that you're saying this problem can be solved now with the framework provided. It seemed to me (but everything is definitely not crystal clear for me yet...) that I might have to implement different solvers for different dimensionality that mix transposes and transforms according to PencilFFT rules: 3 for xy, xz, and yz, and then one for xyz, using the appropriate layouts and specifications in each case.

But

More concretely, the Navier-Stokes tutorial (which I'm aware may lack some explanations) is mostly written in a dimension-generic way.

maybe there is another solution for writing relatively simple Oceananigans code that can support the full spectrum of configurations. I'll continue to self-educate...

glwagner commented 2 years ago

Feel free to close this if you don't think it's feasible to support singleton dimensions!

jipolanco commented 2 years ago

I think your output still illustrates the issue, though, because there are arrays with size 0?

I agree! It would be good to handle this case in a more intelligent way.

When we do a 2D problem, we reduce the size of a dimension to 1 (and also often tag that direction with a type to indicate it has been Flattened). So a problem that is 2D in (x, z) has the dimensions (Nx, 1, Nz).

In that case, maybe it would be possible to reshape the arrays to (Nx, Nz) just for the Poisson solver? I'd imagine the Flat tag can help with this (e.g. dropping singleton dimensions in a static way). I guess this might not be as easy as it sounds though...
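
As a toy illustration of the reshape idea with plain Arrays (the distributed case would of course need more care), the 2D view shares memory with the 3D array, so no copy is involved:

Nx, Nz = 16, 8
u3 = rand(Nx, 1, Nz)         # 2D problem stored with a singleton y dimension
u2 = reshape(u3, Nx, Nz)     # (Nx, Nz) view of the same data for the 2D solve
u2[1, 1] = 42.0
@assert u3[1, 1, 1] == 42.0  # same underlying storage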

I think I understand that you're saying this problem can be solved now with the framework provided. It seemed to me (but everything is definitely not crystal clear for me yet...) that I might have to implement different solvers for different dimensionality that mix transposes and transforms according to PencilFFT rules: 3 for xy, xz, and yz, and then one for xyz, using the appropriate layouts and specifications in each case.

Ideally you should be able to write a single solver, with perhaps some "glue" code to be able to deal with the different possible layouts. You can for example have a single call to PencilFFTPlan in your code, and have a function that generates its arguments (dims_global, transforms, proc_dims) based on the type of geometry you have. After quickly looking at the Oceananigans docs, I can imagine something like:

pencilfft_args(grid::RectilinearGrid{<:Any, Periodic, Periodic, Periodic}) = ...
pencilfft_args(grid::RectilinearGrid{<:Any, Periodic, Periodic, Flat}) = ...
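
As a rough, self-contained sketch of that idea (using stand-in topology and grid types instead of the real Oceananigans ones, and an arbitrary choice of proc_dims; not meant as the definitive interface):

using MPI
using PencilFFTs

# Stand-in types; in practice these would be Oceananigans' topologies and grid.
struct Periodic end
struct Flat end
struct Grid{TX, TY, TZ}
    size::NTuple{3, Int}
end

# Fully 3D, triply periodic: transform all three dimensions, 2D decomposition.
function pencilfft_args(grid::Grid{Periodic, Periodic, Periodic}, comm)
    P = MPI.Comm_size(comm)
    return grid.size, PencilFFTs.Transforms.FFT!(), (P, 1)  # or any 2-factor split of P
end

# Flat in z: drop the singleton dimension and build a 2D plan instead.
function pencilfft_args(grid::Grid{Periodic, Periodic, Flat}, comm)
    Nx, Ny, _ = grid.size
    return (Nx, Ny), PencilFFTs.Transforms.FFT!(), (MPI.Comm_size(comm),)
end

# Usage: plan = PencilFFTs.PencilFFTPlan(pencilfft_args(grid, comm)..., comm)
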
glwagner commented 2 years ago

Ok, I'll get to work on that and see what comes out!

glwagner commented 2 years ago

With @jipolanco's gracious help I'm starting to understand the logic. The problem is that configurations need to be "factorable" (somehow) in order for the work to be distributed amongst all processes (it's not actually a problem with singleton dimensions?). I rewrote the above script:

using MPI
using PencilFFTs
MPI.Init()

comm = MPI.COMM_WORLD
plan = PencilFFTs.PencilFFTPlan((2, 2, 64), PencilFFTs.Transforms.FFT!(), (2, 4), comm)
input = PencilFFTs.allocate_input(plan)

msg = [MPI.Comm_rank(comm),
       size(parent(input[1]))...,
       size(parent(input[2]))...,
       size(parent(input[3]))...]

function showmsg(msg)
    rank = msg[1]
    size1 = msg[2:4]
    size2 = msg[5:7]
    size3 = msg[8:10]

    msg = string("rank: ", rank,
               " size 1: ", size1,
               " size 2: ", size2,
               " size 3: ", size3)

    @show msg
end

messages = MPI.Gather(msg, 0, comm)

if MPI.Comm_rank(MPI.COMM_WORLD) == 0
    msg0 = messages[1:10]
    msg1 = messages[11:20]
    showmsg(msg0)
    showmsg(msg1)
end

which yields

$ mpiexec -n 8 julia --project mpitest.jl
msg = "rank: 0 size 1: [2, 1, 16] size 2: [2, 1, 16] size 3: [64, 0, 1]"
msg = "rank: 1 size 1: [2, 1, 16] size 2: [2, 1, 16] size 3: [64, 1, 1]"

The first transform (along x) is length 2, distributed among 8 processes (2 in y, 4 in z). The second transform is also length 2 (along y), and can be divided among 8 (2 in x, 4 in z). But the third transform must divide a 2x2 region in (x, y) among 8 processes --- not possible. In other words, the processor configuration must be able to factor each pencil (yz, xz, xy, in that order). Here the xy pencil cannot be factored by (2, 4).

Therefore, the only solution for this configuration is to let 4 processes sit idle while the 4 pencils are handled by just 4 out of the 8 total processes --- right? I believe this is the essential logic to the issue of the first post.
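
To spell out the arithmetic (a rough check with floor division; the actual partitioning hands the leftover points to some of the ranks when sizes don't divide evenly):

dims = (2, 2, 64); P1, P2 = 2, 4
dims[2] ÷ P1, dims[3] ÷ P2   # x-pencils: (1, 16) -> fine
dims[1] ÷ P1, dims[3] ÷ P2   # y-pencils: (1, 16) -> fine
dims[1] ÷ P1, dims[2] ÷ P2   # z-pencils: (1, 0)  -> zero-size local arrays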

Note also that singleton dimensions are sometimes ok, because

plan = PencilFFTs.PencilFFTPlan((1, 2, 64), PencilFFTs.Transforms.FFT!(), (1, 2), comm)

produces a successful decomposition:

$ mpiexec -n 2 julia --project mpitest.jl
msg = "rank: 0 size 1: [1, 2, 32] size 2: [2, 1, 32] size 3: [64, 1, 1]"
msg = "rank: 1 size 1: [1, 2, 32] size 2: [2, 1, 32] size 3: [64, 1, 1]"

In this case, the requested configuration (1, 2) can factor all cases (EDIT: is this surprising?). Just to hammer it home (for me...)

plan = PencilFFTs.PencilFFTPlan((1, 2, 64), PencilFFTs.Transforms.FFT!(), (2, 1), comm)
$ mpiexec -n 2 julia --project mpitest.jl
msg = "rank: 0 size 1: [1, 1, 64] size 2: [2, 0, 64] size 3: [64, 2, 0]"
msg = "rank: 1 size 1: [1, 1, 64] size 2: [2, 1, 64] size 3: [64, 2, 1]"

This last case is interesting, because the second decomposition uses the permutation (y, x, z), which fails because (2, 1) cannot decompose the size (1, 2). But shouldn't it be possible to compute this problem if the intermediate configuration has the permutation (y, z, x)? Perhaps I'm missing something.

It's a generic "issue" that not all configurations admit even distribution of work?

I think this issue could become useful in 1 or 2 ways:

  1. Write documentation to explain this perhaps (I can help!)
  2. Perhaps, throw an interpretable error (with link to docs) when a non-factorable configuration is requested. This solves the problem which led to the issue, in which a mysterious "array cannot be broadcasted" error arises instead (you can't do anything with 0D arrays...).

A few other things that may be achievable:

  1. Attempt to find successful non-default transposition algorithms when the default ordering doesn't work
  2. (not really necessary) implement a feature whereby some processes are allowed to idle while the non-factorable configuration is transformed.

That last feature makes sense to support for applications where the memory decomposition is driven by factors other than the cost of FFT + transpositions (but where FFTs are still essential to the algorithm). But if I'm not mistaken, it's also an edge case that in principle can be implemented as a special case on the user side.

jipolanco commented 2 years ago

The first transform (along x) is length 2, distributed among 8 processes (2 in y, 4 in z). The second transform is also length 2 (along y), and can be divided among 8 (2 in x, 4 in z). But the third transform must divide a 2x2 region in (x, y) among 8 processes --- not possible. In other words, the processor configuration must be able to factor each pencil (yz, xz, xy, in that order). Here the xy pencil cannot be factored by (2, 4).

Therefore, the only solution for this configuration is to let 4 processes sit idle while the 4 pencils are handled by just 4 out of the 8 total processes --- right? I believe this is the essential logic to the issue of the first post.

Yes, that's exactly it. It's not really a problem with singleton dimensions. Sorry if it wasn't very clear previously... I agree that some docs on this would be very helpful.

More generally, it's the problem of partitioning a 1D grid of N points over P processes. When N is not divisible by P, the result will necessarily be unbalanced (with some processes having more data than others). The case N < P is an extreme case of this, in which some processes end up with no data at all.
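
Roughly, block partitioning looks something like this (a simplified sketch just to illustrate the imbalance; the exact assignment of the leftover points may differ from what PencilArrays does internally):

# Split N points over P ranks (r is the 0-based rank); here the last `rem`
# ranks get one extra point when P does not divide N.
function local_range(N, P, r)
    q, rem = divrem(N, P)
    counts = [q + (i > P - rem) for i in 1:P]
    stop = sum(counts[1:r+1])
    return (stop - counts[r+1] + 1):stop
end

[local_range(2, 4, r) for r in 0:3]  # N = 2 < P = 4: two ranks get one point each, two get none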

This last case is interesting, because the second decomposition uses the permutation (y, x, z), which fails because (2, 1) cannot decompose the size (1, 2). But shouldn't it be possible to compute this problem if the intermediate configuration has the permutation (y, z, x)? Perhaps I'm missing something.

This should be possible, but I don't know if it's that easy to do in a generic way. Roughly speaking, the problem is that there's a logic behind the (Py, Pz) = (2, 1) -> (Px, Pz) = (2, 1) transposition that makes it difficult to allow a (Px, Pz) = (1, 2) intermediate configuration. The basic idea is that it's easier (and generally requires less communication) to leave the value of Pz untouched for this intermediate decomposition.

To see this, first consider your example with (Py, Pz) = (2, 1):

[figure: pencil configurations for the (Py, Pz) = (2, 1) decomposition]

Here, both intermediate configurations (2, 1) and (1, 2) seem to be possible. However, it is difficult to extend this to a more generic case. Consider now an initial partitioning (Py, Pz) = (2, 3):

[figure: pencil configurations for the (Py, Pz) = (2, 3) decomposition]

I think it's easy to see that the current intermediate configuration (Px, Pz) = (2, 3) only requires communications between subgroups of MPI processes. For example, process 1 only needs to communicate with process 2. And these subgroups are quite easy to construct and reason about in MPI.

Meanwhile, the alternative configuration (Px, Pz) = (3, 2) would be much more complicated. In this case, process 1 will need to send data to processes 3 and 5, and will receive data from processes 2, 3 and 4. This means more communication and added complexity.

A better solution would be to document the intermediate and output configurations obtained from a given input configuration, which I'm aware is not clear right now.

  1. Write documentation to explain this perhaps (I can help!)

That would be great, your help is more than welcome here!

2. Perhaps, throw an interpretable error (with link to docs) when a non-factorable configuration is requested. This solves the problem which led to the issue, in which a mysterious "array cannot be broadcasted" error arises instead (you can't do anything with 0D arrays...).

Yes, that's a good idea. I would do this when one ends up with 0D arrays. I hadn't thought about the specific problem with broadcasting.

  1. Attempt to find successful non-default transposition algorithms when the default ordering doesn't work

Definitely, I think it's possible to automate the selection of a "good" ordering. The best would be for a user to be able to say "I want to partition the domain along 2 dimensions" and then for us to choose the best decomposition.
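
Something along these lines, for example (a hypothetical helper, not an existing PencilFFTs function; it brute-forces the factorisations of P and keeps the one whose worst pencil configuration has the largest minimum local size, assuming the factors map onto dimensions as discussed above):

# Hypothetical helper: choose (P1, P2) so that, if possible, none of the three
# pencil configurations ends up with zero-size local arrays.
function choose_proc_dims(dims::NTuple{3, Int}, P::Int)
    best, best_min = (P, 1), -1
    for P1 in 1:P
        P % P1 == 0 || continue
        P2 = P ÷ P1
        # smallest local extent over the x-, y- and z-pencil configurations
        m = min(dims[2] ÷ P1, dims[3] ÷ P2,   # x-pencils
                dims[1] ÷ P1, dims[3] ÷ P2,   # y-pencils
                dims[1] ÷ P1, dims[2] ÷ P2)   # z-pencils
        if m > best_min
            best, best_min = (P1, P2), m
        end
    end
    return best
end

choose_proc_dims((1, 8, 44), 4)  # -> (1, 4): no pencil configuration is left empty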

2. (not really necessary) implement a feature whereby some processes are allowed to idle while the non-factorable configuration is transformed.

That's also an option, though I'm not sure if it's worth the extra complexity.