proc_setup_min_surface is missing some edge cases?

ahojukka5 commented 2 months ago

It looks like proc_setup_min_surface is missing some edge cases.

    // set initial guess, probably the worst grid but a valid one
    std::valarray<index> best_grid = {1, 1, num_procs};

Is this in all cases valid? For example, if one wants to do 1d or 2d FFT, would the last dimension be 1?

    for(int i=1; i<=num_procs; i++){
        if (num_procs % i == 0){
            int const remainder = num_procs / i;
            for(int j=1; j<=remainder; j++){
                if (remainder % j == 0){
                    std::valarray<index> candidate_grid = {i, j, remainder / j};
                    index const candidate_surface = surface(candidate_grid);
                    if (candidate_surface < best_surface){
                        best_surface = candidate_surface;
                        best_grid    = candidate_grid;
                    }
                }
            }
        }
    }

Here we iterate up to num_procs or num_procs/i.Should we somehow consider the simulation domain's size? For example, with the following changes to heffte_example_r2c:

--- a/examples/heffte_example_r2c.cpp
+++ b/examples/heffte_example_r2c.cpp
@@ -34,8 +34,8 @@ void compute_dft(MPI_Comm comm){

     // using problem with size 20x20x20 problem, the computed indexes are 11x20x20
     // direction 0 is chosen to reduce the number of indexes
-    heffte::box3d<> real_indexes({0, 0, 0}, {19, 19, 19});
-    heffte::box3d<> complex_indexes({0, 0, 0}, {10, 19, 19});
+    heffte::box3d<> real_indexes({0, 0, 0}, {19, 0, 0});
+    heffte::box3d<> complex_indexes({0, 0, 0}, {10, 0, 0});

     // check if the complex indexes have correct dimension
     assert(real_indexes.r2c(r2c_direction) == complex_indexes);
@@ -50,6 +50,11 @@ void compute_dft(MPI_Comm comm){
     // the proc_grid is chosen to minimize the real data, but use for both real and complex cases
     std::array<int, 3> proc_grid = heffte::proc_setup_min_surface(real_indexes, num_ranks);

+    // print proc_grid
+    if (me == 0){
+        std::cout << "The processor grid is: " << proc_grid[0] << " x " << proc_grid[1] << " x " << proc_grid[2] << std::endl;
+    }
+
     std::vector<heffte::box3d<>> real_boxes    = heffte::split_world(real_indexes,    proc_grid);
     std::vector<heffte::box3d<>> complex_boxes = heffte::split_world(complex_indexes, proc_grid);

We get grid 1 x 2 x 4:

using backend: stock
The global input contains 20 real indexes.
The global output contains 11 complex indexes.
The processor grid is: 1 x 2 x 4
rank 0 computed error: 1.144409e-05
rank 1 computed error: 0.000000e+00
rank 2 computed error: 0.000000e+00
rank 3 computed error: 0.000000e+00
rank 4 computed error: 0.000000e+00
rank 5 computed error: 0.000000e+00
rank 6 computed error: 0.000000e+00
rank 7 computed error: 0.000000e+00

Whereas doing other way

--- a/examples/heffte_example_r2c.cpp
+++ b/examples/heffte_example_r2c.cpp
@@ -30,12 +30,12 @@ void compute_dft(MPI_Comm comm){
     if (me == 0) std::cout << "using backend: " << heffte::backend::name<backend_tag>() << "\n";

     // the dimension where the data will shrink
-    int r2c_direction = 0;
+    int r2c_direction = 2;

     // using problem with size 20x20x20 problem, the computed indexes are 11x20x20
     // direction 0 is chosen to reduce the number of indexes
-    heffte::box3d<> real_indexes({0, 0, 0}, {19, 19, 19});
-    heffte::box3d<> complex_indexes({0, 0, 0}, {10, 19, 19});
+    heffte::box3d<> real_indexes({0, 0, 0}, {0, 0, 19});
+    heffte::box3d<> complex_indexes({0, 0, 0}, {0, 0, 10});

     // check if the complex indexes have correct dimension
     assert(real_indexes.r2c(r2c_direction) == complex_indexes);
@@ -50,6 +50,11 @@ void compute_dft(MPI_Comm comm){
     // the proc_grid is chosen to minimize the real data, but use for both real and complex cases
     std::array<int, 3> proc_grid = heffte::proc_setup_min_surface(real_indexes, num_ranks);

+    // print proc_grid
+    if (me == 0){
+        std::cout << "The processor grid is: " << proc_grid[0] << " x " << proc_grid[1] << " x " << proc_grid[2] << std::endl;
+    }
+
     std::vector<heffte::box3d<>> real_boxes    = heffte::split_world(real_indexes,    proc_grid);
     std::vector<heffte::box3d<>> complex_boxes = heffte::split_world(complex_indexes, proc_grid);

The program crashes:

using backend: stock
The global input contains 20 real indexes.
The global output contains 11 complex indexes.
The processor grid is: 2 x 2 x 2
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
  what():  terminate called after throwing an instance of 'std::runtime_error'
  what():  Cannot split the given number of indexes into the given set of mpi-ranks. Most liklely, the number of indexes is too small compared to the number of mpi-ranks.terminate called after throwing an instance of 'std::runtime_error'

I tried to fix this with the following change:

--- a/include/heffte_geometry.h
+++ b/include/heffte_geometry.h
@@ -646,8 +646,8 @@ inline std::array<int, 3> proc_setup_min_surface(box3d<index> const world, int n
     // using valarrays that work much like vectors, but can perform basic
     // point-wise operations such as addition, multiply, and division
     std::valarray<index> all_indexes = {world.size[0], world.size[1], world.size[2]};
-    // set initial guess, probably the worst grid but a valid one
-    std::valarray<index> best_grid = {1, 1, num_procs};
+    // set initial guess
+    std::valarray<index> best_grid = {1, 1, 1};

     // internal helper method to compute the surface
     auto surface = [&](std::valarray<index> const &proc_grid)->
@@ -656,18 +656,22 @@ inline std::array<int, 3> proc_setup_min_surface(box3d<index> const world, int n
             return ( box_size * box_size.cshift(1) ).sum();
         };

-    index best_surface = surface({1, 1, num_procs});
+    index best_surface = std::numeric_limits<index>::max();

-    for(int i=1; i<=num_procs; i++){
+    int const i_max = std::min((index)num_procs, all_indexes[0]);
+    for(int i=1; i<=i_max; i++){
         if (num_procs % i == 0){
-            int const remainder = num_procs / i;
-            for(int j=1; j<=remainder; j++){
-                if (remainder % j == 0){
-                    std::valarray<index> candidate_grid = {i, j, remainder / j};
-                    index const candidate_surface = surface(candidate_grid);
-                    if (candidate_surface < best_surface){
-                        best_surface = candidate_surface;
-                        best_grid    = candidate_grid;
+            int const j_max = std::min((index)(num_procs / i), all_indexes[1]);
+            for(int j=1; j<=j_max; j++){
+                if (j_max % j == 0){
+                    int const k = j_max / j;
+                    if (k <= all_indexes[2]){
+                        std::valarray<index> candidate_grid = {i, j, k};
+                        index const candidate_surface = surface(candidate_grid);
+                        if (candidate_surface < best_surface){
+                            best_surface = candidate_surface;
+                            best_grid    = candidate_grid;
+                        }
                     }
                 }
             }

But whereas this modification seems to give a good processor grid (at least for "1d corner case"), splitting the world is still failing.

mkstoyanov commented 2 months ago

I just came home back from vacation and I will need some time to walk over the details of your code.

However, the 1D case is not a valid use case for distributed FFTs within the heFFTe framework. The 3D and 2D cases can be reduced to batches of 1D FFTs, heFFTe reshapes the data so that batches of 1D transforms can be done entirely within an MPI rank. Using a huge distributed 1D signal would require an entirely different algorithm and an enormous amount of communication. I am not aware of applications or codes that use distributed FFTs, not that I know all there is out there.

The only realistic way to handle a large 1D transform is to move the data on one MPI rank (assuming it even fits in memory) and call the backend.

The proc_setup_min_surface() came from a specific application for 3D FFTs, it is possible that we missed the 2D case. I will look into it once I settle back into the office.

ahojukka5 commented 2 months ago

I acknowledge that the relevance of distributing 1D FFT is questionable, as it's most likely very fast to calculate using just one process. However, it might have some relevance in academic examples and very simplified models, initially implemented in 1D just to confirm that things are working before moving to 2D and 3D.

It is relatively easy to modify proc_setup_min_surface to work well with all dimensions. I suggest either making it work in this manner or throwing an error if the size of the domain in two dimensions is 1. It should not simply produce an incorrect configuration without any warnings as that's something users do not expect.

ahojukka5 commented 2 months ago

I created PR about this issue so that you can more easily examine the code.

mkstoyanov commented 2 months ago

I see that we baked the large problem assumption into proc_setup_min_surface() and we should fix that.

The PR is very clear but while fixing the issue it will not ensure that the produced grid has the correct number of MPI ranks. Check out #52

mkstoyanov commented 2 months ago

resolved in #51

icl-utk-edu / heffte

proc_setup_min_surface is missing some edge cases? #50