stfc / RegentParticleDSL

A particle-method DSL based on Regent programming language
MIT License

Pairwise task performance in HP tradequeues #95

Open LonelyCat124 opened 3 years ago

LonelyCat124 commented 3 years ago

Right now the main limiting factor of the performance of the code for the DL_MESO example is the pairwise task performance (and the inability to trace the tasks, but that's a separate issue which requires input from the Legion team).

Right now, the basis of the algorithm is:

  1. Divide the domain into cells with an average occupancy of >400/800 particles.
  2. For each cell in the system, perform the self task in an n^2 way (a sketch of this task is given below).
  3. For each pair of cells, perform the cross-cell interactions in an n^2 way.
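
To make the cost concrete, here is a minimal sketch of what the n^2 self task in step 2 looks like (the field names x/y/z and fx/fy/fz are stand-ins for the real particle field space, and the kernel itself is omitted):

__demand(__leaf)
task self_task_sketch( cell : region(ispace(int1d), part), cutoff2 : double )
where reads(cell.{x, y, z}), reduces+(cell.{fx, fy, fz}) do
  for i in cell.ispace do
    for j in cell.ispace do
      if i ~= j then
        var dx = cell[i].x - cell[j].x
        var dy = cell[i].y - cell[j].y
        var dz = cell[i].z - cell[j].z
        if dx*dx + dy*dy + dz*dz <= cutoff2 then
          --in range: apply the pairwise kernel via the reduction privileges
        end
      end
    end
  end
end

Every pair of particles in the cell is visited, so the cost grows quadratically with occupancy regardless of how many pairs are actually within r_cut.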

This results in a couple of contrasting issues:

  1. Individual tasks are very expensive. Performing a naive n^2 operation at ~400 particles takes multiple milliseconds, and with the volume of cells in a large simulation (e.g. 512k particles) there are ~28k tasks per launch, which means the runtime is guaranteed to be in the ~100s timeframe. With smaller cells, where r_cell ~= r_cut, the execution time of individual tasks is much lower. Right now in DL_MESO for the large testcase the cell dimensions are ~10.772174 whilst r_cut is 1.0, meaning that for an average cell pair almost none of the particles involved are actually in range!
  2. With smaller tasks, the overheads of launching tasks become increasingly problematic. This can be seen by reducing the particles-per-cell count to ~200, at which point the runtime dramatically increases. This is for 2 reasons: 1) the tradequeue_pull and tradequeue_push tasks are minute, and overheads cripple their performance; 2) the actual compute tasks also get eaten up by runtime overheads (TBC).

The likely solution is then two-fold: 1) grouping task launches so they are not simply cell-to-cell, but group-of-cells to cell or group-of-cells to group-of-cells; 2) an improved neighbour search algorithm (such as sorting particles) to keep cell sizes larger.

LonelyCat124 commented 3 years ago

I'm still trying to find concrete evidence of the 2nd issue above for the pairwise tasks.

One thing that is particularly notable is that the tradequeue tasks perform horribly with few particles per cell - the gap between executions massively outscales the execution time of the tradequeue tasks themselves.

One solution is to attempt to add more work to tradequeue tasks somehow (e.g. particle sorting).

LonelyCat124 commented 3 years ago

My first thought on cell grouping is something like modifying pair tasks to:

__demand(__leaf)
task pairwise_task( supercell1 : region(ispace(int1d), part),
                    supercell2 : region(ispace(int1d), part),
                    config : region(ispace(int1d), part),
                    allparts : region(ispace(int1d), part),
                    cell_partition : partition(disjoint, allparts, ispace(int3d)),
                    cell1 : int3d, cell2 : int3d )
where reads(supercell1...), reads(supercell2...), reduces+(supercell1...), reads(config...), reduces+(config...) do

--Loop through subcells inside supercell1 and interact with subcells in supercell2 in a "smart" way
end

and having 2 "cell partitions", based on 2 separate cell indexes in particles:

fspace neighbour_part{

 cell_id : int3d,
 supercell_id : int3d,
 x_cell : int1d,
 _valid : bool,
 _transfer_dir: int,
 _transfer_pos: int1d,
}
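
For reference, a hedged sketch of how the two partition levels could then be derived from these fields using Regent's partition-by-field (assuming the particles store this field space under neighbour_part_space as elsewhere in the code, and that the cell/supercell extents are known):

var cell_space = ispace(int3d, {x_cells, y_cells, z_cells})
var supercell_space = ispace(int3d, {x_supers, y_supers, z_supers})
--each particle is coloured by the cell/supercell index it stores
var cell_partition = partition(particles.neighbour_part_space.cell_id, cell_space)
var supercell_partition = partition(particles.neighbour_part_space.supercell_id, supercell_space)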

Whether the x_cell value should be x_supercell would need testing (but I'd lean towards probably yes). I would then have two levels of cells:

  1. cell_id from {0,0,0} to {x,y,z}
  2. supercell_id from {0,0,0} to {xx,yy,zz}, where xx,yy,zz are some fraction of x,y,z. Since x,y,z are unlikely to be exactly divisible, the last supercells would probably not be full-size, which may complicate matters slightly in the tasks, as we'd have to handle the last supercells in some sensible way (probably easiest is to explicitly check if supercell_id.x*cells_per_supercell_x + x < x_count or similar; see the sketch below).
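
A sketch of the index maths for that (cells_per_supercell_* and x_count are assumed configuration values; integer division truncates, so the edge supercells simply come out smaller):

--map a cell index to the supercell containing it
var supercell_id = int3d({ cell_id.x / cells_per_supercell_x,
                           cell_id.y / cells_per_supercell_y,
                           cell_id.z / cells_per_supercell_z })
--inside a task, guard against subcells that would fall outside the domain
if supercell_id.x * cells_per_supercell_x + x < x_count then
  --subcell x exists, safe to process
end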

Making this change also probably means that the tradequeues want to work in a similar manner (on supercells instead of cells). I think there would be a tradequeue per supercell (so of size similar to the current tradequeues), however the buffer would be on a per-cell basis, so it might need to be slightly larger than just the current buffer size divided by some number.

Tradequeues would then have two operations to do - managing movement inside a supercell as well as the tradequeue movement between supercells.

LonelyCat124 commented 3 years ago

Sorted interactions can be combined with, or done independently of, the above, and could be done at a supercell or cell level. In this example code I'm going to assume we don't have the above scheme implemented/used. It also might depend on cell sizes being >= the cutoff radius (which seems likely for most cases anyway).

As with the tradequeues, we need some storage regions:

neighbour_init.SortArrays = terralib.newlist()
neighbour_init.SortArraysByCell = terralib.newlist()
for i = 1, 26 do
  neighbour_init.SortArrays:insert( regentlib.newsymbol() )
  neighbour_init.SortArraysByCell:insert( regentlib.newsymbol() )
end
....

  [(function() local __quotes = terralib.newlist()
     for i = 1,26 do
     __quotes:insert(rquote
       --Create a sort array for each direction globally, tot_parts includes padding for safety
       var sortedarray_indexSpace = ispace( int1d, tot_parts )
       var [neighbour_init.SortArrays[i]] = region(sortedarray_indexSpace, int1d); -- int1d might be some special field space for these values
       for point in [neighbour_init.SortArrays[i]].ispace do
           var x : int1d = --flatten padded_particle_array[point].neighbour_part_space.cell_id to a 1D value
           [neighbour_init.SortArrays[i]][point].x = x;
       end
      var tempspace = ispace(int1d, --max_x from loop);
       var [neighbour_init.SortArraysByCell[i]] = partition(complete, [neighbour_init.SortArrays[i]].x, tempspace);
     end)
   end
   return __quotes
   end) ()];

The premise is that we have one for each direction between cell pairs (of which there are 26), and like the tradequeues we use 1 to 26 to match their usecases. We also must make use of the same directions array, and, for sanity, create a unit vector array that matches, i.e.:

local unit_vectors = terralib.newlist({
rexpr int3d({-5.773502691896258e-01, -5.773502691896258e-01, -5.773502691896258e-01}) end, --components are -1/sqrt(3); note int3d would truncate these, so a floating-point 3-vector type is needed here
rexpr int3d({-7.071067811865475e-01, -7.071067811865475e-01, 0}) end, --components are -1/sqrt(2)
....

Notably, one improvement we could take from SWIFT is that SWIFT only stores 13 unit vectors, and then has two arrays to determine which one to refer to: e.g. in sort_part.h these are sortlistID (which maps a direction to a unit vector/sort list) and runner_flip (which says in which direction to traverse the array, forward or backward). This means cells only need sorting and storing 13 times, which saves computation and storage space.
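
A hedged sketch of that bookkeeping (my own indexing, not SWIFT's actual tables): if the 27 relative cell offsets are enumerated as (dx+1)*9 + (dy+1)*3 + (dz+1) with dx,dy,dz in {-1,0,1}, a direction and its opposite always sum to 26, so the 26 real directions collapse onto 13 sort lists:

local function sort_list_for(direction_index)
  --the centre offset (index 13) is not a pair direction
  assert(direction_index ~= 13)
  --opposite directions share a sorted list; canonicalise to the lower index
  local flipped = direction_index > 13
  local sid = flipped and (26 - direction_index) or direction_index
  --flipped == true means traverse the shared sorted list backwards
  return sid, flipped
end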

To actually do the sorting, the basic premise is contained within https://sc18.supercomputing.org/proceedings/tech_poster/poster_files/post126s2-file3.pdf - the idea being that for each unit vector the particles' positions are projected onto the vector. I'm going to update this later with how exactly this works, as it's been a while and I forget.

LonelyCat124 commented 3 years ago

I'm pretty sure the way the sorted cell lists (pseudo-Verlet lists) work is that the particles in each cell are sorted based on the dot product of their position with each unit vector.

It can be shown that, for a pair of particles p_i and p_j, if the distance between them along the unit vector is larger than the cutoff radius, then the distance between the particles in space is also larger than the cutoff.
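
Concretely, for a unit vector $\hat{\mathbf{u}}$ this is just the Cauchy-Schwarz inequality:

$$|(\mathbf{x}_j - \mathbf{x}_i) \cdot \hat{\mathbf{u}}| \le \|\mathbf{x}_j - \mathbf{x}_i\| \, \|\hat{\mathbf{u}}\| = \|\mathbf{x}_j - \mathbf{x}_i\|$$

so if the projected separation already exceeds r_cut, the true separation must as well, and the pair can be rejected without computing the full distance.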

For a given pair of cells, we can then minimise the neighbour searching:

for (int i = count_i-1; i >= 0; i--){
    for (int j = 0; j < count_j; j++){
        if( vector_pos[j] - vector_pos[i] > r_cut ) --plain distance along the axis, so compare against the cutoff, not its square
            --Exit both loops
        --Otherwise check if particles are actually in range in space
    }
}

For cells where cell_size >> r_cut, very few particles will be in range, so this can save significant time. Even when cell_size approaches r_cut, this could be beneficial.

LonelyCat124 commented 3 years ago

Ok, as well as this, I think that instead of having the concept of "pair" and "self" tasks, these would just be replaced by "self" tasks, which are also passed a halo region around them to interact with. This clearly only works for asymmetric operations, however that is fine for now. So we'd have something like:

task self_task( supercell1 : region(...), halo : region(...), config : region(...),
                supercell_id : int3d, allparts : region(...),
                smallcells : partition(allparts, ...) )
where reads/writes/reduces(supercell1...), reads(halo...) ...

LonelyCat124 commented 3 years ago

Ok, I have a minicode which demonstrates how this can be done.

We need to add 26 new fields to the neighbour_part_space, which are used for initialising the partitions but can otherwise be ignored. These fields contain, for each direction, the supercell that this particle is in the halo of, and can be computed as:

        -- -1, -1, -1
        var temp_supercell : int3d = int3d({ x_cell_m1 / cell_to_supercell_x, y_cell_m1 / cell_to_supercell_y, z_cell_m1 / cell_to_supercell_z })
        if temp_supercell == particles[part].supercell_id then
            --Not in a -1-1-1 halo
            particles[part].neighbour_cells.supercell_m1_m1_m1 = int3d({-1, -1, -1})
        else
            particles[part].neighbour_cells.supercell_m1_m1_m1 = temp_supercell
        end

I use the supercell -1, -1, -1 to denote all particles for which the supercell in the direction of choice is the same supercell as the one they're contained in.

Once these values are set, the particle region is partitioned 26 times (once for each direction) using the supercell space, and then a halo partition is formed as the zipped union of these partitions:

    var halo_partition = m1_m1_m1_partition | m1_m1_0_partition  | m1_m1_p1_partition |
                         m1_0_m1_partition  | m1_0_0_partition   | m1_0_p1_partition  |
                         m1_p1_m1_partition | m1_p1_0_partition  | m1_p1_p1_partition |
                         _0_m1_m1_partition | _0_m1_0_partition  | _0_m1_p1_partition |
                         _0_0_m1_partition  |                      _0_0_p1_partition  |
                         _0_p1_m1_partition | _0_p1_0_partition  | _0_p1_p1_partition |
                         p1_m1_m1_partition | p1_m1_0_partition  | p1_m1_p1_partition |
                         p1_0_m1_partition  | p1_0_0_partition   | p1_0_p1_partition  |
                         p1_p1_m1_partition | p1_p1_0_partition  | p1_p1_p1_partition

I've tested this on the minicode, and the cells, supercells and halo regions contain the expected number of particles.

The pairwise tasks on each cell I expect to be of the form posted in the previous comment. The best method for computing the actual interactions is still something to test, but at least a simple interaction can be done:

for x = xcell_min, xcell_max do
  for y = ycell_min, ycell_max do
    for z = zcell_min, zcell_max do
       var subcell = int3d({x, y, z})
       for each direction do
          if subcell + direction < cell_max and subcell + direction > cell_min then
           --loop over the particles in subcell and subcell+direction using indexes both from supercell1
         else
           --loop over the particles in subcell and subcell+direction using indexes from supercell1 and halo respectively
         end
       end
    end
  end
end

The final thing that needs to be done is to update the tradequeues. I think tradequeue_push and tradequeue_pull should be performed on a supercell level, i.e. tradequeues are created per supercell.

However, inside the tradequeue_pull operation the pull should again operate at a cell level, to ensure that every particle lands in the correct cell, which in turn ensures that the particles end up in the appropriate halos. This means padding should probably be computed on a per-cell basis, not a per-supercell basis.

I'll probably start implementing this this afternoon, and hopefully can get it done before July 1st.

LonelyCat124 commented 3 years ago

Ok, commit 56c7357 now has initialisation done for this, so at this point it does:

  1. Assigns all particles (valid or not) a cell_id, supercell_id and the halo supercells in all directions.
  2. Creates partitions at supercell, cell and halo levels.

What's left to do is:

LonelyCat124 commented 3 years ago

Ok, so I've done 1-5 of the above, however I'm not sure I like the way the code has ended up. It's likely significantly less efficient than the old tradequeue code. I hope to redo it tomorrow, with the following differences:

  1. The new version of compute_new_dests works nicely, and I'll probably push that tonight.
  2. Create 1 tradequeue per CELL. These will be significantly smaller (however probably == padding size), probably resulting in more overall memory used.
  3. What is probably also needed is a tradequeue colouring for supercells and subcells, where the supercell partition contains all elements of the subcell partitions. This is a bit tricky for the ordering, so I'll have to think about how to do that.
  4. Still launch tradequeue_push and tradequeue_pull on supercells, but take the subcell-tradequeue partition also as an input.
  5. In tradequeue_push and tradequeue_pull the code is then essentially the same as the old tradequeue code, but instead of doing it once for the entire cell, it is instead done for each subcell in sequence.

So for the colorings, we'll need 2 functions as follows:

  1. partition_tradequeue_by_supercells. This is essentially the current coloring code that works for supercells fine.
  2. partition_tradequeue_by_subcell. This is a new function, and will have to take the subcell, compute its supercell, then the subcell + offset and that offset's supercell, and compute the oneD_color of the two supercells. Once it has the oneD_color of the supercells, it needs to find the oneD_color of this cell (and the cell + offset) within their supercells. I'll probably have to double check the maths of this first, but it shouldn't be too bad (a possible flattening is sketched below).
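
For the record, the flattening I'd expect to use for oneD_color is the usual row-major one (the supercell extents {x_supers, y_supers, z_supers} are assumptions; the same formula applies to a cell's offset within its supercell):

var oneD_color = supercell_id.x * (y_supers * z_supers)
               + supercell_id.y * z_supers
               + supercell_id.z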

LonelyCat124 commented 3 years ago

Ok, I realised the previous comment is wrong.

Instead, partition_tradequeue_by_subcell should work based on the current coloring code, while partition_tradequeue_by_supercells needs to look at all of its subcells and colour the relevant sections contained in the supercell with the appropriate colour.

LonelyCat124 commented 3 years ago

Ok, so f87a02c contains an implementation of the new tradequeues, which passes all the DEBUG checks/assertions in my own code (which check that each particle is always in the correct cell). The remaining thing to do to improve performance (and what will make the majority of the difference) is to have the neighbour search algorithm use this setup. I'll do that tomorrow, then all should be ready to go and I'll test on the DL_MESO testcases for correct values.

As well as that, I want to make a "moving" test, where for some testcase the particles all move a fixed amount per step, repeated N times, with the number of interactions staying constant. This will ensure the movement code is exercised and that any mechanisms used to manage particle locality inside the neighbour_search algorithm are also tested.

LonelyCat124 commented 3 years ago

Ok, the new system is now available in that branch. Everything appears to work well apart from tradequeue_pull, which I'm investigating.

A downside is the compile time is now 40-45 minutes, which is pretty long. Another thing to investigate.

LonelyCat124 commented 3 years ago

Trying with the larger testcase, there appears to be a bug in tradequeue size allocation. Edit: seems to be a simple fix.

LonelyCat124 commented 2 years ago

Ok, there is an apparent out-of-bounds access in this implementation - I'm trying to see why, but it takes around 4 hours to build with -fbounds-checks 1, so it may take a while.

The current one detected is in tradequeue_push, when accessing the particles array through subcell_partition[cell_id].ispace. I'm trying to see if it just happens on the first access, or if there is some incorrect cell logic.

In terms of the performance problem on tradequeue_pull, I had a revelation and think I've worked out the cause (but not a solution yet):

A "1D" way to think of my problem is I have some partitions created with regentlib.c.legion_multi_domain_point_coloring, and 
for the 1st src partition, colour 0 could be {0,2}, {2,4}, {6,8} . For the 1st dst partition, colour 0 could be {1,3},{3,5},{7,9}.  The other 
colours are all disjoint but have similar ideas where the dst rects are +1 the src rects.

How I think this works is then the task launched on the src partition creates a PhysicalInstance  for the src partition for the index 
space {0,1,2,3,6,7}  and writes to (a subset of) that, with other index launches filliing other sections of the region. The task 
launched on the dest partition cannot find an existing PhysicalInstance  for the index space {1,2,3,4,7,8} so creates a new instance
 and then uses copy operations to fill that instance with the relevant data.
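
For concreteness, a minimal sketch of how colour 0 of those two partitions would be built (Regent rects are inclusive, so the half-open {0,2} above becomes rect1d{0, 1}; the implicit int1d/rect1d conversions are my assumption about the API):

var src_coloring = regentlib.c.legion_multi_domain_point_coloring_create()
var dst_coloring = regentlib.c.legion_multi_domain_point_coloring_create()
--colour 0 of the src partition: {0,2}, {2,4}, {6,8}
regentlib.c.legion_multi_domain_point_coloring_color_domain(src_coloring, int1d(0), rect1d{0, 1})
regentlib.c.legion_multi_domain_point_coloring_color_domain(src_coloring, int1d(0), rect1d{2, 3})
regentlib.c.legion_multi_domain_point_coloring_color_domain(src_coloring, int1d(0), rect1d{6, 7})
--colour 0 of the dst partition: the same rects shifted by +1
regentlib.c.legion_multi_domain_point_coloring_color_domain(dst_coloring, int1d(0), rect1d{1, 2})
regentlib.c.legion_multi_domain_point_coloring_color_domain(dst_coloring, int1d(0), rect1d{3, 4})
regentlib.c.legion_multi_domain_point_coloring_color_domain(dst_coloring, int1d(0), rect1d{7, 8})
var src_part = partition(aliased, queue, src_coloring, ispace(int1d, num_colours))
var dst_part = partition(aliased, queue, dst_coloring, ispace(int1d, num_colours))
--(the colorings should be destroyed once the partitions exist)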

I hope there is a way to do this, but if not we can do tradequeues on a supercell level again and have more complex tasks to manage the "inside" of a supercell, either as new tasks or inside tradequeue_pull.

LonelyCat124 commented 2 years ago

A first run showed that it worked fine for supercell 0,0,0 but failed as soon as it got into supercell 1,0,0. I'm going to output the full index space for the supercells that are encountered now, and hopefully I can work out what is happening.

LonelyCat124 commented 2 years ago

Found a bug, hopefully the only one...

LonelyCat124 commented 2 years ago

Set another bounds-checking run going.

As an extra safety measure, inside the debug code we should also assert that particles are in the correct supercell, and that the supercell containing them does actually contain their cell.
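
Something like the following inside the debug pass would cover both (field names follow the neighbour_part_space sketch earlier; the cell_to_supercell_* factors are assumptions):

var cc = particles[part].neighbour_part_space.cell_id
var sc = particles[part].neighbour_part_space.supercell_id
--the supercell stored for a particle must be the one that contains its cell
regentlib.assert(sc.x == cc.x / cell_to_supercell_x, "particle cell not in stored supercell (x)")
regentlib.assert(sc.y == cc.y / cell_to_supercell_y, "particle cell not in stored supercell (y)")
regentlib.assert(sc.z == cc.z / cell_to_supercell_z, "particle cell not in stored supercell (z)")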

LonelyCat124 commented 2 years ago

Found another bounds issue - I think I've potentially fixed it, but I'm running more tests to confirm. Edit: more issues, still investigating.

LonelyCat124 commented 2 years ago

Ok - the bounds issues are done and performance is better for the small testcase.

The current issue seems to be that this requires massive memory usage to allocate the regions that are used for the tradequeues, which I was not expecting. I'm not sure what happens in the other mapper where these are allocated; I assume they spend most of the time not being allocated in their entirety, since the data is basically "expiring", but I'm not sure.

Tradequeues might need to move back to a supercell level to avoid this allocation, but it requires more working-space/effort to do the internal work.

LonelyCat124 commented 2 years ago

Trying to work out what's going on with the DL_MESO issue, I worked out a few things:

  1. Tradequeue size does not need to be larger than the padding space with the current implementation.
  2. Particle size is approximately 24+24+9+4+4+4+8+8+8+8+8+8+8+8+8+32+32+8+4+4+8 = 229 bytes of data, plus 832 bytes for the halo fields and other internals, so 1061 bytes (~1KiB) per particle.
  3. There are 64x64x64 cells, so 262144 tradequeues. Each tradequeue is (at the moment) of size at most 10, so we require 262144 x 26 x 10 KiB, which is 68157440 KiB, or around 65 GiB. Note the original particles themselves only require ~512 MiB, with nearly 5x that space required for the dummy particles; even combined this is only ~2.56 GB, so the tradequeues are using ~26x the memory of the actual particle code right now.

The confusing thing is that this estimate seems wrong - I can't even run when requesting 170 GiB of memory, so I need to investigate further here.

There are a few things I can think about doing to improve this short-term:

  1. Tradequeues don't need to store the internal data used for the initial partitioning and then never used again. Given this is ~80% of the particle memory, finding a way to remove it would reduce memory usage by close to 80% overall.
  2. Reduce the frequency of tradequeue operations - they only need to be done once any particle (globally) has moved far enough to invalidate the neighbour search criteria. I need to double check what this criterion is, but I think it is 0.5 * (cell_size - cutoff).
  3. Think of a better way to do this in general. If I'm creating padding of size 10 for cells with an average occupancy of <2 particles initially, the code is very overhead-centric, so it will never perform particularly well.

LonelyCat124 commented 2 years ago

Performance on the large testcase doesn't look significantly improved after all this. I will try to view a profile for more information.

LonelyCat124 commented 2 years ago

Ok, so I can't get a profile on the large testcase. On a testcase 1/10th of that size, a timestep takes ~2500ms, with ~1300ms of that to do with tradequeues. I applied point 1 from above to reduce memory usage significantly (the tradequeues technically still have these fields, but they should never be mapped). tradequeue_push took ~15ms, tradequeue_pull took ~10ms, asym_pairwise tasks took ~9ms, self tasks ~15ms.

DL_MESO takes around 30ms per step right now, so there are some clear algorithmic costs that this code has. It's also clear there are significant improvements that can be made in the compute task code too. DL_POLY extracts all the neighbours and then goes over all the neighbours and computes the interactions. Maybe I can do something similar, but I think it's important to try to find out how much time is spent in the kernel vs the neighbour search, which is tricky.

LonelyCat124 commented 2 years ago

Ok, so I added timers into the self and asym_pairwise tasks which measure the time spent in the actual kernels (timed in nanoseconds, so accuracy is not perfect) vs the time spent in the neighbour search. An asym pairwise one taken at random:

runtime was 9756us, interaction time was 168us

Self taken at random:

SELF: runtime was 14917us, interaction time was 266us

It's clear from this that these need some major improvement - the self tasks are taking roughly 55x as long doing useless work as useful work. The next priority is to improve these somehow - if we can reduce the runtime of these to the ~1ms range it would be a significant improvement. I need to think about how to do this (preferably without sorting, though sorting might solve a wide range of issues).

LonelyCat124 commented 2 years ago

Ok, so a bit more info from the self interaction tasks makes it clear what the issue is:

SELF: runtime was 20653us, interaction time was 253us, distance time was 4380us
SELF: hits 2736 misses 134794

LonelyCat124 commented 2 years ago

Ok, so after an extensive bug hunt (caused by the reduction of the memory footprint), the self tasks are down to around 3.5ms and pair tasks to ~2ms, so a 5-10x saving on these, which is a big bonus. Looking at the outputs of the timing:

PAIR: runtime was 1978us, interaction time was 108us, distance time was 116us
PAIR: hits 656 misses 1930

SELF: runtime was 3793us, interaction time was 406us, distance time was 480us
SELF: hits 3856 misses 10810

There's still a lot of "missing" runtime - some of this could simply be the accuracy of the timing - but the interaction vs distance computation time is now roughly equal, which is a big bonus!

The big timestep cost is now the actual preparation of the cells, which is in line with what I'd expect to see from a standard OpenMP code as well. The good news is we don't have to do this every step, but only once the particles have moved "enough". It is sufficient to only rebuild once a particle has moved more than 0.5 * (box_any_dimension - cutoff) in that dimension. The other thing we need to compute for this to work is the maximum distance a particle has moved, and then do a sorted search for all pairs within max_cutoff + max_distance - we find more superfluous neighbours but sort/do tradequeues far less often.
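
A hedged sketch of that rebuild test (the per-particle displacement tracking and the names here are assumptions):

__demand(__leaf)
task needs_rebuild( parts : region(ispace(int1d), part), slack : double ) : bool
where reads(parts.displacement_since_sort) do
  var rebuild = false
  for p in parts.ispace do
    --rebuild once any particle has moved more than half the slack,
    --where slack = box_any_dimension - cutoff for the relevant dimension
    if parts[p].displacement_since_sort > 0.5 * slack then
      rebuild = true
    end
  end
  return rebuild
end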

I'm going to run the small testcase first to check correctness but then look to implement that.

LonelyCat124 commented 2 years ago

Ok, so the small testcase now runs in 3300s, which is significantly better but still not good. There are definitely some issues with correctness though, so I need to fix that next - I'll use a small pairwise example with some movement applied to all particles, I think.

LonelyCat124 commented 2 years ago

Ok, the correctness issues all relate to periodic wrapping and the sort distance. If you have two cells interacting over a periodic boundary, then the order in which you need to traverse through the cell is reversed, and the sort distances need wrapping as well.

I need to do some thinking of how to solve this.
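
One standard approach (a hedged sketch, not necessarily what I'll implement) is to shift the neighbour cell by the periodic box vector before projecting onto the sort axis, so both cells' sort distances live on one consistent axis, with the traversal order flipped for reversed directions exactly as runner_flip does:

--wrap_x/y/z are -1, 0 or +1 for the boundaries this cell pair crosses (assumed precomputed)
var shift_x : double = wrap_x * box_x
var shift_y : double = wrap_y * box_y
var shift_z : double = wrap_z * box_z
--project the neighbour cell's particle into the home cell's frame before sorting/comparing
var proj : double = (p_x + shift_x) * u_x + (p_y + shift_y) * u_y + (p_z + shift_z) * u_z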

LonelyCat124 commented 2 years ago

All tests I can run now seem to be correct. However, energies for a DL_MESO run are still not.