stfc / RegentParticleDSL

A particle-method DSL based on Regent programming language
MIT License

Pairwise task performance in HP tradequeues #95

Open LonelyCat124 opened 3 years ago

LonelyCat124 commented 3 years ago

Right now the main limiting factor of the performance of the code for the DL_MESO example is the pairwise task performance (and the inability to trace the tasks, but that's a separate issue which requires input from the Legion team).

Right now, the basis of the algorithm is:

  1. Divide the domain into cells with an average occupancy of >400/800 particles.
  2. For each cell in the system, perform the self task in an n^2 way (a sketch of this task is given below).
  3. For each pair of cells, perform the cross-cell interactions in an n^2 way.
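
To make the cost concrete, here is a minimal sketch of what the n^2 self task in step 2 looks like (the field names x/y/z and fx/fy/fz are stand-ins for the real particle field space, and the kernel itself is omitted):

__demand(__leaf)
task self_task_sketch( cell : region(ispace(int1d), part), cutoff2 : double )
where reads(cell.{x, y, z}), reduces+(cell.{fx, fy, fz}) do
  for i in cell.ispace do
    for j in cell.ispace do
      if i ~= j then
        var dx = cell[i].x - cell[j].x
        var dy = cell[i].y - cell[j].y
        var dz = cell[i].z - cell[j].z
        if dx*dx + dy*dy + dz*dz <= cutoff2 then
          --in range: apply the pairwise kernel via the reduction privileges
        end
      end
    end
  end
end

Every pair of particles in the cell is visited, so the cost grows quadratically with occupancy regardless of how many pairs are actually within r_cut.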

This results in a couple of contrasting issues:

  1. Individual tasks are very expensive. Performing a naive n^2 operation at ~400 particles takes multiple milliseconds, and with the volume of cells in a large simulation (e.g. 512k particles) there are ~28k tasks per launch, which means the runtime is guaranteed to be in the ~100s timeframe. With smaller cells, where r_cell ~= r_cut, the execution time of individual tasks is much lower. Right now in DL_MESO for the large testcase the cell dimensions are ~10.772174 whilst r_cut is 1.0, meaning that for an average cell pair almost none of the particles involved are actually in range!
  2. With smaller tasks, the overheads of launching tasks become increasingly problematic. This can be seen by reducing the particles-per-cell count to ~200, at which point the runtime dramatically increases. This is for 2 reasons: 1) the tradequeue_pull and tradequeue_push tasks are minute, and overheads cripple their performance; 2) the actual compute tasks also get eaten up by runtime overheads (TBC).

The likely solution is then two-fold: 1) grouping task launches so they are not simply cell-to-cell, but group-of-cells to cell or group-of-cells to group-of-cells; 2) an improved neighbour search algorithm (such as sorting particles) to keep cell sizes larger.

LonelyCat124 commented 3 years ago

I'm still trying to find concrete evidence of the 2nd issue above for the pairwise tasks.

One thing that is particularly notable is that the tradequeue tasks perform horribly with few particles per cell - the gap between executions massively outscales the execution time of the tradequeue tasks themselves.

One solution is to attempt to add more work to tradequeue tasks somehow (e.g. particle sorting).

LonelyCat124 commented 3 years ago

My first thought on cell grouping is something like modifying pair tasks to:

__demand(__leaf)
task pairwise_task( supercell1 : region(ispace(int1d), part),
                    supercell2 : region(ispace(int1d), part),
                    config : region(ispace(int1d), part),
                    allparts : region(ispace(int1d), part),
                    cell_partition : partition(disjoint, allparts, ispace(int3d)),
                    cell1 : int3d, cell2 : int3d )
where reads(supercell1...), reads(supercell2...), reduces+(supercell1...), reads(config...), reduces+(config...) do

--Loop through subcells inside supercell1 and interact with subcells in supercell2 in a "smart" way
end

and having 2 "cell partitions", based on 2 separate cell indexes in particles:

fspace neighbour_part{

 cell_id : int3d,
 supercell_id : int3d,
 x_cell : int1d,
 _valid : bool,
 _transfer_dir: int,
 _transfer_pos: int1d,
}
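
For reference, a hedged sketch of how the two partition levels could then be derived from these fields using Regent's partition-by-field (assuming the particles store this field space under neighbour_part_space as elsewhere in the code, and that the cell/supercell extents are known):

var cell_space = ispace(int3d, {x_cells, y_cells, z_cells})
var supercell_space = ispace(int3d, {x_supers, y_supers, z_supers})
--each particle is coloured by the cell/supercell index it stores
var cell_partition = partition(particles.neighbour_part_space.cell_id, cell_space)
var supercell_partition = partition(particles.neighbour_part_space.supercell_id, supercell_space)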

Whether the x_cell value should be x_supercell would need testing (but I'd lean towards probably yes). I would then have two levels of cells:

  1. cell_id from {0,0,0} to {x,y,z}
  2. supercell_id from {0,0,0} to {xx,yy,zz}, where xx,yy,zz are some fraction of x,y,z. Since x,y,z are unlikely to be exactly divisible, the last supercells would probably not be full-size, which may complicate matters slightly in the tasks, as we'd have to handle the last supercells in some sensible way (probably easiest is to explicitly check if supercell_id.x*cells_per_supercell_x + x < x_count or similar; see the sketch below).
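
A sketch of the index maths for that (cells_per_supercell_* and x_count are assumed configuration values; integer division truncates, so the edge supercells simply come out smaller):

--map a cell index to the supercell containing it
var supercell_id = int3d({ cell_id.x / cells_per_supercell_x,
                           cell_id.y / cells_per_supercell_y,
                           cell_id.z / cells_per_supercell_z })
--inside a task, guard against subcells that would fall outside the domain
if supercell_id.x * cells_per_supercell_x + x < x_count then
  --subcell x exists, safe to process
end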

Making this change also probably means that the tradequeues want to work in a similar manner (on supercells instead of cells). I think there would be a tradequeue per supercell (so of size similar to the current tradequeues), however the buffer would be on a per-cell basis, so it might need to be slightly larger than just the current buffer size divided by some number.

Tradequeues would then have two operations to do - managing movement inside a supercell as well as the tradequeue movement between supercells.

LonelyCat124 commented 3 years ago

Sorted interactions can be combined with, or done independently of, the above, and could be done at a supercell or cell level. In this example code I'm going to assume we don't have the above scheme implemented/used. It also might depend on cell sizes being >= the cutoff radius (which seems likely for most cases anyway).

As with the tradequeues, we need some storage regions:

neighbour_init.SortArrays = terralib.newlist()
neighbour_init.SortArraysByCell = terralib.newlist()
for i = 1, 26 do
  neighbour_init.SortArrays:insert( regentlib.newsymbol() )
  neighbour_init.SortArraysByCell:insert( regentlib.newsymbol() )
end
....

  [(function() local __quotes = terralib.newlist()
     for i = 1,26 do
     __quotes:insert(rquote
       --Create a sort array for each direction globally, tot_parts includes padding for safety
       var sortedarray_indexSpace = ispace( int1d, tot_parts )
       var [neighbour_init.SortArrays[i]] = region(sortedarray_indexSpace, int1d); -- int1d might be some special field space for these values
       for point in [neighbour_init.SortArrays[i]].ispace do
           var x : int1d = --flatten padded_particle_array[point].neighbour_part_space.cell_id to a 1D value
           [neighbour_init.SortArrays[i]][point].x = x;
       end
      var tempspace = ispace(int1d, --max_x from loop);
       var [neighbour_init.SortArraysByCell[i]] = partition(complete, [neighbour_init.SortArrays[i]].x, tempspace);
     end)
   end
   return __quotes
   end) ()];

The premise is that we have one for each direction between cell pairs (of which there are 26), and like the tradequeues we use 1 to 26 to match their usecases. We also must make use of the same directions array, and, for sanity, create a unit vector array that matches, i.e.:

local unit_vectors = terralib.newlist({
rexpr int3d({-5.773502691896258e-01, -5.773502691896258e-01, -5.773502691896258e-01}) end, --components are -1/sqrt(3); note int3d would truncate these, so a floating-point 3-vector type is needed here
rexpr int3d({-7.071067811865475e-01, -7.071067811865475e-01, 0}) end, --components are -1/sqrt(2)
....

Notably, one improvement we could take from SWIFT is that SWIFT only stores 13 unit vectors, and then has two arrays to determine which one to refer to: e.g. in sort_part.h these are sortlistID (which maps a direction to a unit vector/sort list) and runner_flip (which says in which direction to traverse the array, forward or backward). This means cells only need sorting and storing 13 times, which saves computation and storage space.
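
A hedged sketch of that bookkeeping (my own indexing, not SWIFT's actual tables): if the 27 relative cell offsets are enumerated as (dx+1)*9 + (dy+1)*3 + (dz+1) with dx,dy,dz in {-1,0,1}, a direction and its opposite always sum to 26, so the 26 real directions collapse onto 13 sort lists:

local function sort_list_for(direction_index)
  --the centre offset (index 13) is not a pair direction
  assert(direction_index ~= 13)
  --opposite directions share a sorted list; canonicalise to the lower index
  local flipped = direction_index > 13
  local sid = flipped and (26 - direction_index) or direction_index
  --flipped == true means traverse the shared sorted list backwards
  return sid, flipped
end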

To actually do the sorting, the basic premise is contained within https://sc18.supercomputing.org/proceedings/tech_poster/poster_files/post126s2-file3.pdf - the idea being that for each unit vector the particles' positions are projected onto the vector. I'm going to update this later with how exactly this works, as it's been a while and I forget.

LonelyCat124 commented 3 years ago

I'm pretty sure the way the sorted cell lists (pseudo-Verlet lists) work is that the particles in each cell are sorted based on the dot product of their position with each unit vector.

It can be shown that, for a pair of particles p_i and p_j, if the distance between them along the unit vector is larger than the cutoff radius, then the distance between the particles in space is also larger than the cutoff.
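
Concretely, for a unit vector $\hat{\mathbf{u}}$ this is just the Cauchy-Schwarz inequality:

$$|(\mathbf{x}_j - \mathbf{x}_i) \cdot \hat{\mathbf{u}}| \le \|\mathbf{x}_j - \mathbf{x}_i\| \, \|\hat{\mathbf{u}}\| = \|\mathbf{x}_j - \mathbf{x}_i\|$$

so if the projected separation already exceeds r_cut, the true separation must as well, and the pair can be rejected without computing the full distance.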

For a given pair of cells, we can then minimise the neighbour searching:

for (int i = count_i-1; i >= 0; i--){
    for (int j = 0; j < count_j; j++){
        if( vector_pos[j] - vector_pos[i] > r_cut ) --plain distance along the axis, so compare against the cutoff, not its square
            --Exit both loops
        --Otherwise check if particles are actually in range in space
    }
}

For cells where cell_size >> r_cut, very few particles will be in range, so this can save significant time. Even when cell_size approaches r_cut, this could be beneficial.

LonelyCat124 commented 3 years ago

Ok, as well as this, I think that instead of having the concept of "pair" and "self" tasks, these would just be replaced by "self" tasks, which are also passed a halo region around them to interact with. This clearly only works for asymmetric operations, however that is fine for now. So we'd have something like:

task self_task( supercell1 : region(...), halo : region(...), config : region(...),
                supercell_id : int3d, allparts : region(...),
                smallcells : partition(allparts, ...) )
where reads/writes/reduces(supercell1...), reads(halo...) ...

LonelyCat124 commented 3 years ago

Ok, I have a minicode which demonstrates how this can be done.

We need to add 26 new fields to the neighbour_part_space, which are used for initialising the partitions but can otherwise be ignored. These fields contain, for each direction, the supercell that this particle is in the halo of, and can be computed as:

        -- -1, -1, -1
        var temp_supercell : int3d = int3d({ x_cell_m1 / cell_to_supercell_x, y_cell_m1 / cell_to_supercell_y, z_cell_m1 / cell_to_supercell_z })
        if temp_supercell == particles[part].supercell_id then
            --Not in a -1-1-1 halo
            particles[part].neighbour_cells.supercell_m1_m1_m1 = int3d({-1, -1, -1})
        else
            particles[part].neighbour_cells.supercell_m1_m1_m1 = temp_supercell
        end

I use the supercell -1, -1, -1 to denote all particles for which the supercell in the direction of choice is the same supercell as the one they're contained in.

Once these values are set, the particle region is partitioned 26 times (once for each direction) using the supercell space, and then a halo partition is formed as the zipped union of these partitions:

    var halo_partition = m1_m1_m1_partition | m1_m1_0_partition  | m1_m1_p1_partition |
                         m1_0_m1_partition  | m1_0_0_partition   | m1_0_p1_partition  |
                         m1_p1_m1_partition | m1_p1_0_partition  | m1_p1_p1_partition |
                         _0_m1_m1_partition | _0_m1_0_partition  | _0_m1_p1_partition |
                         _0_0_m1_partition  |                      _0_0_p1_partition  |
                         _0_p1_m1_partition | _0_p1_0_partition  | _0_p1_p1_partition |
                         p1_m1_m1_partition | p1_m1_0_partition  | p1_m1_p1_partition |
                         p1_0_m1_partition  | p1_0_0_partition   | p1_0_p1_partition  |
                         p1_p1_m1_partition | p1_p1_0_partition  | p1_p1_p1_partition

I've tested this on the minicode, and the cells, supercells and halo regions contain the expected number of particles.

The pairwise tasks on each cell I expect to be of the form posted in the previous comment. The best method for computing the actual interactions is still something to test, but at least a simple interaction can be done:

for x = xcell_min, xcell_max do
  for y = ycell_min, ycell_max do
    for z = zcell_min, zcell_max do
       var subcell = int3d({x, y, z})
       for each direction do
          if subcell + direction < cell_max and subcell + direction > cell_min then
           --loop over the particles in subcell and subcell+direction using indexes both from supercell1
         else
           --loop over the particles in subcell and subcell+direction using indexes from supercell1 and halo respectively
         end
       end
    end
  end
end

The final thing that needs to be done is to update the tradequeues. I think tradequeue_push and tradequeue_pull should be performed on a supercell level, i.e. tradequeues are created per supercell.

However, inside the tradequeue_pull operation the pull should again operate at a cell level, to ensure that every particle lands in the correct cell, which in turn ensures that the particles end up in the appropriate halos. This means padding should probably be computed on a per-cell basis, not a per-supercell basis.

I'll probably start implementing this this afternoon, and hopefully can get it done before July 1st.

LonelyCat124 commented 3 years ago

Ok, commit 56c7357 now has initialisation done for this, so at this point it does:

  1. Assigns all particles (valid or not) a cell_id, supercell_id and the halo supercells in all directions.
  2. Creates partitions at supercell, cell and halo levels.

What's left to do is:

LonelyCat124 commented 3 years ago

Ok, so I've done 1-5 of the above, however I'm not sure I like the way the code has ended up. It's likely significantly less efficient than the old tradequeue code. I hope to redo it tomorrow, with the following differences:

  1. The new version of compute_new_dests works nicely, and I'll probably push that tonight.
  2. Create 1 tradequeue per CELL. These will be significantly smaller (however probably == padding size), probably resulting in more overall memory used.
  3. What is probably also needed is a tradequeue colouring for supercells and subcells, where the supercell partition contains all elements of the subcell partitions. This is a bit tricky for the ordering, so I'll have to think about how to do that.
  4. Still launch tradequeue_push and tradequeue_pull on supercells, but take the subcell-tradequeue partition also as an input.
  5. In tradequeue_push and tradequeue_pull the code is then essentially the same as the old tradequeue code, but instead of doing it once for the entire cell, it is instead done for each subcell in sequence.

So for the colorings, we'll need 2 functions as follows:

  1. partition_tradequeue_by_supercells. This is essentially the current coloring code that works for supercells fine.
  2. partition_tradequeue_by_subcell. This is a new function, and will have to take the subcell, compute its supercell, then the subcell + offset and that offset's supercell, and compute the oneD_color of the two supercells. Once it has the oneD_color of the supercells, it needs to find the oneD_color of this cell (and the cell + offset) within their supercells. I'll probably have to double check the maths of this first, but it shouldn't be too bad (a possible flattening is sketched below).
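
For the record, the flattening I'd expect to use for oneD_color is the usual row-major one (the supercell extents {x_supers, y_supers, z_supers} are assumptions; the same formula applies to a cell's offset within its supercell):

var oneD_color = supercell_id.x * (y_supers * z_supers)
               + supercell_id.y * z_supers
               + supercell_id.z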

LonelyCat124 commented 3 years ago

Ok, I realised the previous comment is wrong.

Instead, partition_tradequeue_by_subcell should work based on the current coloring code, while partition_tradequeue_by_supercells needs to look at all of its subcells and colour the relevant sections contained in the supercell with the appropriate colour.

LonelyCat124 commented 3 years ago

Ok, so f87a02c contains an implementation of the new tradequeues, which passes all the DEBUG checks/assertions in my own code (which check that each particle is always in the correct cell). The remaining thing to do to improve performance (and what will make the majority of the difference) is to have the neighbour search algorithm use this setup. I'll do that tomorrow, then all should be ready to go and I'll test on the DL_MESO testcases for correct values.

As well as that, I want to make a "moving" test, where for some testcase the particles all move a fixed amount per step, repeated N times, with the number of interactions staying constant. This will ensure the movement code is exercised and that any mechanisms used to manage particle locality inside the neighbour_search algorithm are also tested.

LonelyCat124 commented 3 years ago

Ok, the new system is now available in that branch. Everything appears to work well apart from tradequeue_pull, which I'm investigating.

A downside is the compile time is now 40-45 minutes, which is pretty long. Another thing to investigate.

LonelyCat124 commented 3 years ago

Trying with the larger testcase, there appears to be a bug in tradequeue size allocation. Edit: seems to be a simple fix.

LonelyCat124 commented 2 years ago

Ok, there is an apparent out-of-bounds access in this implementation - I'm trying to see why, but it takes around 4 hours to build with -fbounds-checks 1, so it may take a while.

The current one detected is in tradequeue_push, when accessing the particles array through subcell_partition[cell_id].ispace. I'm trying to see if it just happens on the first access, or if there is some incorrect cell logic.

In terms of the performance problem on tradequeue_pull, I had a revelation and think I've worked out the cause (but not a solution yet):

A "1D" way to think of my problem is I have some partitions created with regentlib.c.legion_multi_domain_point_coloring, and 
for the 1st src partition, colour 0 could be {0,2}, {2,4}, {6,8} . For the 1st dst partition, colour 0 could be {1,3},{3,5},{7,9}.  The other 
colours are all disjoint but have similar ideas where the dst rects are +1 the src rects.

How I think this works is then the task launched on the src partition creates a PhysicalInstance  for the src partition for the index 
space {0,1,2,3,6,7}  and writes to (a subset of) that, with other index launches filliing other sections of the region. The task 
launched on the dest partition cannot find an existing PhysicalInstance  for the index space {1,2,3,4,7,8} so creates a new instance
 and then uses copy operations to fill that instance with the relevant data.
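
For concreteness, a minimal sketch of how colour 0 of those two partitions would be built (Regent rects are inclusive, so the half-open {0,2} above becomes rect1d{0, 1}; the implicit int1d/rect1d conversions are my assumption about the API):

var src_coloring = regentlib.c.legion_multi_domain_point_coloring_create()
var dst_coloring = regentlib.c.legion_multi_domain_point_coloring_create()
--colour 0 of the src partition: {0,2}, {2,4}, {6,8}
regentlib.c.legion_multi_domain_point_coloring_color_domain(src_coloring, int1d(0), rect1d{0, 1})
regentlib.c.legion_multi_domain_point_coloring_color_domain(src_coloring, int1d(0), rect1d{2, 3})
regentlib.c.legion_multi_domain_point_coloring_color_domain(src_coloring, int1d(0), rect1d{6, 7})
--colour 0 of the dst partition: the same rects shifted by +1
regentlib.c.legion_multi_domain_point_coloring_color_domain(dst_coloring, int1d(0), rect1d{1, 2})
regentlib.c.legion_multi_domain_point_coloring_color_domain(dst_coloring, int1d(0), rect1d{3, 4})
regentlib.c.legion_multi_domain_point_coloring_color_domain(dst_coloring, int1d(0), rect1d{7, 8})
var src_part = partition(aliased, queue, src_coloring, ispace(int1d, num_colours))
var dst_part = partition(aliased, queue, dst_coloring, ispace(int1d, num_colours))
--(the colorings should be destroyed once the partitions exist)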

I hope there is a way to do this, but if not we can do tradequeues on a supercell level again and have more complex tasks to manage the "inside" of a supercell, either as new tasks or inside tradequeue_pull.

LonelyCat124 commented 2 years ago

A first run showed that it worked fine for supercell 0,0,0 but failed as soon as it got into supercell 1,0,0. I'm going to output the full index space for the supercells that are encountered now, and hopefully I can work out what is happening.

LonelyCat124 commented 2 years ago

Found a bug, hopefully the only one...

LonelyCat124 commented 2 years ago

Set another bounds-checking run going.

As an extra safety measure, inside the debug code we should also assert that particles are in the correct supercell, and that the supercell containing them does actually contain their cell.
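
Something like the following inside the debug pass would cover both (field names follow the neighbour_part_space sketch earlier; the cell_to_supercell_* factors are assumptions):

var cc = particles[part].neighbour_part_space.cell_id
var sc = particles[part].neighbour_part_space.supercell_id
--the supercell stored for a particle must be the one that contains its cell
regentlib.assert(sc.x == cc.x / cell_to_supercell_x, "particle cell not in stored supercell (x)")
regentlib.assert(sc.y == cc.y / cell_to_supercell_y, "particle cell not in stored supercell (y)")
regentlib.assert(sc.z == cc.z / cell_to_supercell_z, "particle cell not in stored supercell (z)")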

LonelyCat124 commented 2 years ago

Found another bounds issue - I think I've potentially fixed it, but I'm running more tests to confirm. Edit: more issues, still investigating.

LonelyCat124 commented 2 years ago

Ok - the bounds issues are done and performance is better for the small testcase.

The current issue seems to be that this requires massive memory usage to allocate the regions that are used for the tradequeues, which I was not expecting. I'm not sure what happens in the other mapper where these are allocated; I assume they spend most of the time not being allocated in their entirety, since the data is basically "expiring", but I'm not sure.

Tradequeues might need to move back to a supercell level to avoid this allocation, but it requires more working-space/effort to do the internal work.

LonelyCat124 commented 2 years ago

Trying to work out what's going on with the DL_MESO issue, I worked out a few things:

  1. Tradequeue size does not need to be larger than the padding space with the current implementation.
  2. Particle size is approximately 24+24+9+4+4+4+8+8+8+8+8+8+8+8+8+32+32+8+4+4+8 = 229 bytes of data, plus 832 bytes for the halo fields and other internals, so 1061 bytes (~1KiB) per particle.
  3. There are 64x64x64 cells, so 262144 tradequeues. Each tradequeue is (at the moment) of size at most 10, so we require 262144 x 26 x 10 KiB, which is 68157440 KiB, or around 65 GiB. Note the original particles themselves only require ~512 MiB, with nearly 5x that space required for the dummy particles; even combined this is only ~2.56 GB, so the tradequeues are using ~26x the memory of the actual particle code right now.

The confusing thing is that this estimate seems wrong - I can't even run when requesting 170 GiB of memory, so I need to investigate further here.

There are a few things I can think about doing to improve this short-term:

  1. Tradequeues don't need to store the internal data used for the initial partitioning and then never used again. Given this is ~80% of the particle memory, finding a way to remove it would reduce memory usage by close to 80% overall.
  2. Reduce the frequency of tradequeue operations - they only need to be done once any particle (globally) has moved far enough to invalidate the neighbour search criteria. I need to double check what this criterion is, but I think it is 0.5 * (cell_size - cutoff).
  3. Think of a better way to do this in general. If I'm creating padding of size 10 for cells with an average occupancy of <2 particles initially, the code is very overhead-centric, so it will never perform particularly well.

LonelyCat124 commented 2 years ago

Performance on the large testcase doesn't look significantly improved after all this. I will try to view a profile for more information.

LonelyCat124 commented 2 years ago

Ok, so I can't get a profile on the large testcase. On a testcase 1/10th of that size, a timestep takes ~2500ms, with ~1300ms of that to do with tradequeues. I applied point 1 from above to reduce memory usage significantly (the tradequeues technically still have these fields, but they should never be mapped). tradequeue_push took ~15ms, tradequeue_pull took ~10ms, asym_pairwise tasks took ~9ms, self tasks ~15ms.

DL_MESO takes around 30ms per step right now, so there are some clear algorithmic costs that this code has. It's also clear there are significant improvements that can be made in the compute task code too. DL_POLY extracts all the neighbours and then goes over all the neighbours and computes the interactions. Maybe I can do something similar, but I think it's important to try to find out how much time is spent in the kernel vs the neighbour search, which is tricky.

LonelyCat124 commented 2 years ago

Ok, so I added timers into the self and asym_pairwise tasks which measure the time spent in the actual kernels (timed in nanoseconds, so accuracy is not perfect) vs the time spent in the neighbour search. An asym pairwise one taken at random:

runtime was 9756us, interaction time was 168us

Self taken at random:

SELF: runtime was 14917us, interaction time was 266us

It's clear from this that these need some major improvement - the self tasks are taking roughly 55x as long doing useless work as useful work. The next priority is to improve these somehow - if we can reduce the runtime of these to the ~1ms range it would be a significant improvement. I need to think about how to do this (preferably without sorting, though sorting might solve a wide range of issues).

LonelyCat124 commented 2 years ago

Ok, so a bit more info from the self interaction tasks makes it clear what the issue is:

SELF: runtime was 20653us, interaction time was 253us, distance time was 4380us
SELF: hits 2736 misses 134794

LonelyCat124 commented 2 years ago

Ok, so after an extensive bug hunt (caused by the reduction of the memory footprint), the self tasks are down to around 3.5ms and pair tasks to ~2ms, so a 5-10x saving on these, which is a big bonus. Looking at the outputs of the timing:

PAIR: runtime was 1978us, interaction time was 108us, distance time was 116us
PAIR: hits 656 misses 1930

SELF: runtime was 3793us, interaction time was 406us, distance time was 480us
SELF: hits 3856 misses 10810

There's still a lot of "missing" runtime - some of this could simply be the accuracy of the timing - but the interaction vs distance computation time is now roughly equal, which is a big bonus!

The big timestep cost is now the actual preparation of the cells, which is in line with what I'd expect to see from a standard OpenMP code as well. The good news is we don't have to do this every step, but only once the particles have moved "enough". It is sufficient to only rebuild once a particle has moved more than 0.5 * (box_any_dimension - cutoff) in that dimension. The other thing we need to compute for this to work is the maximum distance a particle has moved, and then do a sorted search for all pairs within max_cutoff + max_distance - we find more superfluous neighbours but sort/do tradequeues far less often.
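
A hedged sketch of that rebuild test (the per-particle displacement tracking and the names here are assumptions):

__demand(__leaf)
task needs_rebuild( parts : region(ispace(int1d), part), slack : double ) : bool
where reads(parts.displacement_since_sort) do
  var rebuild = false
  for p in parts.ispace do
    --rebuild once any particle has moved more than half the slack,
    --where slack = box_any_dimension - cutoff for the relevant dimension
    if parts[p].displacement_since_sort > 0.5 * slack then
      rebuild = true
    end
  end
  return rebuild
end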

I'm going to run the small testcase first to check correctness but then look to implement that.

LonelyCat124 commented 2 years ago

Ok, so the small testcase now runs in 3300s, which is significantly better but still not good. There are definitely some issues with correctness though, so I need to fix that next - I'll use a small pairwise example with some movement applied to all particles, I think.

LonelyCat124 commented 2 years ago

Ok, the correctness issues all relate to periodic wrapping and the sort distance. If you have two cells interacting over a periodic boundary, then the order in which you need to traverse through the cell is reversed, and the sort distances need wrapping as well.

I need to do some thinking of how to solve this.
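
One standard approach (a hedged sketch, not necessarily what I'll implement) is to shift the neighbour cell by the periodic box vector before projecting onto the sort axis, so both cells' sort distances live on one consistent axis, with the traversal order flipped for reversed directions exactly as runner_flip does:

--wrap_x/y/z are -1, 0 or +1 for the boundaries this cell pair crosses (assumed precomputed)
var shift_x : double = wrap_x * box_x
var shift_y : double = wrap_y * box_y
var shift_z : double = wrap_z * box_z
--project the neighbour cell's particle into the home cell's frame before sorting/comparing
var proj : double = (p_x + shift_x) * u_x + (p_y + shift_y) * u_y + (p_z + shift_z) * u_z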

LonelyCat124 commented 2 years ago

All tests I can run now seem to be correct. However, energies for a DL_MESO run are still not.