Open LonelyCat124 opened 3 years ago
I'm still trying to find concrete evidence of the 2nd issue above for the pairwise tasks.
One thing that is particularly notable is that the tradequeue tasks perform horribly with few particles per cell - the time between task executions massively outweighs the execution time of the tradequeue tasks themselves.
One solution is to attempt to add more work to tradequeue tasks somehow (e.g. particle sorting).
My first thought on cell grouping is something like modifying pair tasks to:
__demand(__leaf)
task pairwise_task( supercell1 : region(ispace(int1d, part)), supercell2 : region(ispace(int1d, part)), config : region(ispace(int1d, part)), allparts : region(ispace(int1d, part)), cell_partition : partition(disjoint, allparts, ispace(int3d)), cell1 : int3d, cell2 : int3d)
where reads(supercell1...), reads(supercell2...), reduces+(supercell1...), reads(config...), reduces+(config...) do
--Loop through subcells inside supercell1 and interact with subcells in supercell2 in a "smart" way
end
and having 2 "cell partitions", based on 2 separate cell indexes in particles:
fspace neighbour_part{
cell_id : int3d,
supercell_id : int3d,
x_cell : int1d,
_valid : bool,
_transfer_dir: int,
_transfer_pos: int1d,
}
Whether the x_cell value should be x_supercell would need testing (but I'd lean towards probably yes). I would then have two levels of cells: cell_id from {0,0,0} to {x,y,z}, and supercell_id from {0,0,0} to {xx,yy,zz}, where xx,yy,zz are some fraction of x,y,z. Since x,y,z are unlikely to be exactly divisible, the last supercells would probably not be full-sized, which may complicate matters slightly in the tasks, as we'd have to handle the last supercells in some sensible way (probably easiest is to explicitly check if supercell_id.x*cells_per_supercell_x + x < x_count or similar). Making this change also probably means that tradequeues want to work in a similar manner (on supercells instead of cells). I think there would be a tradequeue per supercell (so of size similar to the current tradequeues), however the buffer would be on a per-cell basis, so might need to be slightly larger than just dividing the current buffer size by some number.
Tradequeues would then have two operations to do - managing movement inside a supercell as well as the tradequeue movement between supercells.
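To pin down the edge-case arithmetic, here's a small illustrative sketch (in Python rather than Regent, with hypothetical names) of the cell-to-supercell mapping and the clamping needed for the last, partial supercells:

```python
# Hypothetical sketch of the two-level cell scheme described above.
# Names (cells_per_supercell, cell_counts, ...) are illustrative only.

def supercell_of(cell, cells_per_supercell):
    """Map a 3D cell index to its supercell index by integer division."""
    return tuple(c // n for c, n in zip(cell, cells_per_supercell))

def subcells_of(supercell, cells_per_supercell, cell_counts):
    """List the cells inside a supercell, clamping the last (partial)
    supercell so we never step past the global cell counts."""
    out = []
    sx, sy, sz = supercell
    nx, ny, nz = cells_per_supercell
    for x in range(sx * nx, min((sx + 1) * nx, cell_counts[0])):
        for y in range(sy * ny, min((sy + 1) * ny, cell_counts[1])):
            for z in range(sz * nz, min((sz + 1) * nz, cell_counts[2])):
                out.append((x, y, z))
    return out
```

With 5 cells per axis grouped 2 at a time, the last supercell along each axis only holds one cell layer, which is exactly the partial-supercell case the boundary check above has to handle.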
Sorted interactions can be combined with, or done independently of, the above, and could be done on a supercell or cell level. In this example code I'm going to assume we don't have the above scheme implemented/used. It also might depend on cell sizes being >= the cutoff radius (which seems likely for most cases anyway).
As with the tradequeues, we need some storage regions:
neighbour_init.SortArrays = terralib.newlist()
neighbour_init.SortArraysByCell = terralib.newlist()
for i = 1, 26 do
neighbour_init.SortArrays:insert( regentlib.newsymbol() )
neighbour_init.SortArraysByCell:insert( regentlib.newsymbol() )
end
....
[(function() local __quotes = terralib.newlist()
for i = 1,26 do
__quotes:insert(rquote
--Create a sort array for each direction globally, tot_parts includes padding for safety
var sortedarray_indexSpace = ispace( int1d, tot_parts )
var [neighbour_init.SortArrays[i]] = region(sortedarray_indexSpace, int1d); -- int1d might be some special field space for these values
for point in [neighbour_init.SortArrays[i]].ispace do
var x : int1d = --flatten padded_particle_array[point].neighbour_part_space.cell_id to a 1D value
[neighbour_init.SortArrays[i]][point].x = x;
end
var tempspace = ispace(int1d, --max_x from loop);
var [neighbour_init.SortArraysByCell[i]] = partition(complete, [neighbour_init.SortArrays[i]].x, tempspace);
end)
end
return __quotes
end) ()];
The premise is we have one for each direction between cell pairs (of which there are 26 directions), and like the tradequeues we use 1 to 26 to match their usecases. We also must make use of the same directions array, and for sanity create a unit vector array that matches, i.e.:
local unit_vectors = terralib.newlist({
rexpr int3d({-5.773502691896258e-01, -5.773502691896258e-01, -5.773502691896258e-01}) end, --components are -1/sqrt(3); note these are fractional, so an integer vector type would truncate them
rexpr int3d({-7.071067811865475e-01, -7.071067811865475e-01, 0}) end, --components are -1/sqrt(2)
....
Notably, one improvement that we could take from SWIFT is that SWIFT only stores 13 unit vectors, and then has two arrays to determine which to refer to, e.g. in sort_part.h these are sortlistID (which maps a direction to a unit vector/sort list) and runner_flip (which says which direction to traverse the array, forward or backward). This means cells only need sorting and storing 13 times, which saves computation and storage space.
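The 13-list idea can be sketched as follows (illustrative Python, not SWIFT's actual tables): opposite directions share a sorted list, plus a flip flag for the traversal order:

```python
# Illustrative sketch of the 13-list scheme: opposite directions share a
# sorted list and differ only in traversal order. Table names are made up.

# All 26 non-zero direction vectors between neighbouring cells.
directions = [(x, y, z)
              for x in (-1, 0, 1) for y in (-1, 0, 1) for z in (-1, 0, 1)
              if (x, y, z) != (0, 0, 0)]

def canonical(d):
    """Pick one representative per +/- pair of directions."""
    neg = tuple(-c for c in d)
    return max(d, neg)

axes = sorted({canonical(d) for d in directions})  # 13 unique axes

# sortlist_id[d] -> which of the 13 sorted lists this direction uses;
# flip[d]        -> whether to traverse that list backwards.
sortlist_id = {d: axes.index(canonical(d)) for d in directions}
flip = {d: canonical(d) != d for d in directions}
```

Two opposite directions map to the same list but opposite flip flags, so each cell only needs to be sorted and stored 13 times rather than 26.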
To actually do the sorting, the basic premise is contained within https://sc18.supercomputing.org/proceedings/tech_poster/poster_files/post126s2-file3.pdf , the idea being that for each unit vector the particles' positions are projected onto the vector. I'm going to update this later with how exactly this works, as it's been a while and I forget.
I'm pretty sure the way the sorted cell lists (pseudo-Verlet lists) work is that the particles in each cell are sorted based on the dot product of their position with each unit vector. It can be shown for a pair of particles pi and pj that if the distance between them along the unit vector is larger than the cutoff radius, then the distance between the particles in space is also larger than the cutoff.
For a given pair of cells, we can then minimise the neighbour searching:
for (int i = count_i-1; i >= 0; i--){
    for (int j = 0; j < count_j; j++){
        /* the projected gap is compared against rcut (not rcut squared) */
        if (vector_pos[j] - vector_pos[i] > rcut){
            if (j == 0)
                goto done; /* even the closest j is out of range; smaller i only widen the gap */
            break;         /* all later j are further along the axis; move to the next i */
        }
        /* otherwise check if the particles are actually in range in space */
    }
}
done: ;
For cells where cell_size >> rcut, very few particles will be in range, so this can save significant time. Even when cell_size approaches rcut, this could be beneficial.
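As a sanity check of the pruning argument (the projected separation is a dot product with a unit vector, so by Cauchy-Schwarz it never exceeds the true separation), here is an illustrative Python sketch of the sorted pair search; function names are made up and the real code would be Regent:

```python
import math

def proj(p, axis):
    """Project a position onto a unit vector (dot product)."""
    return sum(a * b for a, b in zip(p, axis))

def sorted_pairs(parts_i, parts_j, axis, rcut):
    """Pseudo-Verlet sketch: both cells sorted by projection onto `axis`.
    Since |projected separation| <= |true separation|, a projected gap
    larger than rcut safely excludes that pair and every later j."""
    parts_i = sorted(parts_i, key=lambda p: proj(p, axis))
    parts_j = sorted(parts_j, key=lambda p: proj(p, axis))
    pairs = set()
    for i in range(len(parts_i) - 1, -1, -1):
        for j in range(len(parts_j)):
            if proj(parts_j[j], axis) - proj(parts_i[i], axis) > rcut:
                break  # later j are even further along the axis
            if math.dist(parts_i[i], parts_j[j]) <= rcut:
                pairs.add((parts_i[i], parts_j[j]))
    return pairs
```

Comparing the result against a brute-force double loop over all pairs gives an identical pair set, since the early exit only discards pairs whose projected gap already exceeds the cutoff.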
Ok, as well as this I think instead of having the concept of "pair" and "self" tasks, these would just be replaced by "self" tasks, which are also passed a halo region around them to interact with. This clearly only works for asymmetric operations, however that is fine for now. So we'd have something like
task self_task( supercell1 : region(...), halo : region(...), config : region(...), supercell_id : int3d,
                allparts : region(...), smallcells : partition(allparts, ...) )
where reads/writes/reduces(supercell1...), reads(halo...) ...
Ok, I have a minicode which demonstrates how this can be done.
We need to add 26 new fields to the neighbour_part_space which are used for initialising the partitions, but can otherwise be ignored. These fields contain the value of the supercell that this particle is in the halo of, for each direction, and can be computed as:
-- direction -1, -1, -1
var temp_supercell : int3d = int3d({ x_cell_m1 / cell_to_supercell_x, y_cell_m1 / cell_to_supercell_y, z_cell_m1 / cell_to_supercell_z })
if temp_supercell == particles[part].supercell_id then
--Not in a -1-1-1 halo
particles[part].neighbour_cells.supercell_m1_m1_m1 = int3d({-1, -1, -1})
else
particles[part].neighbour_cells.supercell_m1_m1_m1 = temp_supercell
end
I use the supercell -1, -1, -1 to denote all particles for whom the supercell in the direction of choice is the same supercell as the one they're contained in.
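For illustration, the same computation for all 26 directions can be sketched in Python (hypothetical names, and ignoring the periodic wrapping at box edges that the real code would need):

```python
# Hedged sketch of the halo-field computation above, done for all 26
# directions: for each particle's cell, record which supercell a one-cell
# step in that direction lands in; {-1,-1,-1} (an impossible supercell)
# marks "not in that halo". Periodic wrapping is intentionally omitted.

NOT_IN_HALO = (-1, -1, -1)

def halo_supercells(cell, cells_per_supercell):
    """Return {direction: halo supercell} for a particle in `cell`."""
    own = tuple(c // n for c, n in zip(cell, cells_per_supercell))
    result = {}
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                if (dx, dy, dz) == (0, 0, 0):
                    continue
                shifted = (cell[0] + dx, cell[1] + dy, cell[2] + dz)
                neighbour = tuple(c // n for c, n in
                                  zip(shifted, cells_per_supercell))
                result[(dx, dy, dz)] = (NOT_IN_HALO if neighbour == own
                                        else neighbour)
    return result
```

A cell in the interior of a supercell gets NOT_IN_HALO for every direction, while a cell on a supercell face contributes to the halos of the adjacent supercells.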
Once these values are set, the particle region is partitioned 26 times (for each direction) using the supercell space, and then a halo region is formed as the zipped union of these partitions:
var halo_partition = m1_m1_m1_partition | m1_m1_0_partition | m1_m1_p1_partition |
m1_0_m1_partition | m1_0_0_partition | m1_0_p1_partition |
m1_p1_m1_partition | m1_p1_0_partition | m1_p1_p1_partition |
_0_m1_m1_partition | _0_m1_0_partition | _0_m1_p1_partition |
_0_0_m1_partition | _0_0_p1_partition |
_0_p1_m1_partition | _0_p1_0_partition | _0_p1_p1_partition |
p1_m1_m1_partition | p1_m1_0_partition | p1_m1_p1_partition |
p1_0_m1_partition | p1_0_0_partition | p1_0_p1_partition |
p1_p1_m1_partition | p1_p1_0_partition | p1_p1_p1_partition
I've tested this on the minicode, and the cells, supercells and halo regions contain the expected number of particles.
The pairwise tasks on each cell I expect to be of the form posted in the previous comment. The best method for computing the actual interactions is still something to test, but at least a simple interaction can be done:
for x = xcell_min, xcell_max do
for y = ycell_min, ycell_max do
for z = zcell_min, zcell_max do
var subcell = int3d({x, y, z})
for each direction do
if subcell + direction < cell_max && subcell + direction > cell_min then
--loop over the particles in subcell and subcell+direction using indexes both from supercell1
else
--loop over the particles in subcell and subcell+direction using indexes from supercell1 and halo respectively
end
end
end
end
end
The final thing that needs to be done is to update the tradequeues. I think tradequeue_push and tradequeue_pull should be performed on a supercell level, i.e. tradequeues are created per supercell.
However, inside the tradequeue_pull operation the logic should again operate at a cell level, to ensure that every particle ends up in the right cell, which in turn ensures that particles end up in the appropriate halos. This means padding should probably be computed on a per-cell basis, not a per-supercell basis.
I'll probably start implementing this this afternoon, and hopefully can get it done before July 1st.
Ok, commit 56c7357 now has initialisation done for this, so at this point it does cell_id, supercell_id and the halo supercells in all directions. What's left to do is:
1. compute_new_dests function to compute the cell and supercell of new particles.
2. tradequeue_push call to use the supercell partition.
3. tradequeue_push call to base its decisions on supercells instead of cells.
4. tradequeue_pull call to use the supercell partition, and also take in the cell partition as an input.
5. tradequeue_pull call to place particles from the tradequeue into the correct cell, and also check its own cells for any particles in the wrong cell, and move them to the correct cell.
6. neighbour_search tasks to use the new supercell and cell distinctions.
Ok, so I've done 1-5 of the above, however I'm not sure I like the way the code has ended up. It's likely significantly less efficient than the old tradequeue code. I hope to redo it tomorrow, with the following differences:
1. compute_new_dests works nicely, and I'll probably push that tonight.
2. tradequeue_push and tradequeue_pull operate on supercells, but take the subcell-tradequeue partition also as an input.
3. Inside tradequeue_push and tradequeue_pull the code is then essentially the same as the old tradequeue code, but instead of doing it once for the entire cell, it is done for each subcell in sequence.
So for the colorings, we'll need 2 functions as follows:
1. partition_tradequeue_by_supercells. This is essentially the current coloring code, which works fine for supercells.
2. partition_tradequeue_by_subcell. This is a new function, and will have to take the subcell, compute its supercell, the subcell + offset and that offset's supercell, and then compute the oneD_color of the two supercells. Once it has the oneD_color of the supercells, it needs to find the oneD_color of this cell (and the cell + offsets) within their supercells. I'll have to double check the maths of this first, but it shouldn't be too bad.
Ok, I realised the previous comment is wrong. Instead, partition_tradequeue_by_subcell should work based on the current coloring code, while partition_tradequeue_by_supercells needs to look at all of its subcells and color the relevant sections contained in this supercell with the appropriate colour.
Ok, so f87a02c contains an implementation of the new tradequeues, which passes all the DEBUG checks/assertions in my own code (which check that each particle is always in the correct cell). The remaining thing to do to improve performance (and what will make the majority of the difference) is to have the neighbour search algorithm use this setup. I'll do this tomorrow, then all should be ready to go and I'll test on the DL_MESO testcases for correct values.
As well as that, I want to make a "moving" test, where for some testcase the particles all move a fixed amount per step, repeated N times, with the number of interactions being constant. This will ensure the movement code is exercised and that any mechanisms used to manage particle locality inside the neighbour_search algorithm are also tested.
Ok, the new system is now available in that branch. Everything appears to work well apart from tradequeue_pull, which I'm investigating.
A downside is the compile time is now 40-45 minutes, which is pretty long. Another thing to investigate.
Trying with the larger testcase there appears to be a bug in tradequeue size allocation. Edit: seems to be a simple fix.
Ok, there is an apparent out-of-bounds access in this implementation - I'm trying to see why, but it takes around 4 hours to build with -fbounds-checks 1, so may take a while.
The currently detected one is in tradequeue_push, when accessing the particles array through subcell_partition[cell_id].ispace. I'm trying to see if it just happens on the first, or if there is some incorrect cell logic.
In terms of the performance problem on tradequeue_pull, I had a revelation and think I worked out the cause (but not a solution yet):
A "1D" way to think of my problem is I have some partitions created with regentlib.c.legion_multi_domain_point_coloring, and
for the 1st src partition, colour 0 could be {0,2}, {2,4}, {6,8}. For the 1st dst partition, colour 0 could be {1,3}, {3,5}, {7,9}. The other
colours are all disjoint but follow a similar idea, where the dst rects are the src rects shifted by +1.
How I think this works is that the task launched on the src partition creates a PhysicalInstance for the src partition for the index
space {0,1,2,3,6,7} and writes to (a subset of) that, with other index launches filling other sections of the region. The task
launched on the dest partition cannot find an existing PhysicalInstance for the index space {1,2,3,4,7,8}, so it creates a new instance
and then uses copy operations to fill that instance with the relevant data.
I hope there is a way to do this, but if not, we can do tradequeues on a supercell level again and have more complex tasks to manage "inside" a supercell, either as new tasks or inside tradequeue_pull.
A first run showed that it worked fine for supercell 0,0,0, but failed as soon as it got into supercell 1,0,0.
I'm going to output the full index space for the supercells that are encountered now, and hopefully can work out what is happening.
Found a bug, hopefully the only one...
Set another bound trace run going.
As an extra safety measure, inside the debug code we should also assert that particles are in the correct supercell, and also that the supercell containing them does actually contain their cell.
Found another bounds issue - I think I've potentially fixed it, but I'm running more tests to confirm. Edit: more issues, still investigating.
Ok - the bounds issues are done and performance is better for the small testcase.
Current issue seems to be that this requires massive memory usage to allocate the Regions that are used for the TradeQueues, which I was not expecting. I'm not sure what happens in the other mapper where these are allocated, I assume they spend most of the time not being allocated in their entirety since the data is basically "expiring", but I'm not sure.
Tradequeues might need to move back to a supercell level to avoid this allocation, but it requires more working-space/effort to do the internal work.
Trying to work out what's going on with the DL_MESO issue, I worked out a few things. 1) Tradequeue size does not need to be larger than the padding space with the current implementation. 2) Particle size is approximately 24+24+9+4+4+4+8+8+8+8+8+8+8+8+8+32+32+8+4+4+8 = 229 bytes of data, plus 832 bytes for the halo internals, so 1061 bytes (or ~1 KiB) per particle. 3) 64x64x64 cells, so 262144 tradequeues. Each tradequeue is (at the moment) of size at most 10, so we have 262144 * 26 * 10 KiB required, which is 68157440 KiB, or around 68 GiB. Note the original particles themselves only require 512kiB, with nearly 5x that space required for the dummy particles. Even combined this is only ~2.56GiB, so the tradequeues are using ~26x the memory of the actual particle code right now.
The confusing thing is this estimate seems wrong - I can't even run when requesting 170 GiB of memory so I need to investigate further here.
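The arithmetic above can be restated as a quick check (values taken straight from the text, treating each tradequeue slot as ~1 KiB per the estimate):

```python
# Reproducing the back-of-the-envelope memory estimate from the text.
particle_data = 24+24+9+4+4+4 + 8*9 + 32+32+8+4+4+8  # field sizes in bytes
halo_internals = 832                                  # per-particle halo fields
per_particle = particle_data + halo_internals         # bytes per particle

n_tradequeues = 64 * 64 * 64   # one tradequeue per cell
directions = 26                # one queue set per direction
slots = 10                     # current max tradequeue size
total_kib = n_tradequeues * directions * slots  # ~1 KiB per slot
```

This reproduces the 229-byte, 1061-byte and 68157440 KiB figures quoted above, so the observed allocation being even larger than this suggests something beyond the raw queue storage is being allocated.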
There are a few things I can think about doing to improve this short-term:
1) Tradequeues don't need to store the internal data used for the initial partitioning and then never used again. Given this is ~80% of the particle memory, finding a way to remove this would reduce memory usage by close to 80% overall.
2) Reduce the frequency of tradequeue operations - they only need to be done once any particle (globally) has moved far enough to invalidate the neighbour search criterion. I need to double check what this criterion is, but I think it is 0.5 * (cell_size - cutoff).
3) Need to think of a better way to do this in general. If I'm creating padding of size 10 for cells with an average of <2 particles initially, the code is very overhead-centric, so will never perform particularly well.
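A minimal sketch of what point 2 could look like (Python, hypothetical names; the 0.5 * (cell_size - cutoff) criterion is the unverified one quoted above):

```python
# Hedged sketch of the deferred-rebuild idea: track the largest distance any
# particle has moved since the last sort/tradequeue rebuild, and only redo
# that work once the safety margin may have been violated. Both the function
# names and the exact criterion are assumptions from the text.

def max_displacement(old_positions, new_positions):
    """Largest per-axis distance any particle has moved since the rebuild."""
    return max(max(abs(n - o) for n, o in zip(new, old))
               for old, new in zip(old_positions, new_positions))

def needs_rebuild(old_positions, new_positions, cell_size, cutoff):
    """True once some particle may have drifted into an unsearched cell."""
    return max_displacement(old_positions, new_positions) > 0.5 * (cell_size - cutoff)
```

Until needs_rebuild fires, steps can reuse the existing cell lists and skip the tradequeue traffic entirely.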
Performance on the large testcase doesn't look significantly improved after all this. I will try to view a profile for more information.
Ok, so I can't get a profile on the large testcase. On a testcase of 1/10th that size a timestep takes ~2500ms, with ~1300ms to do with tradequeues. I applied point 1 from above to reduce memory usage significantly (the tradequeues technically still have these fields, but they should never be mapped). Tradequeue push took ~15ms, tradequeue pull took ~10ms, asym_pairwise tasks took ~9ms, self tasks ~15ms.
DL_MESO takes around 30ms per step right now, so there are some clear algorithmic costs in this code. It's also clear there are significant improvements that can be made in the compute task code too. DL_POLY extracts all the neighbours and then goes over them and computes the interactions. Maybe I can do something similar, but I think it's important to try to find out how much time is spent in the kernel vs the neighbour search, which is tricky.
Ok, so I added timers into the self and asym_pairwise tasks which aim to measure the time spent in the actual kernels (measured in nanoseconds, so accuracy is not perfect) vs the time spent in the neighbour search. An asym pairwise one taken at random:
runtime was 9756us, interaction time was 168us
Self taken at random:
SELF: runtime was 14917us, interaction time was 266us
It's clear from this that these need some major improvement - the self tasks are spending roughly 55x as long on useless work as on useful work. The next priority is to improve these somehow - if we can reduce their runtime to the ~1ms range it would be a significant improvement. I need to think about how to do this (preferably without sorting, though sorting might solve a wide range of issues).
Ok, so a bit more info from the self interaction tasks makes it clear what the issue is:
SELF: runtime was 20653us, interaction time was 253us, distance time was 4380us
SELF: hits 2736 misses 134794
Ok, so after an extensive bug hunt (caused by the reduction of the memory footprint), the self tasks are down to around 3.5ms, and pair tasks to ~2ms, so a 5-10x saving on these, which is a big bonus. Looking at the timing outputs:
PAIR: runtime was 1978us, interaction time was 108us, distance time was 116us
PAIR: hits 656 misses 1930
SELF: runtime was 3793us, interaction time was 406us, distance time was 480us
SELF: hits 3856 misses 10810
There's still a lot of "missing" runtime - some of which could simply be the accuracy of the timing - but the interaction and distance computation times are now roughly equal, which is a big bonus!
The big timestep cost is now the actual preparation of the cells, which is in line with what I'd expect to see from a standard OpenMP code as well. The good news is we don't have to do this every step, but only once the particles have moved "enough". It is sufficient to only rebuild once a particle has moved more than 0.5 * (box_any_dimension - cutoff) in that dimension. The other thing we need to compute for this to work is the maximum distance a particle has moved, and then do a sorted search for all pairs within max_cutoff + max_distance - we find more superfluous neighbours but sort/do tradequeues far less often.
I'm going to run the small testcase first to check correctness but then look to implement that.
Ok, so the small testcase now runs in 3300s, which is significantly better but still not good. There are definitely some issues with correctness though, so I need to fix those next - I'll use a small pairwise example with some movement applied to all particles, I think.
Ok, the correctness issues all relate to periodic wrapping and sort distance. If you have two cells interacting over a periodic boundary, then the order you need to traverse through the cell is reversed, and the sort distances need wrapping as well.
I need to do some thinking of how to solve this.
All the tests I can run now seem to be correct. However, energies for a DL_MESO run are still not.
Right now the main limiting factor of the code's performance on the DL_MESO example is the pairwise task performance (and the inability to trace the tasks, but that's a separate issue which requires input from the Legion team).
Right now, the basis of the algorithm is:
This results in a couple of contrasting issues:
The likely solution is then two-fold: 1) Grouping task launches, not simply cell to cell, but groups of cells to a cell or groups of cells to groups of cells. 2) An improved neighbour search algorithm (such as sorting particles) to keep cell sizes larger.