WIP: Dl meso io and base implementation

LonelyCat124 commented 3 years ago

TODO list:

[x] Scan field
[x] Read field
[x] Scan control
[x] Read control
[x] DL_MESO specific config field space.
[ ] Modularise read and writes so we don't mess up the global namespace.
[x] Switch from strcmp to strncmp to match exactly the DL_MESO fortran string comparisons
[x] Add regentlib.assert(false, ...) to throw an error when encountering a not yet supported keyword in the input files.
[x] Write CORREL
[x] Write OUTPUT
[x] Write HISTORY
[x] Write REVIVE
[x] Write export (?)
[ ] Check the outputs are correctly laid out / fix them to be the same as Fortran if required.
[x] Kernels have access to the config type ( #84 )
[x] Kernel implementations
[x] Main function.
[x] Fixing #86
[x] Fixing #88
[x] Fixing #89

Would be nice to also get documentation before merging but may be tight on timescale I want to do this...

LonelyCat124 commented 3 years ago

Added in most of the output routines (untested) - the remaining functions are write_output_result onwards.

LonelyCat124 commented 3 years ago

nspec currently doesn't seem to be set in read_field. Edit: Looks like I just missed this. Line 257 in my file. Line 2061 in dl_meso's read_module.F90

LonelyCat124 commented 3 years ago

Hoping to have the initialisation function done today - that required WAY more code than expected.

LonelyCat124 commented 3 years ago

Ok, so a whole bunch of files have now been added to create the initialisation function.

This doesn't quite yet work, as certain things are not being read in correctly. Notably:

 species data

                population        mass            charge            frozen
          PLE1               0    1.000000e+00    0.000000e+00           F                                                                                                           
          PLE2               0    1.000000e+00    0.000000e+00           F
          PLE3               0    1.000000e+00    0.000000e+00           F

This should be 333 333 334 for population.

Following from this, the energy parameters are also incorrect (all 0 when they shouldn't be, but this could be due to the previous issue).

Dissipative parameters are also incorrect as non-self interactions have viscosity 0 (instead of 1) and there is no "random force" values.

Finally the statistics are all -nan - however this is likely due to there being no particles with values in the system.

Edit: energy parameters now have values but are still wrong.

LonelyCat124 commented 3 years ago

Ok, so this afternoon I've fixed most of the bugs with the initialisation of the system - the remaining bugs I'm certain of are:

initial particle positions appear to all be 0. My current guess is that nplz, nply and nplx are all wrong, leading to disx and disy being out by a factor of 100, and then disz is just wrong - these values should all be the same at ~1.0322801154563670
Initial particle velocities are different (which is fine) but may be too large sometimes - in DL_MESO I see no abs velocity value above ~1.51, whilst in the Regent code there is no value below 0.1 (all are in the range 1-10). This may be an artifact of the first bug though so I'll wait to see.
Virial, Kinetic energy and some other values are not being set or summed correctly. Again this may be related to the first bug.

LonelyCat124 commented 3 years ago

Initial temperature scaling is nmow fixed so things look sensible. However the virial and pressure do not yet match - i believe these are independent from the random number generation so I'll look into this next.

LonelyCat124 commented 3 years ago

Particle with id 0 exists - check why

LonelyCat124 commented 3 years ago

Having some issues with IO functions that haven't yet been used (e.g. HISTORY file). However the implementation of science seems approximately correct now!

LonelyCat124 commented 3 years ago

Ok, I found the bug I think, I wasn't checking to only output "valid" particles.

As part of the solution all neighbour search algorithms must provide a function as follows:

__demand(__inline)
task neighbour_init.check_valid(ns : neighbour_part)

    return "BOOLEAN"

end

The structure of this could still change - however I believe the requirement of the task to be inlined should avoid it being costly. For neighbour search algorithms that don't create "extra" particles this can just be return true, for the tradequeues it checks ns._valid.

LonelyCat124 commented 3 years ago

As a note - the dumps to the OUTPUT file are not guaranteed to be correct now, particularly in write_output_result we should check for validity.

LonelyCat124 commented 3 years ago

Ok, initial profiling of the DL_MESO implementation.

First thing - each timestep is ~1.5s, and avg CPU usage is very low.

self_tasks have between 50-150 interactions per cell, which run in roughly 83us (far below the ideal 1ms). The time between the start of self_tasks is still on the order of 1ms.
Most pairwise tasks have 0 interactions per pair, and run in ~100us (below the ideal 1ms). Additionally, the gap between these tasks is ~3ms to ~10ms. Most of this time is the "Task Physical Dependence Analysis" (~2.3ms) followed by "map task" (~1.6ms)
Once the tasks complete, there is a long "Copy Fill Aggregation` in the utility processors, during which nothing happens in the CPU code.. Presumably this is handling the reduction operations.

One possible issue with this testcase is that perhaps the particle count is just way too low for this methodology/algorithm implementation. This case has 1000 particles total, which is small enough that a single cell per timestep would probably be manageable. I am going to double-check how the performance of a naive n^2 implementation behaves if I get chance. (Note that the code currently aims for ~800 ppc so 1000 is basically what we expect n^2 to take 1ms to run).

The other thing to test is the Mixture_Large testcase. That example features approx half a million so likely would be much more useful - it also takes ~1s/step for DL_MESO itself (around 30 minutes total and runs 3k steps).

LonelyCat124 commented 3 years ago

Ok, so for the Mixture_Small the most naive n^2 implementation takes ~5ms/step (compared to dl_meso 0.4/step) which is SO much better, so its safe to conclude that for such small examples we need more trivial algorithms. The main search loop takes ~3.5ms, which given the naive algorithm seems fine to me.

I plan to run the Mixture_Large soon therefore to try to get better performance from the real algorithms.

LonelyCat124 commented 3 years ago

2D array accesses in Terra are reversed, so where in C you would do a[i][j] in terra you need to do a[j][i] - This is wrong throughout the DL_MESO implementation and may cause issues.

LonelyCat124 commented 3 years ago

Ok, so performance. Right now performance is pretty bad - notably for the large testcase the memory allocation for the reductions is dominating the actual runtime. Neither cpu nor util processors have high % utilisation - memory allocation is bottleneck?

Need to think about mapping and if we can solve this in some way.

Mapping: Machine model - mapper needs to know what processors/memories exist, and layout relative to each other. Right now on a single node probably don't need to play around with this. Processors - LOC(CPU), TOC (GPU), UTILITY(utility cores, LOC for runtime), IO (for IO processes - could be useful later?), PROC_SET (processor groups, e.g. NUMA regions), OPENMP (Not used yet in the DSL). Memories - GLOBAL (full machine memory for GASNet), SYSTEM (Local system RAM). FRAME_BUFFER/ZERO_COPY (for GPUs), DISK(can be used for out-of-core)/HDF5. Affinities - Processor->Memory, Memory->Memory, Memory-Processor - these are just lists of affinities.

Task Variants - multiple variants of tasks can exist, e.g. GPU/CPU implementations. Not supported yet in DSL. Variants can have associated constraints.

Physical instance is a copy of the data corresponding to a region. There can be 0, 1 or many instances of a region at any time. These can be valid or invalid, live in a specific memory, have a specific layour, are allocated explicitly by the mapper and are garbage collected by the runtime. Multiple physical instance of a region can exist simultaneously including DIFFERENT VERSIONS of the same data.

Two stages to mapper - first logical analysis, followed by decision on whether to make a copy of the data to break dependencies.

Mapper calls: Picking a processor:

Select task options
Slice task - break up index luanches into chunks and distribute. Fixes the node of the task
Map task - bind task to a processor.

Layout constraints: Tasks can have layout constraints on physical instances, e.g. This task requires data in a row major order Constants let you do that - don't specify an exact layout, and multiple instances may satisfy the constraints.

Selecting Physical instances:

The default mapper checks if an existing instance exists, if so and it has affinity to the processor then it gives that.
Otherwise it creates a new instance.
However, reduction instances are always create new - never reused. Extra bad for GPU tasks.

Virtual Mappings: Mapper can choose to map a region to no instance if the task does not use the region itself.

Index Launches:

Regent will look for loops with task launches that don't interfere and index launch those.

Control replication:

Come back to this.
Without it - width of index task launches increases with size of machine.

LonelyCat124 commented 3 years ago

Fixed array indexing issue.

Tried implementing an index launched version for asymmetric tasks, currently it doesn't appear to be running on scafellpike at all (not building) so I need to work out what I messed up.

LonelyCat124 commented 3 years ago

Fixed the issue. Asymmetric tasks aren't generally supported in most neighbour search systems right now, only the HP one.

LonelyCat124 commented 3 years ago

Ok, so it looks like c18749d will need to be reverted (it broke everything). Array indexing is still an issue to be solved, though its possible it could be solved by reversing the indexes when defining arrays (testing this now). Its fixed, and will be pushed soon.

LonelyCat124 commented 3 years ago

Ok, I pushed some new things.

Array indexing issue is fixed.

First mappers have been pushed. The tradequeue mapper is just the default mapper from Legion and I don't plan to change much about it at the current time.

The high_perform mapper aims to do the following:

Have some state where I can assign tasks on a given cell (partition element - can I check this during mapping?) to a preferred (set of) LOCs. This probably depends on memory structure/TOC(GPU) presence/etc. at a later date.
For a given LOC & cell combination, create a single PhysicalInstance of a reduction clause that is reused during a timestep. Not sure how to define a "timestep" in this context, particularly as in each real timestep in general the above sections (UNIT A and UNIT B) could appear multiple times in any combination of orders with different kernels inside tasks, but never broken up. I think there's some concept of epochs in Legion but I'm not quite clear on them.

At this point, map_task supposedly tries to reuse reduction clauses for the same processor. However this (for sure) is probably broken. I need to dive into default_create_custom_instance and friends to check they don't do something to override what map_task wants to do. I also need to ensure that the GC doesn't just delete the reduction instances as soon as they're done with.

LonelyCat124 commented 3 years ago

The new map_task is much better on memory usage, still some performance issues remain.

The upcoming commit adds functionality to resolve #86

LonelyCat124 commented 3 years ago

Added some additional documentation.

LonelyCat124 commented 3 years ago

The remaining 3 points are ok to leave until I have more time to work on this sector. This is ready to merge if/when other projects start that might require the feature set, but for now will be left open to continue performance enhancement.

LonelyCat124 commented 3 years ago

Manually squashed commits to attempt to better explain the processes done during this pull request.

LonelyCat124 commented 3 years ago

This branch is now good to go, but is waiting on any of the following criteria before merging:

Improved performance. This likely comes through working tracing functionality, better per-cell sizes, combining elements or improved algorithms.
Beginning new projects relying on infrastructure built into this branch (likely, 1 July start date (ish) )

LonelyCat124 commented 3 years ago

In terms of performance, I tried (without tracing) different cell sizes on the large dl_meso volume for 2 timesteps: ppc >= 800, 8x8x8 cells, 189.9s ppc >= 1600, 8x8x4 cells, 334.3s ppc >= 400, 17x8x8 cells, 178.6s ppc >= 200, 17x17x8 cells, 750.4s

My assumptions at this point (without checking profiling) is that the limits are:

task runtime too high (ppc >= 1600)
task runtime limitation (100% util, 800/400 ppc)
task runtime too low/too many tasks (200 ppc)
Cost of reductions (?)

I'm setting a profile running with 400 ppc to hopefully try to find what is limiting the performance. Notably DL_MESO takes <0.6s per step for this testcase - I'm planning to set a run of DL_MESO on scafell pike to find the main hotspots also.

LonelyCat124 commented 3 years ago

Ok, so at a basic high level, the dl_meso executable breaks down into: 46% forces_mdvv 31% diff (Check what this does) 7% initialize

I realised that I messed up and didn't compile with -g, so rerunning to get more detail.

LonelyCat124 commented 3 years ago

Ok, so having another look there are some clear performance issues outside of the kernel performance themselves in the DSL.

compute_new_dests and other tradequeue operations have massively low CPU utilisation. These are not currently index launched, so we can see if thats possible.
per_part_tasks are in general significantly lighter than their pairwise counterparts. This also results in low CPU utilisation, despite index launches. It might be ideal to find a way to group these tasks into cell-groups. This probably requires some more colour based partitions, and changes to the mapper to be able to use though.

One thing that could potentially also be bad for these tasks for full simulations is the mapper just treats both in some "default" way, which could mean no reuse. Another thing to look into.

As for the main workload tasks (pairwise operations), the CPU utilisation is high for these. What I believe to be the issue is computing an n^2 operation for 400 ppc (plus tradequeue stuff) is way too many. DL_MESO uses 68x68x68 cells for this operation, which results in 1.6 particles per cell. I've set a run going with DL_MESO for 17x8x8 to see what happens, and you see massively worse performance, with runtime growing dramtically, with ~60s for 10 timesteps (instead of the 2-3s seen with its desired cell counts).

On the other hand, SWIFT splits cells down to under 400 ppc, and gets great performance, but has significantly better algorithms than are implemented here, which allows task performance to be improved significantly. Further SWIFT does not have to launch tasks in the same way, i.e. it doesn't have to deal with data location, creation of task objects or computing dependencies dynamically. These costs are paid periodically at rebuilds, instead of during the main task runtime. This means tasks can be lighter weight than the tasks in the DSL (though perhaps with tracing this is not true).

Few things to think about just from that on how I might want to improve performance. Being able to do task grouping, particle sorting & other solutions could significantly improve performance potentially.

LonelyCat124 commented 3 years ago

Need to update the safe_to_combine code to take into account reads/writes/reductions to config spaces.

I think any write or reduction operator to a config should disable combining.

LonelyCat124 commented 3 years ago

Previous comment only holds if the following kernel makes use of that data.

LonelyCat124 commented 3 years ago

So the previous comments have been implemented now, and the latest commits are aiming at potential performance improvements by merging some launches to be on larger spatial regions instead of individual cells. This is mostly aimed at lighter weight tasks (such as compute_new_dests and PER_PART tasks) which are very light individually.

LonelyCat124 commented 3 years ago

Ok - the new merged launches of those specific types looks like it is a significant improvement for those sections (overall minimally noticable). I think the main thing is still to reduce either the cell sizes or computation required per cell, but I want to merge this before looking at that in its own branch. I'm going to modularise the read and write modules and merge this branch

stfc / RegentParticleDSL

WIP: Dl meso io and base implementation #85