Closed LonelyCat124 closed 3 years ago
Added in most of the output routines (untested) - the remaining functions are write_output_result onwards.
nspec currently doesn't seem to be set in read_field. Edit: Looks like I just missed this. Line 257 in my file. Line 2061 in dl_meso's read_module.F90
Hoping to have the initialisation function done today - that required WAY more code than expected.
Ok, so a whole bunch of files have now been added to create the initialisation function.
This doesn't quite yet work, as certain things are not being read in correctly. Notably:
species data:

```
        population  mass          charge        frozen
PLE1    0           1.000000e+00  0.000000e+00  F
PLE2    0           1.000000e+00  0.000000e+00  F
PLE3    0           1.000000e+00  0.000000e+00  F
```

The populations should be 333, 333 and 334.
Following from this, the energy parameters are also incorrect (all 0 when they shouldn't be, but this could be due to the previous issue).
Dissipative parameters are also incorrect, as non-self interactions have viscosity 0 (instead of 1) and there are no "random force" values.
Finally the statistics are all -nan - however this is likely due to there being no particles with values in the system.
Edit: energy parameters now have values but are still wrong.
Ok, so this afternoon I've fixed most of the bugs with the initialisation of the system - the remaining bugs I'm certain of are: disx and disy being out by a factor of 100, while disz is just wrong - these values should all be the same, at ~1.0322801154563670.
Initial temperature scaling is now fixed so things look sensible. However the virial and pressure do not yet match - I believe these are independent from the random number generation, so I'll look into this next.
Particle with id 0 exists - check why
Having some issues with IO functions that haven't yet been used (e.g. HISTORY file). However the implementation of science seems approximately correct now!
Ok, I found the bug I think, I wasn't checking to only output "valid" particles.
As part of the solution, all neighbour search algorithms must provide a function as follows:

```
__demand(__inline)
task neighbour_init.check_valid(ns : neighbour_part)
  return "BOOLEAN"
end
```

The structure of this could still change - however I believe requiring the task to be inlined should avoid it being costly. For neighbour search algorithms that don't create "extra" particles this can just be `return true`; for the tradequeues it checks `ns._valid`.
As a note - the dumps to the OUTPUT file are not guaranteed to be correct now; particularly in write_output_result we should check for validity.
Ok, initial profiling of the DL_MESO implementation.
First thing - each timestep is ~1.5s, and avg CPU usage is very low.
One possible issue with this testcase is that the particle count may just be way too low for this methodology/algorithm implementation. This case has 1000 particles total, which is small enough that a single cell per timestep would probably be manageable. I am going to double-check how the performance of a naive n^2 implementation behaves if I get the chance. (Note that the code currently aims for ~800 particles per cell (ppc), so 1000 particles is roughly a single cell, and we'd expect the n^2 version to take about 1ms to run.)
The other thing to test is the Mixture_Large testcase. That example features approximately half a million particles, so would likely be much more useful - it also takes ~1s/step for DL_MESO itself (around 30 minutes total, running 3k steps).
Ok, so for the Mixture_Small testcase the most naive n^2 implementation takes ~5ms/step (compared to dl_meso's 0.4/step), which is so much better than before that it's safe to conclude that for such small examples we need more trivial algorithms. The main search loop takes ~3.5ms, which given the naive algorithm seems fine to me.
I plan to run the Mixture_Large soon therefore to try to get better performance from the real algorithms.
2D array accesses in Terra are reversed: where in C you would write a[i][j], in Terra you need a[j][i]. This is wrong throughout the DL_MESO implementation and may cause issues.
Ok, so performance. Right now performance is pretty bad - notably, for the large testcase the memory allocation for the reductions is dominating the actual runtime. Neither the CPU nor utility processors have high % utilisation - is memory allocation the bottleneck?
Need to think about mapping and if we can solve this in some way.
Mapping:

- Machine model - the mapper needs to know what processors/memories exist, and their layout relative to each other. Right now on a single node we probably don't need to play around with this.
- Processors - LOC (CPU), TOC (GPU), UTILITY (utility cores, LOC for the runtime), IO (for IO processes - could be useful later?), PROC_SET (processor groups, e.g. NUMA regions), OPENMP (not used yet in the DSL).
- Memories - GLOBAL (full machine memory for GASNet), SYSTEM (local system RAM), FRAME_BUFFER/ZERO_COPY (for GPUs), DISK (can be used for out-of-core)/HDF5.
- Affinities - Processor->Memory, Memory->Memory, Memory->Processor - these are just lists of affinities.
Task Variants - multiple variants of tasks can exist, e.g. GPU/CPU implementations. Not supported yet in DSL. Variants can have associated constraints.
A physical instance is a copy of the data corresponding to a region. There can be 0, 1 or many instances of a region at any time. These can be valid or invalid, live in a specific memory, have a specific layout, are allocated explicitly by the mapper and are garbage collected by the runtime. Multiple physical instances of a region can exist simultaneously, including DIFFERENT VERSIONS of the same data.
Two stages to mapper - first logical analysis, followed by decision on whether to make a copy of the data to break dependencies.
Mapper calls: Picking a processor:
Layout constraints: tasks can have layout constraints on physical instances, e.g. "this task requires data in row-major order". Constraints let you do that without specifying an exact layout, and multiple instances may satisfy the constraints.
Selecting Physical instances:
Virtual Mappings: Mapper can choose to map a region to no instance if the task does not use the region itself.
Index Launches:
Control replication:
Fixed array indexing issue.
Tried implementing an index launched version for asymmetric tasks, currently it doesn't appear to be running on scafellpike at all (not building) so I need to work out what I messed up.
Fixed the issue. Asymmetric tasks aren't generally supported in most neighbour search systems right now, only the HP one.
Ok, so it looks like c18749d will need to be reverted (it broke everything). Array indexing is still an issue to be solved, though it's possible it could be solved by reversing the indices when defining arrays (testing this now). It's fixed, and will be pushed soon.
Ok, I pushed some new things.
Array indexing issue is fixed.
First mappers have been pushed. The tradequeue mapper is just the default mapper from Legion and I don't plan to change much about it at the current time.
The high_perform mapper aims to do the following:
At this point, map_task supposedly tries to reuse reduction instances for the same processor. However this is quite possibly broken. I need to dive into default_create_custom_instance and friends to check they don't do something to override what map_task wants to do.
I also need to ensure that the GC doesn't just delete the reduction instances as soon as they're done with.
The new map_task is much better on memory usage; still, some performance issues remain.
The upcoming commit adds functionality to resolve #86
Added some additional documentation.
The remaining 3 points are ok to leave until I have more time to work on this sector. This is ready to merge if/when other projects start that might require the feature set, but for now will be left open to continue performance enhancement.
Manually squashed commits to attempt to better explain the processes done during this pull request.
This branch is now good to go, but is waiting on any of the following criteria before merging:
In terms of performance, I tried (without tracing) different cell sizes on the large dl_meso volume for 2 timesteps:

```
ppc >= 800,   8x8x8   cells, 189.9s
ppc >= 1600,  8x8x4   cells, 334.3s
ppc >= 400,   17x8x8  cells, 178.6s
ppc >= 200,   17x17x8 cells, 750.4s
```
My assumption at this point (without checking profiling) is that the limits are:
I'm setting a profile running with 400 ppc to hopefully try to find what is limiting the performance. Notably DL_MESO takes <0.6s per step for this testcase - I'm planning to set a run of DL_MESO on scafell pike to find the main hotspots also.
Ok, so at a basic high level, the dl_meso executable breaks down into:

- 46% forces_mdvv
- 31% diff (check what this does)
- 7% initialize
I realised that I messed up and didn't compile with -g, so rerunning to get more detail.
Ok, so having another look, there are some clear performance issues in the DSL outside of the kernels themselves:

- compute_new_dests and other tradequeue operations have very low CPU utilisation. These are not currently index launched, so we can see if that's possible.
- per_part_tasks are in general significantly lighter than their pairwise counterparts. This also results in low CPU utilisation, despite index launches. It might be ideal to find a way to group these tasks into cell groups. This probably requires some more colour-based partitions, and changes to the mapper to be able to use them. One thing that could also be bad for these tasks in full simulations is that the mapper just treats both in some "default" way, which could mean no reuse. Another thing to look into.
As for the main workload tasks (pairwise operations), the CPU utilisation is high. What I believe to be the issue is that computing an n^2 operation for 400 ppc (plus tradequeue stuff) is way too many pairs. DL_MESO uses 68x68x68 cells for this operation, which results in 1.6 particles per cell. I've set a run going with DL_MESO at 17x8x8 to see what happens, and it shows massively worse performance, with runtime growing dramatically: ~60s for 10 timesteps (instead of the 2-3s seen with its desired cell counts).
On the other hand, SWIFT splits cells down to under 400 ppc, and gets great performance, but has significantly better algorithms than are implemented here, which allows task performance to be improved significantly. Further SWIFT does not have to launch tasks in the same way, i.e. it doesn't have to deal with data location, creation of task objects or computing dependencies dynamically. These costs are paid periodically at rebuilds, instead of during the main task runtime. This means tasks can be lighter weight than the tasks in the DSL (though perhaps with tracing this is not true).
Few things to think about just from that on how I might want to improve performance. Being able to do task grouping, particle sorting & other solutions could significantly improve performance potentially.
Need to update the safe_to_combine code to take into account reads/writes/reductions to config spaces.
I think any write or reduction operator to a config should disable combining.
Previous comment only holds if the following kernel makes use of that data.
So the previous comments have been implemented now, and the latest commits aim at potential performance improvements by merging some launches to run on larger spatial regions instead of individual cells. This is mostly aimed at lighter-weight tasks (such as compute_new_dests and PER_PART tasks) which are very light individually.
Ok - the new merged launches for those specific task types look like a significant improvement for those sections (though overall the gain is minimally noticeable). I think the main thing is still to reduce either the cell sizes or the computation required per cell, but I want to merge this before looking at that in its own branch. I'm going to modularise the read and write modules and merge this branch.
TODO list:

- Change strcmp to strncmp to match exactly the DL_MESO fortran string comparisons.
- Use regentlib.assert(false, ...) to throw an error when encountering a not yet supported keyword in the input files.
- Would be nice to also get documentation before merging, but may be tight on the timescale I want to do this in...