davydden opened this issue 7 years ago
Some points that came to my mind:
`std::multimap<cell, Molecule>` is not a good choice; two options arise:

`boost::container::flat_map<cell, std::vector<Molecule>>` [1] or some such (some of the requirements put up by Chandler Carruth in his CppCon 2014 talk have been addressed, and it shows a significant performance improvement; there are other contenders like `google::dense_hash_map` and `ska::flat_hash_map`, but I suppose Boost's `flat_map` would be a better first choice given that it is maintained).

Footnote [1]: In a discussion here, Bangerth suggests that a map of cells to a `std::vector` of particles is a bad idea because we need to resize the vector, but in our application we do not resize it often; it is actually built once and we then traverse the map or find elements in it. Given what we know about the poor performance of `std::map` and `std::multimap`, I can implement the `boost::container::flat_map<cell, std::vector<Molecule>>` [2] version of QC fairly quickly.

Footnote [2]: It would be `boost::container::flat_map<cell, std::unique_ptr<std::vector<Molecule>>>`; it is suggested (in Chandler's talk) that both the key_type and value_type be small, like a `std::unique_ptr`.
Yet another alternative is to have a CSR-like data structure, i.e.

```cpp
std::vector<unsigned int> start_of_molecules_on_cell (n_cells+1);
std::vector<Molecule>     all_molecules;
```

Then we are guaranteed to have a contiguous data layout, and iterating over neighbors is not too bad. You can always ask for a cell's index and thus know where its molecules start.
But I agree, once you have some relatively large representative example for the paper, we can start profiling and fiddling with data structures.
p.s. I spent the whole day today reworking data structures elsewhere from `std::map` to something similar; hopefully we will see better performance as a result...
Two more links I have on a related issue:
@vishalkenchan I came across this talk https://moodle.rrze.uni-erlangen.de/pluginfile.php/13826/mod_resource/content/1/07_SIMD_JT.pdf — there is a section at the end about LAMMPS; maybe some parts will be useful for us as well. Have a look.
I'll have a look next week. I'm in Hamburg for a short vacation.
Verify performance and scalability by checking the scaling of different steps of the algorithm on various problem sizes.
Do weak and strong (fixed problem size) scaling over three orders of magnitude of model size, preferably with ~10^7 particles and processes from 12 to ~10^4. Measure wall time for all operations. Plot wall-clock time vs. cores.
Weak scaling -- keep the number of DoFs/particles per core fixed. Plot wall-clock time vs. cores. Could be tricky to set up for QC (see below).
The test problem is a square domain of crystal generated using the following algorithm: refine the original single cell once, take the cell in the lower-left corner, and repeat until it is refined to the fully resolved atomistic level, assuming that the domain size is proportional to the unit cell width.
Parts of the algorithm:
Call each part N=10 times and measure the average time.