NemoLite2D benchmarking results on HPC-level hardware

LonelyCat124 commented 4 years ago

I think I (and @sergisiso ?) may be reaching the point where we're reaching some level of maturity in the newer versions of the NemoLite2D, and I think it would be best for us to begin to collate results. As previously discussed I think an idea was a paper discussing comparative benchmarks of various parallel systems applied to the NemoLite2D benchmark.

As far as I understand it, we have the following versions: Manual versions:

Fortran OpenMP
Fortran serial
Fortran/C OpenCL (?)
Regent
C++ OpenMP
C++ Kokkos

PSYclone generated:

OpenMP
OpenACC

My proposal would be then the following benchmarking results:

[ ] Fortran OpenMP version with gcc9 on Skylake
[ ] Fortran OpenMP version with intel/? on Skylake
[ ] Fortran Serial version with gcc9 on Skylake
[ ] Fortran Serial version with intel/? on Skylake
[ ] Regent version on Skylake
[ ] C++ OpenMP version with gcc9 on Skylake
[ ] C++ OpenMP version with Intel on Skylake
[ ] C++ Kokkos version on Skylake
[ ] PSYclone OpenMP generated version on Skylake
[ ] OpenCL version on appropriate hardware (@arporter) - I assume we want both the GPU in Glados and on ScafellPike? FPGA also an option.
[ ] OpenACC version on GPU
[ ] PSYclone generated OpenACC version on GPU

I think what we should record for each set of results is:

git hash for the commit
Compiler version & Compile flags used (where appropriate, e.g. Regent version will only have the flags passed to the install.py)
Hardware (should be simple)
Runtimes at various threadcounts, I'd think at least 1/2/4/8/16/32 for CPUs, and strong scalability plots/parallel efficiency plots corresponding to those results. I would expect the scalability results should be with respect to the faster of the serial version or OpenMP version for the given compiler where appropriate (or plots corresponding to both). For GPUs I assume we should run various tests and just show the best achievable runtimes and the appropriate data for reproducability.
Runtime parameters should be tested where appropriate too - e.g. OpenMP Schedule options.

Does this all seem reasonable?

arporter commented 4 years ago

Currently PSyclone can generate an OpenACC version. In both OpenACC and OpenMP we have yet to really 'go to town' to see how well we can do - we've only done fairly vanilla implementations. If we're going to write a paper then, in keeping with our self-proclaimed "era of performance" we will want to do better! (i.e. we don't want "it works but it's slow".)

We have a manual MPI version working. No PSyclone support for that yet (it's on @rupertford's list :-) ).

GPU-wise, I think SKFP and glados have the same V100s and therefore we can just use one or other of them. Currently the OpenCL we generate is aimed at FPGA. There may be some infrastructure work to do in order to get it working on the GPU (although, now I come to think about it, I think @sergisiso has run on a GPU recently so we may be OK).

Finally, and slightly bigger-picture, we need to think how this relates to 'the PSyclone paper' that we've been threatening to write for about 2 years... It feels like there's a lot to discuss...

sergisiso commented 4 years ago

Leaving the paper considerations aside, I think being able to generate this performance snapshots programatically were it records the commit/compiler-version/architecture would be very useful and I have been trying to start this with the common makefile infrastructure. In #37 the compiler column of make summary also includes the version.

I have been experimenting with providing architecture details in the table as well. But the table has some limitations in the number of fields it remains readable. So we may need something else that store big tables in a file (with flags, parameters, ...)

Regarding point 4, it will be good to add is a Makefile rule or a common script to provide scalability tables (which also include the mentioned parameters) and maybe adding PU, something like:

Implementation         PU   Compiler        Arch    ua checksum     uv checksum     time/step
psykal_omp:             1      gcc-7.4        cpu     0.41022150E-01  0.50252378E+01  0.35053E-01
psykal_omp:             2      
psykal_omp:             4      
psykal_omp:             8

And then a gnuplot can easily draw plots form some of this tables. @LonelyCat124 If that will be useful for you we can coordinate this work.

Regarding OpenCL, it can run on CPU, GPU and FPGA, and return correct results, but I am not claiming yet that it does a sensible thing in each platform :)

LonelyCat124 commented 4 years ago

That would probably be useful for most of these - I could probably do something like that even for Regent, though I don't have the checksums implemented yet (I probably should do this soon). Maybe a script would be easier? If we always save the executable to nemolite2d.exe (or similar, which regent can also do now I worked out how) we could point it at a directory and it could run the appropriate executable, and move/create namelists accordingly for different sizes too.

I'm personally not a fan of gnuplot plots (vs matplotlib) 😂 but happy to go with them if everyone else prefers them.

arporter commented 4 years ago

I like gnuplot for its ubiquity and speed of use. I've never managed to get paper-quality images out of it though so am happy to use something else. (I like xmgrace but that's showing my age.) Although I'm all in favour of automation where possible, I don't think we should get too hung up on it if it proves complicated (especially once batch systems become involved). The key thing is to capture all the necessary data in one place and in a format we can plot.

LonelyCat124 commented 4 years ago

I think with python it could be pretty straightforward to have something that you can send into bsub/qsub/whatever and go from there (as opposed to running on the top level), or even just a bash script for most of it except maybe plotting depending on whats used. Once I've finished my non-ECP project properly I could have a go if noone else wants to bite the bullet

stfc / PSycloneBench

NemoLite2D benchmarking results on HPC-level hardware #41