liangjg opened 5 years ago
Thanks for the analysis! Definitely looks like we need to dig into this some more now to figure out why the performance looks worse. Are you planning on doing any profiling or should I have someone here at ANL look into it? Just want to make sure we're not duplicating efforts.
@paulromano I'll try to do some profiling later this week.
I just did some quick profiling, and it looks like `xt::any` in mesh.cpp eats up a lot of the runtime. I'm looking at addressing that now.
Also, just to chime in with some things that only recently became clear to me:
1) Threads need to be pinned appropriately. With KMP_AFFINITY for the Intel compiler, e.g. `export KMP_AFFINITY="explicit,proclist=[0,1,4,5],verbose"`
2) In parallel, use `--report-bindings` to make sure your MPI tasks are going where they should.

1) is the most important, I think; otherwise you take a big hit from threads moving from CPU to CPU. I hope this is useful; I'm still going down the rabbit hole on some of my own issues.
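The pinning advice above can be sketched as a shell snippet. The proc IDs, thread count, and the `mpirun` invocation below are placeholders for illustration, not values taken from this thread:

```shell
# Pin 8 OpenMP threads explicitly with the Intel OpenMP runtime.
# Proc IDs here are placeholders; pick ones matching your core -> proc map.
export OMP_NUM_THREADS=8
export KMP_AFFINITY="explicit,proclist=[0,1,2,3,4,5,6,7],verbose"

# For MPI runs, Open MPI can report where each rank lands
# (hypothetical invocation; the binary name is an example):
#   mpirun --report-bindings -np 4 ./openmc

echo "KMP_AFFINITY=$KMP_AFFINITY"
```

With `verbose` set, the Intel runtime prints each thread's binding at startup, which is an easy way to confirm the pinning took effect.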
Yes, process/thread binding can have an impact on performance. @liangjg Can you comment on whether you were using process/thread binding for these runs?
@paulromano @makeclean Good point, I didn't use binding options for these runs. I'll try printing the bindings and will update the results later if necessary. But the serial runs already show the tally issue, and the no-tally cases show that MPI and OpenMP have similar performance, which suggests their bindings should be the same.
Thanks @smharper for making such a quick fix. An update about process/thread binding: for the 2x16 hyper-threading CPUs, the core-to-proc-ID mapping is:
```
socket 0
  core 0  - {0, 16}
  core 1  - {1, 17}
  ...
  core 7  - {7, 23}
socket 1
  core 8  - {8, 24}
  ...
  core 15 - {15, 31}
```
Uniform topology is used by default, i.e., each of the 8 threads is bound to the total proc set:

```
KMP_AFFINITY: pid 11018 tid 11018 thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
```
I tried three other modes:

- threads 0-7 bound to procs {0}, {16}, {1}, {17}, {2}, {18}, {3}, {19}
- threads 0-7 bound to procs {0}, {8}, {16}, {24}, {1}, {9}, {17}, {25}
- threads 0-7 bound to procs {0}, {2}, {4}, {6}, {8}, {10}, {12}, {14}
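For the Intel runtime, the three modes above can be expressed as explicit proclists (thread i is bound to the i-th proc in the list). This is a sketch; the `MODE*` variable names are mine, and only one list can be exported per run:

```shell
# Explicit proclists matching the three tested bindings on the
# 2x16 hyper-threaded machine described above.
MODE1="explicit,proclist=[0,16,1,17,2,18,3,19]"   # HT sibling pairs, socket 0 only
MODE2="explicit,proclist=[0,8,16,24,1,9,17,25]"   # spread across both sockets
MODE3="explicit,proclist=[0,2,4,6,8,10,12,14]"    # every other physical core

export KMP_AFFINITY="${MODE1},verbose"
echo "KMP_AFFINITY=$KMP_AFFINITY"
```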
Here are the results (develop branch, 8 threads, mesh 100x100 tally): gcc:
So it is preferable to pin the threads closely for both compilers; it can improve performance by nearly 20%. However, it still cannot fix the poor parallel efficiency of multi-threaded runs compared to serial runs (less than 50% for gcc and less than 10% for intel).
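For reference, the efficiency percentages quoted above follow from comparing the threaded tracking rate against n times the serial rate, assuming the standard definition (speedup divided by thread count). A quick sketch of the arithmetic with made-up rates, not the actual measurements from this thread:

```shell
# Parallel efficiency = parallel_rate / (threads * serial_rate).
# The rates below are hypothetical, purely to show the arithmetic.
serial_rate=10000     # particles/s with 1 thread
parallel_rate=40000   # particles/s with 8 threads
threads=8
awk -v s="$serial_rate" -v p="$parallel_rate" -v n="$threads" \
    'BEGIN { printf "efficiency = %.0f%%\n", 100 * p / (n * s) }'
```

Here 40000 / (8 x 10000) gives 50% efficiency, i.e., a 4x speedup on 8 threads.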
So the with-meshtally tracking rates have been improved significantly; they are now only 10% slower than the fortran/master version (for the gcc compiler).
Profiling shows the remaining hotspots are `MeshFilter::get_all_bins()` and `FilterBinIter::operator++`. These two functions take about twice as long as the previous Fortran implementations; check the slides for details.
Following the performance tests in #1171 and #1114, I performed more thorough tests with the following simulation configurations:
Here are the results: