openmc-dev / openmc

OpenMC Monte Carlo Code
https://docs.openmc.org

Performance tests #1184

Open · liangjg opened this issue 5 years ago

liangjg commented 5 years ago

Following the performance tests in #1171 and #1114, I performed a more thorough set of tests with the following simulation configurations:

Here are the results:

[screenshot: performance results, 2019-03-03]
paulromano commented 5 years ago

Thanks for the analysis! Definitely looks like we need to dig into this some more now to figure out why the performance looks worse. Are you planning on doing any profiling or should I have someone here at ANL look into it? Just want to make sure we're not duplicating efforts.

liangjg commented 5 years ago

@paulromano I'll try to do some profiling later this week.

smharper commented 5 years ago

I just did some quick profiling, and it looks like xt::any in mesh.cpp eats up a lot of runtime. I'm looking now at addressing that.
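
For context, a reduction like xt::any called once per particle can add noticeable overhead compared with a hand-written loop over a small fixed-size array. A minimal sketch of that kind of rewrite, using a hypothetical bounds-check helper (this is not the actual mesh.cpp code):

    #include <array>

    // Hypothetical hot-path check: "is any mesh index outside the mesh?"
    // A plain loop over a small fixed-size array avoids building an xtensor
    // expression and reducing it with xt::any for every particle.
    bool any_index_outside(const std::array<int, 3>& ijk,
                           const std::array<int, 3>& shape)
    {
      for (int d = 0; d < 3; ++d) {
        if (ijk[d] < 0 || ijk[d] >= shape[d]) return true;
      }
      return false;
    }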

makeclean commented 5 years ago

Also, just to chime in with some things that only recently became clear to me:

1) Threads need to be pinned appropriately; with the Intel compiler/runtime this is controlled by KMP_AFFINITY, e.g. export KMP_AFFINITY="explicit,proclist=[0,1,4,5],verbose" (roughly equivalent settings for gcc are sketched at the end of this comment).
2) When running in parallel, report the bindings (e.g. --report-bindings) to make sure your MPI tasks are going where they should.

1) is the most important, I think; otherwise you take a big hit from the threads moving from CPU to CPU.

I hope this is useful; I'm still going down the rabbit hole on some of my own issues.
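
For completeness, here is a sketch of roughly equivalent settings when building with gcc (GNU OpenMP runtime) and running under Open MPI; the exact variables and flags depend on the OpenMP runtime and MPI implementation in use, so treat these as illustrative rather than the commands used for the results in this thread:

    # Standard OpenMP 4.0 environment variables (GNU and Intel runtimes):
    export OMP_NUM_THREADS=8
    export OMP_PROC_BIND=close   # bind threads, keeping them close together
    export OMP_PLACES=cores      # one place per physical core

    # GNU-runtime-specific alternative to KMP_AFFINITY:
    export GOMP_CPU_AFFINITY="0 1 4 5"

    # With Open MPI, bind ranks to cores and print where they land:
    mpirun -np 2 --bind-to core --report-bindings ./openmc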

paulromano commented 5 years ago

Yes, process/thread binding can have an impact on performance. @liangjg Can you comment on whether you were using process/thread binding for these runs?

liangjg commented 5 years ago

@paulromano @makeclean Good point, I didn't use binding options for these runs. I'll try printing the bindings and will update the results later if necessary. But the serial runs already show the tally issue, and the no-tally cases show that MPI and OpenMP have similar performance, which may suggest their bindings were effectively the same.

liangjg commented 5 years ago

Thanks @smharper for making such a quick fix. Just an update about process/thread binding: for the 2x16 hyper-threading CPUs, the core-to-proc-ID mapping is:

socket 0
  core 0 - {0, 16}
  core 1 - {1, 17}
  ...
  core 7 - {7, 23}
socket 1
  core 8 - {8, 24}
  ...
  core 15 - {15, 31}
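
Given that mapping, pinning 8 threads to the physical cores of socket 0 (avoiding the hyperthread siblings 16-23) could look like the following. This is just an illustration of a "close" pinning, not necessarily the exact settings used for the results below:

    # Intel OpenMP runtime: one thread per physical core on socket 0
    export KMP_AFFINITY="explicit,proclist=[0,1,2,3,4,5,6,7],verbose"

    # GNU OpenMP runtime equivalent:
    export GOMP_CPU_AFFINITY="0-7"

    # Runtime-agnostic, using the standard OpenMP variables:
    export OMP_PLACES="{0},{1},{2},{3},{4},{5},{6},{7}"
    export OMP_PROC_BIND=close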

Here are the results (develop branch, 8 threads, 100x100 mesh tally) for gcc:

So it is preferable to pin the threads closely for both compilers; it can improve performance by nearly 20%, but it still does not change the poor parallel efficiency of multi-threaded runs compared to serial runs (less than 50% for gcc and less than 10% for intel).

liangjg commented 5 years ago

Updated performance testing with the newer version and profiling

[screenshot: updated performance results and profiling, 2019-03-07]

So the tracking rates with mesh tallies have improved significantly and are now only about 10% slower than the Fortran (master) version (for the gcc compiler).