liangjg opened 5 years ago
Thanks for the analysis! Definitely looks like we need to dig into this some more now to figure out why the performance looks worse. Are you planning on doing any profiling or should I have someone here at ANL look into it? Just want to make sure we're not duplicating efforts.
@paulromano I'll try to do some profiling later this week.
I just did some quick profiling, and it looks like `xt::any` in mesh.cpp eats up a lot of the runtime. I'm looking at addressing that now.
Also, just to chime in with some things that only recently became clear to me:
1) Threads need to be pinned appropriately. With KMP_AFFINITY for the Intel compiler, e.g. `export KMP_AFFINITY="explicit,proclist=[0,1,4,5],verbose"`
2) In parallel, use `--report-bindings` to make sure your MPI tasks are going where they should.

1) is the most important, I think; otherwise you take a big hit from threads moving from CPU to CPU. I hope this is useful; I'm still going down the rabbit hole on some of my own issues.
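The pinning advice above can be sketched as a shell snippet. The proc IDs, thread count, and the `mpirun` invocation below are placeholders for illustration, not values taken from this thread:

```shell
# Pin 8 OpenMP threads explicitly with the Intel OpenMP runtime.
# Proc IDs here are placeholders; pick ones matching your core -> proc map.
export OMP_NUM_THREADS=8
export KMP_AFFINITY="explicit,proclist=[0,1,2,3,4,5,6,7],verbose"

# For MPI runs, Open MPI can report where each rank lands
# (hypothetical invocation; the binary name is an example):
#   mpirun --report-bindings -np 4 ./openmc

echo "KMP_AFFINITY=$KMP_AFFINITY"
```

With `verbose` set, the Intel runtime prints each thread's binding at startup, which is an easy way to confirm the pinning took effect.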
Yes, process/thread binding can have an impact on performance. @liangjg Can you comment on whether you were using process/thread binding for these runs?
@paulromano @makeclean Good point, I didn't use binding options for these runs. I'll try printing the bindings and will update the results later if necessary. But the serial runs already show the tally issue, and the no-tally cases show that MPI and OpenMP have similar performance, which suggests their bindings should be the same.
Thanks @smharper for making such a quick fix. An update about process/thread binding: for the 2x16 hyper-threading CPUs, the core-to-proc-ID mapping is:
```
socket 0
  core 0  - {0, 16}
  core 1  - {1, 17}
  ...
  core 7  - {7, 23}
socket 1
  core 8  - {8, 24}
  ...
  core 15 - {15, 31}
```
Uniform topology is used by default, i.e., each of the 8 threads is bound to the total proc set:

```
KMP_AFFINITY: pid 11018 tid 11018 thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
```
I tried three other modes:

- threads 0-7 bound to procs {0}, {16}, {1}, {17}, {2}, {18}, {3}, {19}
- threads 0-7 bound to procs {0}, {8}, {16}, {24}, {1}, {9}, {17}, {25}
- threads 0-7 bound to procs {0}, {2}, {4}, {6}, {8}, {10}, {12}, {14}
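For the Intel runtime, the three modes above can be expressed as explicit proclists (thread i is bound to the i-th proc in the list). This is a sketch; the `MODE*` variable names are mine, and only one list can be exported per run:

```shell
# Explicit proclists matching the three tested bindings on the
# 2x16 hyper-threaded machine described above.
MODE1="explicit,proclist=[0,16,1,17,2,18,3,19]"   # HT sibling pairs, socket 0 only
MODE2="explicit,proclist=[0,8,16,24,1,9,17,25]"   # spread across both sockets
MODE3="explicit,proclist=[0,2,4,6,8,10,12,14]"    # every other physical core

export KMP_AFFINITY="${MODE1},verbose"
echo "KMP_AFFINITY=$KMP_AFFINITY"
```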
Here are the results (develop branch, 8 threads, mesh 100x100 tally): gcc:
So it is preferable to pin the threads closely for both compilers; it can improve performance by nearly 20%. However, it still cannot fix the poor parallel efficiency of multi-threaded runs compared to serial runs (less than 50% for gcc and less than 10% for intel).
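For reference, the efficiency percentages quoted above follow from comparing the threaded tracking rate against n times the serial rate, assuming the standard definition (speedup divided by thread count). A quick sketch of the arithmetic with made-up rates, not the actual measurements from this thread:

```shell
# Parallel efficiency = parallel_rate / (threads * serial_rate).
# The rates below are hypothetical, purely to show the arithmetic.
serial_rate=10000     # particles/s with 1 thread
parallel_rate=40000   # particles/s with 8 threads
threads=8
awk -v s="$serial_rate" -v p="$parallel_rate" -v n="$threads" \
    'BEGIN { printf "efficiency = %.0f%%\n", 100 * p / (n * s) }'
```

Here 40000 / (8 x 10000) gives 50% efficiency, i.e., a 4x speedup on 8 threads.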
So the with-meshtally tracking rates have been improved significantly; they are now only 10% slower than the fortran/master version (for the gcc compiler).
Profiling shows the remaining hotspots are `MeshFilter::get_all_bins()` and `FilterBinIter::operator++`. These two functions take about twice as long as the previous Fortran implementations; check the slides for details.
Following the performance tests in #1171 and #1114, I performed more thorough tests with the following simulation configurations:
Here are the results: