I made a few quick plots because I want them for the talk and because I could reuse the benchmarking/HPC infrastructure I already had from last year. Eventually this stuff could be cleaned up and integrated by @Andykyu12 into his infrastructure.
https://github.com/madgraph5/madgraph4gpu/blob/51d7f52bf34d7ce35533cdb9a1bc67daa6ef4ee7/tools/benchmarking/plots/pmpe-nosimd.png
https://github.com/madgraph5/madgraph4gpu/blob/51d7f52bf34d7ce35533cdb9a1bc67daa6ef4ee7/tools/benchmarking/plots/pmpe-simd.png
These are VERY PRELIMINARY - I am somewhat confident that the scaling with multiple copies is correct and not just an artefact of how these short scripts are run, but this should be thoroughly cross-checked...
The OMP multithreading, instead, is suboptimal. I had a quick look at numactl but it does not seem to improve things. I would maybe reconsider an alternative to OMP: the parallel region is such a basic single loop that we could do it by hand with custom threads (see the sketch below). Maybe we also need to split up the memory ourselves in advance across threads.
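For the record, here is a minimal sketch of what the "custom threads" alternative could look like. It is illustrative only: computeMatrixElement, computeAllEvents and the buffer layout are placeholders I made up, not the actual madgraph4gpu code. The event range is pre-split into one contiguous, disjoint chunk per thread, so each thread only ever touches its own slice of the output buffer:

```cpp
// Illustrative sketch only: replace the OMP loop by hand-rolled std::thread
// workers, each owning a pre-assigned contiguous chunk of the event range.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the real per-event matrix element computation.
double computeMatrixElement( const double* momenta, std::size_t ievt )
{
  return momenta[ievt]; // dummy body
}

void computeAllEvents( const double* momenta, double* matrixElements,
                       std::size_t nevt, unsigned nthreads )
{
  std::vector<std::thread> workers;
  workers.reserve( nthreads );
  const std::size_t chunk = ( nevt + nthreads - 1 ) / nthreads;
  for( unsigned t = 0; t < nthreads; ++t )
  {
    const std::size_t first = t * chunk;
    if( first >= nevt ) break;
    const std::size_t last = std::min( first + chunk, nevt );
    // Each worker writes only to its own [first, last) slice.
    workers.emplace_back( [=]() {
      for( std::size_t ievt = first; ievt < last; ++ievt )
        matrixElements[ievt] = computeMatrixElement( momenta, ievt );
    } );
  }
  for( auto& w : workers ) w.join();
}
```

The per-thread chunks could also be allocated, or at least first written, by the thread that will later use them, which is where the "split up the memory in advance" idea would come in.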
Anyway, the good news is:
PS The memory plots are just a sanity check here, not the main point of these plots - it is not clear where each application spends most of its memory, but it may well be in the allocation of the large pages.
PPS I forgot to mention on the plots: these are all 'check.exe -p 2048 256 40', taking around 20-30 seconds for each test.
I am closing this old ticket for eemumu OMP multithreading with an old software version.
I now have a newer software version and have reenabled OMP MT there. I still get suboptimal results, also for ggttgg, but the results get better as I process more and more events. Anyway, I will follow up in this ticket: #575
Closing.
So far I have been mainly checking OMP performance on my GPU-aware VM with 4 virtual cores. The throughput always seemed to scale more or less as expected, by almost a factor of 4.
See https://github.com/madgraph5/madgraph4gpu/commit/6f4916ec5552f57d8b58520f924c28e1e495673c, from itscrd70:
and
In other words, going to 4 threads increases:
So actually one issue can already be seen above: OMP scaling is suboptimal when SIMD is also enabled. This should be cross-checked, but I would not be surprised if some processor clock slowdown is also involved (check the clock...).
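To make "check the clock" concrete, one quick and dirty option is a small standalone helper like the sketch below (my own, not part of check.exe), assuming the cpufreq sysfs interface is available on the machine; tools like turbostat or cpupower give more detail. Run it while the test is going to see whether the cores are throttling:

```cpp
// Rough standalone helper (not part of check.exe): print the current clock
// of each core from the cpufreq sysfs interface, if it exists on the system.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
  for( int cpu = 0;; ++cpu )
  {
    const std::string path = "/sys/devices/system/cpu/cpu" + std::to_string( cpu )
                             + "/cpufreq/scaling_cur_freq";
    std::ifstream f( path );
    if( !f ) break; // no more cores, or no cpufreq support
    long khz = 0;
    f >> khz;
    std::cout << "cpu" << cpu << ": " << khz / 1000 << " MHz" << std::endl;
  }
  return 0;
}
```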
HOWEVER, the situation looks quite different on a machine with many more cores. I did the following tests on pmpe04 (16 physical cores with 2x hyperthreading), and @lfield also did some tests on a machine with 64 logical cores, getting similar results.
FIRST issue: the results are very unstable. This is with no SIMD; you can see that the result with 32 threads fluctuates wildly between 3.0E6 and 1.3E7 (more than a factor-4 fluctuation!). Something to do with NUMA?... (see the first-touch sketch below)
SECOND issue: the actual scaling is suboptimal. In the example above, the best speedup at 16 threads (the number of physical cores) is 7.9E6 divided by 8.9E5, which is less than a factor of 10, while I would expect a solid factor of 16 here...
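If NUMA effects are indeed part of the problem, one standard thing to try is "first touch" initialization. A minimal sketch of the idea (my own illustration with placeholder names and buffer sizes, not the actual code): allocate the large buffers uninitialized, then write them for the first time from the same statically-scheduled OMP loop layout that will later process them, so that each memory page ends up on the NUMA node of the thread that uses it:

```cpp
// Illustrative first-touch sketch (assumes the default Linux first-touch
// NUMA policy and a fixed static OMP schedule; names and sizes are made up).
#include <cstddef>
#include <memory>

int main()
{
  const std::size_t nevt = 524288;  // example: number of events per iteration
  const std::size_t npar4 = 16;     // example: doubles per event for momenta
  // new[] of double does not value-initialize, so the pages are not touched here.
  std::unique_ptr<double[]> momenta( new double[nevt * npar4] );
  std::unique_ptr<double[]> matrixElements( new double[nevt] );
#pragma omp parallel for schedule( static )
  for( std::size_t ievt = 0; ievt < nevt; ++ievt )
  {
    // First touch: each page is allocated on the NUMA node of the thread
    // that executes this iteration.
    matrixElements[ievt] = 0.;
    for( std::size_t i = 0; i < npar4; ++i ) momenta[ievt * npar4 + i] = 0.;
  }
  // The compute loop should then use the same schedule(static) and pinned
  // threads (e.g. OMP_PROC_BIND=close, OMP_PLACES=cores) so that each thread
  // keeps working on the memory it first touched.
  return 0;
}
```

Pinning the threads (OMP_PROC_BIND/OMP_PLACES, or numactl) and repeating the 32-thread run a few times might also show whether the factor-4 fluctuation correlates with where the memory landed.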
The numbers get better with a larger number of events, even without OMP (why?!...), but they are still suboptimal and fluctuating on pmpe04.
Compare to the same on itscrd70:
This is even more obvious with SIMD: 32 threads gain at most a factor of 6?...
TODO: