It's worth repeating the same test using the MPI engine. MPI uses message passing while the multi-threaded engine uses shared memory, so if the problem is memory contention, the MPI engine should work much better than the multi-threaded engine on the same machine.
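For what it's worth, a hypothetical sketch of such a back-to-back comparison on one machine, assuming an MPI-enabled openEMS build and a simulation already prepared for the MPI engine; the XML file name and the process/thread counts are placeholders:

```python
# Hypothetical comparison sketch: same simulation, shared-memory engine vs.
# MPI engine on the same machine. "sim.xml" and the counts are placeholders,
# and the MPI run assumes the simulation was set up for MPI decomposition.
import subprocess

# multi-threaded (shared-memory) engine: one process with 8 threads
subprocess.run(["openEMS", "sim.xml", "--numThreads=8"], check=True)

# MPI engine: 8 processes exchanging boundary data via message passing
subprocess.run(["mpirun", "-n", "8", "openEMS", "sim.xml"], check=True)
```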
Does the MPI engine support automatic domain decomposition like the multi-threaded one? I read the old example code and found that the simulation domain had to be decomposed explicitly in the code...
I would assume that you never use all available threads, as that would not be the most efficient way. And yes, in this case, if the threads are on cores that do not work well together, this could be bad. But that would depend on the CPU, of course.
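One way to take the scheduler out of the picture when testing this is to pin the benchmark to a fixed set of cores; a minimal Linux-only sketch, where the core IDs are placeholders that depend on the machine's topology (see lscpu):

```python
# Minimal sketch: restrict the benchmark to a fixed set of cores so its
# threads cannot be spread across cores that share resources poorly.
# Linux-only; the core IDs are placeholders for this machine's topology.
import os
import subprocess

physical_cores = {0, 1, 2, 3}            # placeholder core IDs
os.sched_setaffinity(0, physical_cores)  # child processes inherit this mask

subprocess.run(["python3", "MSL_Filter.py"], check=True)
```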
It was my observation that MPI really only brings any benefit if the simulation domain is huge. Otherwise the "slow" MPI communication very quickly becomes a much larger bottleneck than the memory speed. Thus I never really used or put much effort into MPI. The first and best approach for faster simulation should always be to create a better mesh.
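For context, a minimal sketch of what "create a better mesh" usually comes down to in the Python interface; the fixed lines and the 2 mm maximum resolution below are placeholder values, not recommendations. The cell count is set almost entirely by the fixed lines you add and the maximum resolution passed to SmoothMeshLines, and the printout at the end is just a check of the resulting size:

```python
# Sketch of controlling the cell count via the mesh definition (placeholder
# geometry); a coarser-but-adequate max resolution with smooth grading can
# cut the cell count dramatically before any engine-level tuning.
from CSXCAD import ContinuousStructure

CSX = ContinuousStructure()
mesh = CSX.GetGrid()
mesh.SetDeltaUnit(1e-3)                  # drawing unit: mm

# fixed lines on material edges (placeholder values)
mesh.AddLine('x', [-60, 0, 60])
mesh.AddLine('y', [-60, 0, 60])
mesh.AddLine('z', [0, 1.6, 50])

# fill in the rest of the grid with a max cell size and smooth grading;
# a ratio around 1.3-1.5 keeps neighbouring cells from differing abruptly
max_res = 2.0                            # mm, e.g. derived from lambda/20 at f_max
mesh.SmoothMeshLines('all', max_res, 1.4)

nx, ny, nz = (len(mesh.GetLines(d)) for d in 'xyz')
print(f"mesh: {nx} x {ny} x {nz} = {nx*ny*nz/1e6:.2f} M cells")
```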
It was my observation that MPI really only brings any benefit if the simulation domain is huge.
How huge are we talking? Even with optimized meshes, certain simulations can occupy 1-50M cells depending on the size of the model, so it would be interesting to have some idea of roughly when MPI begins to become useful.
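Not an answer to the crossover point, but a back-of-the-envelope sketch of the scale involved. Assuming roughly 18 single-precision values per cell (field components plus operator coefficients; the exact per-cell footprint of openEMS will differ), even 50M cells is only a few GB, which suggests the crossover is less about fitting the domain in RAM and more about memory bandwidth and update rate per node:

```python
# Rough memory estimate per FDTD cell; the 18-values-per-cell figure is an
# assumption for a basic field+coefficient layout, not an exact openEMS number.
def fdtd_memory_gb(cells, floats_per_cell=18, bytes_per_float=4):
    return cells * floats_per_cell * bytes_per_float / 1e9

for cells in (1e6, 10e6, 50e6):
    print(f"{cells/1e6:>5.0f} M cells -> ~{fdtd_memory_gb(cells):.1f} GB")
```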
I've determined that the "1000% outlier" was probably just an artifact of debugger overhead, but I'm still convinced that the 10% variation is a real effect. I'm still testing my patch and will report back if I find anything.
During the investigation for #105, a large amount of data was collected. I used an SQL query to look for speed outliers, but nothing obvious turned up. So even if the problem really does exist, it seems to be very specific to my system. I'm closing this issue for now, unless I or anyone else finds reproducible evidence of its existence.
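For reference, a hypothetical sketch of the kind of outlier query I mean; the database file, table, and column names below are placeholders, not the actual schema of the collected data:

```python
# Hypothetical outlier query: flag runs more than 20% below the average
# speed of their own test. File, table and column names are placeholders.
import sqlite3

con = sqlite3.connect("benchmarks.db")
rows = con.execute("""
    SELECT test_name, run_id, speed_mcs
    FROM runs
    WHERE speed_mcs < (
        SELECT 0.8 * AVG(r2.speed_mcs)
        FROM runs AS r2
        WHERE r2.test_name = runs.test_name
    )
    ORDER BY test_name, speed_mcs
""").fetchall()

for test_name, run_id, speed in rows:
    print(f"{test_name}: run {run_id} at {speed:.1f} MC/s is >20% below average")
con.close()
```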
During my performance testing, I've found the multi-threaded performance of openEMS is inconsistent. A performance degradation of around 10% to 20% is common on desktops during "unlucky" runs; it seems to be triggered roughly once every 5-10 test runs.
This inconsistency can be extreme on a many-core system, such as a workstation or server with 16 cores or more. In one test run, I saw a performance variation of about 1000% in the worst case, dropping from 150 MC/s in a good run to 25 MC/s in a pathologically bad run of MSL_Filter.py on a 64-core AWS virtual server with 10 threads. I suspect the dynamic memory allocation of many small chunks, as I reported in #100, is a possible cause, but I was still occasionally seeing the 10% to 20% performance variation even with my contiguous-memory patch applied, so there must be other factors.
My current hypotheses include:
I'll continue to investigate this performance problem.
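In case it helps anyone reproduce this, a minimal sketch of how the run-to-run spread could be quantified: time several back-to-back runs of the same benchmark by wall clock (so nothing is assumed about the engine's own speed reporting) and report the relative spread:

```python
# Minimal sketch: measure run-to-run variation of the same benchmark script
# by wall-clock time; the run count and script name are placeholders.
import statistics
import subprocess
import time

times = []
for i in range(10):
    t0 = time.perf_counter()
    subprocess.run(["python3", "MSL_Filter.py"], check=True,
                   stdout=subprocess.DEVNULL)
    times.append(time.perf_counter() - t0)
    print(f"run {i}: {times[-1]:.1f} s")

mean = statistics.mean(times)
print(f"mean {mean:.1f} s, spread {100*(max(times)-min(times))/mean:.1f} %")
```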