thliebig / openEMS

openEMS is a free and open-source electromagnetic field solver using the EC-FDTD method.
http://openEMS.de
GNU General Public License v3.0

Inconsistent Multi-thread Performance Especially on Many-Core Systems #103

Closed biergaizi closed 1 year ago

biergaizi commented 1 year ago

During my performance testing, I've found that the multi-threaded performance of openEMS is inconsistent. A performance degradation of around 10% to 20% is common on desktops during "unlucky" runs; it seems to be triggered roughly once every 5-10 test runs.

This inconsistency can be extreme on a many-core system, such as a workstation or server with 16 cores or more. In one test run, I saw a performance variation of about 1000% in the worst-case scenario, dropping from 150 MC/s in a good run to 25 MC/s in a pathologically bad run of MSL_Filter.py on a 64-core AWS virtual server with 10 threads.

I suspect the dynamic memory allocation of many small chunks, as I reported in #100, could be one cause, but I was still occasionally seeing the 10% to 20% performance variation even with my contiguous-memory patch applied, so there must be other factors.
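
For context, here is a rough sketch (not actual openEMS code; the `AlignedField3D` name and `CACHELINE` value are mine) of what the contiguous-memory approach from #100 amounts to: one cacheline-aligned allocation for a flattened 3D array instead of many small heap chunks.

```cpp
#include <cstddef>
#include <new>

constexpr std::size_t CACHELINE = 64;  // assumed cacheline size

// Illustrative flattened 3D field storage: a single aligned block replaces
// a pointer-of-pointers layout built from many small allocations.
struct AlignedField3D
{
    float*      data;
    std::size_t nx, ny, nz;

    AlignedField3D(std::size_t x, std::size_t y, std::size_t z)
        : nx(x), ny(y), nz(z)
    {
        data = static_cast<float*>(
            ::operator new[](nx * ny * nz * sizeof(float),
                             std::align_val_t(CACHELINE)));
    }
    ~AlignedField3D()
    {
        ::operator delete[](data, std::align_val_t(CACHELINE));
    }
    AlignedField3D(const AlignedField3D&) = delete;
    AlignedField3D& operator=(const AlignedField3D&) = delete;

    // Flattened, row-major indexing.
    float& operator()(std::size_t i, std::size_t j, std::size_t k)
    {
        return data[(i * ny + j) * nz + k];
    }
};
```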

My current hypotheses include:

  1. 2D arrays. I only patched the vectorized N_3DArray memory allocator, but openEMS still uses many dynamically allocated 2D arrays, which may have the same problem.
  2. Some kind of memory / cacheline alignment problem, only triggered when the offset is unlucky.
  3. Bad process scheduling by the operating system. Some thread placements may be particularly bad: for example, if a core and its hyperthread sibling are scheduled to run threads working on unrelated parts of the simulation domain, there would be a massive overhead (a minimal pinning sketch follows this list). A similar problem can occur on a NUMA platform.
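
For hypothesis 3, a minimal sketch (Linux/pthreads, not openEMS code) of what explicit thread pinning would look like; the core IDs are an assumption, and a real implementation would read the topology from hwloc or sysfs to avoid pairing unrelated threads on one physical core or across NUMA nodes:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for pthread_setaffinity_np() and CPU_SET()
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

// Pin the calling thread to a single logical CPU.
static void pin_current_thread(int cpu_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu_id, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        std::perror("pthread_setaffinity_np");
}

int main()
{
    const int num_threads = 4;  // hypothetical worker count
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i)
        workers.emplace_back([i]() {
            pin_current_thread(i);  // assumes CPUs 0..3 are distinct physical cores
            // ... process this thread's slice of the simulation domain ...
        });
    for (auto& t : workers)
        t.join();
    return 0;
}
```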

I'll continue to investigate this performance problem.

biergaizi commented 1 year ago

It's worth repeating the same test using the MPI engine. MPI uses message passing, while the multi-threaded engine uses shared memory, so if the problem is memory contention, the MPI engine should work much better than the multi-threaded engine on the same machine.
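
To make the contrast concrete, this is the general halo-exchange pattern a domain-decomposed FDTD run uses (a generic sketch, not the actual openEMS MPI engine; the slab dimensions are assumptions): each rank owns a private slab plus ghost planes and exchanges only one boundary plane per step with its neighbours, so ranks never touch each other's field memory.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nx = 64, ny = 64, layers = 32;  // hypothetical local slab size
    const int plane = nx * ny;
    // +2 ghost planes (index 0 and layers+1) hold the neighbours' boundary fields.
    std::vector<double> slab((layers + 2) * plane, 0.0);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // One halo exchange per timestep: send our topmost interior plane to the
    // right neighbour, receive the left neighbour's into our lower ghost plane.
    MPI_Sendrecv(&slab[layers * plane], plane, MPI_DOUBLE, right, 0,
                 &slab[0],              plane, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // ... update this rank's interior cells using its private slab ...

    MPI_Finalize();
    return 0;
}
```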

Does the MPI engine support automatic domain decomposition like the multi-threaded one? I read the old example code and found that the simulation domain had to be decomposed explicitly in the code...

thliebig commented 1 year ago

I would assume that you never use all available threads, as this would not be the most efficient way. And yes, in this case, if the threads are on cores that do not work well together, this could be bad. But that would depend on the CPU, of course.

It was my observation that MPI really only brings any benefit if the simulation domain is huge. Otherwise the "slow" MPI communication very quickly becomes a much larger bottleneck than the memory speed. Thus I never really used or put much effort into MPI. The first and best approach for faster simulation should always be to create a better mesh.

0xCoto commented 1 year ago

> It was my observation that MPI really only brings any benefit if the simulation domain is huge.

How huge are we talking? Even with optimized meshes, certain simulations can occupy 1-50M cells depending on the size of the model, so it'd be interesting to have some idea of at what point MPI begins to become useful.

biergaizi commented 1 year ago

I've determined that the "1000% outlier" was probably just a mistake due to debugger overhead, but I'm still convinced that the 10% variation is a real effect. I'm still testing my patch and will report back if I find anything.

biergaizi commented 1 year ago

During the investigation for #105, a large amount of data has been collected. I used an SQL query to look for speed outliers, but nothing obvious showed up. So even if the problem really does exist, it seems to be very specific to my system. I'm closing this issue for now, unless I or anyone else finds reproducible evidence of its existence.