thliebig / openEMS

openEMS is a free and open-source electromagnetic field solver using the EC-FDTD method.
http://openEMS.de
GNU General Public License v3.0

20%-30% Speedup Possible by Fixing Suboptimal Array Layout #100

Open biergaizi opened 1 year ago

biergaizi commented 1 year ago

I just investigated the performance of the current FDTD engine implementation, and I've identified two bottlenecks that are causing unnecessary slowdowns in simulations - at the same time, they're easy to fix without making any change to the algorithm.

  1. The N_3DArray_v4sf 4D array is allocated as multiple non-contiguous arrays linked by pointers, instead of a single array in linear memory. This allows the multi-dimensional array to be accessed via the regular C syntax array[n][x][y][z], but it comes with a high cost: each time the electric or magnetic field at a single point is accessed, the access must go through three layers of pointer dereferences before the actual data can be obtained. This problem is known as pointer chasing.

  2. The X/Y/Z components of the E&M fields are stored as the first dimension of the 4D array. However, in a simulation they're almost always accessed in rapid succession, as in array[0][a][b][c], array[1][a][b][c], array[2][a][b][c]. Each time a component of the field is accessed, it causes a large jump in memory location, as large as the entire size of the 3D space, producing many unnecessary cache misses.

A simple fix is available for both problems by changing the memory layout of the arrays. First, the array is flattened to a contiguous region in linear memory, removing all indirections. Next, the X/Y/Z components are moved to the last dimension of the array instead of the first, so each access is consecutive, as in n, n + 1, n + 2, allowing the CPU to do a better job of fetching them from memory.

This underlying memory-layout change can also be made transparent to all the current code, without rewriting it, by using C++ operator overloading.

Unfortunately, overloading multi-dimensional array indexing, as in array[x][y][z], is not possible in C++. To solve this problem, most software libraries overload operator() instead. This means every single array access in the current code must be changed from array[n][x][y][z] to array(n, x, y, z) to allow both fixes. It's surely a tedious job, but it does not require modifying any equations, just a lot of keystrokes. And I think this change would be a worthwhile effort.

In my experiment, after these changes are made, I found an up-to-30% speedup in simulation speed. Please let me know if you have any comments about this proposed change, and whether the project is willing to accept such a patch. Thanks.

biergaizi commented 1 year ago

Update: I withdraw the second claim that:

The X/Y/Z components of the E&M fields are stored as the first dimension of the 4D array. However, in a simulation they're almost always accessed in rapid succession, as in array[0][a][b][c], array[1][a][b][c], array[2][a][b][c]. Each time a component of the field is accessed, it causes a large jump in memory location, as large as the entire size of the 3D space, producing many unnecessary cache misses.

The argument for using an "array of structures" layout is that it allows better vectorization, while the argument for a "structure of arrays" layout is that it reduces the number of non-contiguous memory accesses.

However, after repeated testing, my conclusion is that in the grand scheme of FDTD time-stepping code, the number of cache misses is already large enough that the difference between both options is negligible.

The 20%-30% observed speedup is entirely attributable to the fact that the linear array avoids the unnecessary 3-layer pointer chasing.

thliebig commented 1 year ago

Where do you have your numbers (20-30%) from? Did you make the proposed changes and benchmark them, or are these just theoretical estimates?

First of all, I know for a fact that the cache misses are a (or the) major issue that slows everything down. That better memory addressing can improve it by this much, I'm much less sure about.

Memory has to be accessed going +1/-1 in all four dimensions for every single calculation. Thus it will never be a simple n+1 or n-1, and huge jumps in memory are unavoidable (IMHO) unless a crazily complicated memory layout were invented.

I will not claim that a huge amount of thought went into the current memory layout. But some thought went into it nonetheless, especially for the most difficult last dimension and its (SSE) vector layout. That said, it would be easy to manually calculate the memory address and directly jump/pick into the array? After all, the data is stored linearly in memory in any case?

Therefore I'm looking forward to a pull request demonstrating that a (directly) linear memory layout indeed can achieve a nice increase in speed.

biergaizi commented 1 year ago

Where do you have your numbers (20-30%) from? Did you make the proposed changes and benchmark them?

Yes. This number is based on my testing of a draft of my proposed patch, which was quickly hacked together in a few hours. More work is needed to verify its correctness and to turn it into a presentable patch, but early testing suggests it's a viable solution.

Memory has to be accessed going +1/-1 in all four dimensions for every single calculation. Thus it will never be a simple n+1 or n-1.

Right, this is also why I have already withdrawn the second claim. My updated post said the observed 20% speedup is entirely due to the conversion of the 4D array into a 1D one, so exactly one malloc of a single block of memory is involved, removing the intermediate pointer dereferences during array access.

and huge jumps in memory are unavoidable (IMHO) unless a crazily complicated memory layout were invented.

Indeed. I think, on purely theoretical grounds, at least a 200% or greater speedup is possible, but it would involve an extremely complicated form of loop tiling/blocking, which is not something I'm capable of doing.

But if it can later be proven that simply flattening the array alone brings a 20% improvement, this would be an essentially "free" improvement.

That said, it would be easy to manually calculate the memory address and directly jump/pick into the array? After all the data is stored linearly in memory in any case?

Not in any case, certainly not in the existing implementation.

In the C programming language, an array allocated by means of

float ***array;
array = malloc(sizeof(float **) * X_SIZE);
for (int x = 0; x < X_SIZE; x++) {
    array[x] = malloc(sizeof(float *) * Y_SIZE);
    for (int y = 0; y < Y_SIZE; y++) {
        array[x][y] = malloc(sizeof(float) * Z_SIZE);
    }
}

is not linear except for its last dimension. The multiple malloc() calls return arbitrary addresses, possibly located in non-contiguous regions of memory. This is nothing but a hack to work around C's limitation that a multi-dimensional array with a dynamic size cannot be accessed using the array[i][j][k] syntax, but an array of pointers to more arrays can. Thus, accessing each element behaves like a linked-list traversal, which is highly inefficient for large arrays.

On the other hand,

float *array;
array = malloc(sizeof(float) * X_SIZE * Y_SIZE * Z_SIZE);

is a true linear array.

In C++, manual address calculation can be abstracted away from the programmer by using C++ operator overloading, so the familiar coordinates n, x, y, z can still be preserved.

The only small technical difficulty is that array[i][j][k] cannot be overloaded, so all code that uses the array[i][j][k] format must be rewritten as array(i, j, k) to allow operator overloading to take place. To make this possible I had to edit almost every single line in the FDTD engine. But at least it only needs to be done once...

thliebig commented 1 year ago

Well, it will be interesting to see. But I hope/think that in the innermost loop you do not use the overloaded operator, as that would add a lot of unnecessary jumping? I guess some indexing offsets can be calculated in advance and used more efficiently?

biergaizi commented 1 year ago

In my current draft, I used this macro instead.

#define f4_volt(n, x, y, z)                             \
        (_f4_volt                                       \
                [                                       \
                 (x) * (y_max * z_max * n_max) +        \
                 (y) * (z_max * n_max) +                \
                 (z) * (n_max) +                        \
                 (n)                                    \
                ]                                       \
        )

But I think you would probably agree that it's a maintenance nightmare, which is why I proposed operator overloading as a more robust solution. I believe that in modern C++, this kind of overloading can be done entirely at compile time with zero runtime overhead, just like a macro, I just need to figure out how...
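For illustration, a minimal sketch of what such a wrapper might look like (the class and member names are made up, it uses plain float instead of the engine's v4sf SSE vector type, and it is not the actual openEMS implementation):

#include <cstddef>
#include <vector>

struct Flat4DArray
{
    std::size_t x_max, y_max, z_max, n_max;
    std::vector<float> data;   // one contiguous allocation for the whole 4D array

    Flat4DArray(std::size_t x, std::size_t y, std::size_t z, std::size_t n)
        : x_max(x), y_max(y), z_max(z), n_max(n), data(x * y * z * n) {}

    // n is the last (fastest-varying) dimension, so the field components of
    // one cell sit next to each other in memory; the flat offset matches the
    // macro above and can be fully inlined by the compiler.
    float& operator()(std::size_t n, std::size_t x, std::size_t y, std::size_t z)
    {
        return data[((x * y_max + y) * z_max + z) * n_max + n];
    }
};

With a wrapper like this, an access written as f4_volt(n, x, y, z) looks the same whether f4_volt is the macro above or an object of such a class.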

thliebig commented 1 year ago

Well, I just think that inside the engine loop you might not want to always do all these multiplications, but rather have some precalculated offsets (e.g. an index_offset_x for +/-1 in the x-direction) and get away with just adding numbers to a running index value. I hope you understand what I mean... But I hope I will find some time to have a look at your patch soon.
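Something along these lines, just a rough sketch to illustrate what I mean (the names are made up, not from the openEMS code):

#include <cstddef>
#include <vector>

// Precomputed strides plus a running index: inside the z loop only additions
// are needed, and neighbouring cells are reached by adding/subtracting a stride.
void update_sketch(std::vector<float>& volt, const std::vector<float>& curr,
                   std::size_t x_max, std::size_t y_max,
                   std::size_t z_max, std::size_t n_max)
{
    const std::size_t stride_z = n_max;             // +/-1 in z
    const std::size_t stride_y = z_max * stride_z;  // +/-1 in y
    const std::size_t stride_x = y_max * stride_y;  // +/-1 in x

    for (std::size_t x = 1; x < x_max; x++) {
        for (std::size_t y = 1; y < y_max; y++) {
            // base index of cell (x, y, z=1, n=0)
            std::size_t idx = x * stride_x + y * stride_y + stride_z;
            for (std::size_t z = 1; z < z_max; z++, idx += stride_z) {
                // placeholder update: e.g. idx - stride_x is the (x-1, y, z) cell
                volt[idx] += curr[idx] - curr[idx - stride_x] - curr[idx - stride_y];
            }
        }
    }
}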

thliebig commented 1 year ago

BTW, looking at your reformatting patch: you should obviously only speed-optimize the most advanced engine. The base engine (in engine.cpp) should stay the same with the four-dimensional arrays to be easily understandable. It is really meant to be slow but easy to read. The formatting patch does indeed help in this regard as well...

biergaizi commented 1 year ago

I believe that on a modern CPU, the overhead of hitting a cache miss and waiting for data to load from main memory into a register is many orders of magnitude higher than doing some multiplications to compute the indexes, and modern compilers should also be good at finding and removing repeated computations on the same variables, so I think this overhead is largely negligible.

Though once the patch is ready, it will be easy to try a few variations to see the actual effect of precomputed indexes in the inner loop vs. overloading.

biergaizi commented 1 year ago

The base engine (in engine.cpp) should stay the same with the four-dimensional arrays to be easily understandable. It is really meant to be slow but easy to read. The formatting patch does indeed help in this regard as well...

I know. Reformatting all the engines is meant to allow side-by-side comparison.

biergaizi commented 1 year ago

Status Update: I did some additional testing today, and found that the performance impact of the optimizations is strongly affected by the CPU microarchitecture of the machine.

The observed 20% speedup of the patch was reconfirmed on my AMD Ryzen 3 machine, and also on an AWS virtual machine with a Graviton2 ARM64 CPU. However, when the same test was repeated on my own 5th-gen Broadwell Intel Core i7, the result ranged from no change to a performance regression of up to 20%. But when -O3 -march=native is used, it produces a 20% speedup on unmodified upstream code, and although the patch doesn't make it faster, the performance regression of the patch also mostly disappears, so both versions have roughly the same speed. I also repeated the test on an Intel Haswell-EP E5-2666 v3 server (also a virtual machine on AWS); curiously, I observed a performance regression after applying the patch. Even more curious is the fact that applying the -O3 -march=native flags to unmodified upstream code can also cause a slowdown there.

So, of the 4 machines: two show unambiguous speedups, one shows no change (if -march=native -O3 is used) or a slowdown (if default flags are used), and one always shows a slowdown.

This makes sense - the FDTD kernel is heavily affected by memory latency and bandwidth, so even a slight change in cache or memory behavior can create a visible difference. BTW, Amazon's ARM64 CPU (Graviton2) has the highest performance I've ever seen, nearly 200% as fast as all the other machines tested, likely because of a faster memory subsystem. It really shows how much progress has been made by ARM64 CPU designers.

I'll continue to test the patch on more hardware types and see if I can make further adjustments, perhaps writing a benchmark script that systematically tests all the available server types on AWS. If an unconditional speedup is not feasible, it can perhaps still be offered as an alternative engine type.

thliebig commented 1 year ago

Thanks for the update and the thorough testing. It also restores my belief that 4D arrays and the upstream style of indexing should, in the end, result in something similar to what you have created manually. But still, testing it out and maybe finding a way to make this more streamlined could result in a net speedup.

But ultimately only a larger redesign of how the FDTD engine fetches and works with the data would really result in a significant speedup. The price for that would be an engine that is far less flexible and maintainable than it currently is. I have some ideas for at least some less drastic changes that might still result in some speedup. We will see if I can find the time to try them out.

biergaizi commented 1 year ago

I did more tweaking and now I'm able to eliminate the slowdowns, at least with default compiler flags. I'm now getting around 5%-10% speedup across multiple generations of Intel CPUs, and a 20% speedup on AMD Zen 3 and ARM64.

I'll do some more tweaking and send a patch soon.

Some interesting observations during the benchmarking:

  1. Intel and AMD have similar performance after applying my patch, but with unmodified upstream code, the 20% performance penalty seems to suggest that Intel is simply better at pointer chasing than AMD. It's often claimed that AMD Zen has higher memory latency, so perhaps what I'm seeing here is a real-world example of that. Or maybe Intel simply has a better prefetcher that can guess memory addresses better. Anyway, removing the pointers allowed AMD to unleash its performance.
  2. The winner of my AWS benchmarking is the ARM64 Graviton3 CPU. It's 150% as fast as Graviton2, and Graviton2 is almost as fast as Intel Ice Lake and AMD Zen 3 - and these are 200% as fast as Intel Skylake and Haswell. This is an extremely clear pattern - Ice Lake is faster than Haswell because Intel upgraded its memory controller; in fact, AWS advertises the increased memory bandwidth as a selling point. And Graviton3 is faster because it's currently the only DDR5 server available, so it naturally outperforms almost everything else. Intel has just announced its latest server CPU for 2023, Sapphire Rapids, which also has DDR5. It's already available as a closed beta at AWS; I'll benchmark it as soon as it's available to the general public.

This is strong empirical evidence confirming our assumption: simulation speed is not determined by CPU core performance, but by memory latency and bandwidth. Top performance is not the design priority of openEMS, but for everyone who wants a computer that runs openEMS simulations as fast as possible: make sure to select a CPU that supports DDR5, and make sure all memory channels on the motherboard are populated. Even then, limited memory bandwidth means multi-core scaling doesn't work well. Beyond the point of 4 cores or so, a multi-socket server or an MPI cluster is required.

Eventually I'm going to expand the summary above into a full article. Perhaps it can be added to the openEMS documentation in case anyone asks "how to make it faster" in the future.

thliebig commented 1 year ago

Thanks for the update. I have a few questions:

Did you align the dimensions (making xmax, ymax etc. multiples of e.g. 8) to make sure cache lines do not stretch across dimensions?

Did you try very small, small and large simulation domains?

I guess all the above may have large speed impacts, hopefully all in a nice direction. We really could use a good benchmark suite with all these different setups...

biergaizi commented 1 year ago

Did you align the dimensions (making xmax, ymax etc. multiples of e.g. 8) to make sure cache lines do not stretch across dimensions?

Did you try very small, small and large simulation domains?

No, I haven't tried them yet; I just realized the same problem yesterday, and this is also what I'm currently worried about. I'm afraid that the observed performance "improvement" so far is just me getting lucky with array sizes, since I only used two random demos to measure simulation speed. All grid sizes must be systematically tested to see if it really works.

Other potential problems worth investigating include power-of-2 strides and false sharing. I read in the literature that if the array size is ill-formed, such as a power of 2, then due to the periodic way a set-associative cache maps addresses, all the strides can land in the same cache set, causing nearly 100% cache misses. So in some cases we may even need to pad the array to deliberately break this alignment. There's a similar performance impact (false sharing) if two CPU cores access different grid cells that share the same cache line.
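If padding turns out to be necessary, a rough sketch of what I have in mind (illustrative only; the exact pad size would have to be tuned by measurement, and these names are not from the openEMS code):

#include <cstddef>

// Pad a power-of-2 length so consecutive rows stop aliasing to the same cache sets.
inline std::size_t pad_dimension(std::size_t len)
{
    const bool power_of_two = len >= 16 && (len & (len - 1)) == 0;
    return power_of_two ? len + 16 : len;   // 16 floats = one 64-byte cache line
}

// Usage sketch: derive the strides from the padded length, e.g.
//   stride_y = pad_dimension(z_max * n_max);
// while the loops still run only over the original z_max and n_max.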

0xCoto commented 1 year ago

The winner of my AWS benchmarking is the ARM64 Graviton3 CPU.

@biergaizi Which EC2 instance did you test on (C7g I assume, but what are the specs other than the CPU) and how many MC/s did you see with PML?

I'm very interested in the performance of openEMS specifically on EC2 instances, so it'd be nice to know what the best performance-per-cost option would be, as well as ways of scaling.

For example, would it be better to run multiple separate instances (e.g. 4 vCPU?) in parallel, each running a completely different simulation (i.e. running parametric optimization), or is it somehow possible to scale the MC/s rate of a single simulation in some other fashion, given the memory bandwidth is limited by the CPU and the number of RAM channels?

biergaizi commented 1 year ago

Which EC2 instance did you test on (C7g I assume, but what are the specs other than the CPU) and how many MC/s did you see with PML? I'm very interested in the performance of openEMS specifically on EC2 instances

The results are still highly experimental. I'll publish a full report when I'm confident with my investigation. Perhaps I can even submit my report as part of the official documentation.

For example, would it be better to run multiple separate instances (e.g. 4 vCPU?) in parallel, each running a completely different simulation (i.e. running parametric optimization)

Yes, this is the recommendation based on my current observation. More so because there's problem #103.

or is it somehow possible to scale the MC/s rate of a single simulation in some other fashion, given the memory bandwidth is limited by the CPU and the number of RAM channels?

openEMS supports MPI to run a single simulation on a cluster using simulation domain decomposition to bypass the memory bandwidth limitation, up to the point of hitting MPI communication bottleneck.

0xCoto commented 1 year ago

This has probably already been observed by you guys, but I decided to test the performance of each ABC and I'm noticing quite a bit of variation between the different extensions (i7-10750H CPU @ 2.60GHz - relative differences may differ for Graviton3's ARM architecture):

ABC Extension    Solve Rate
PEC              ~200 MC/s
PMC              ~200 MC/s
MUR              ~155 MC/s
PML (8 cells)    ~63 MC/s

I personally use the PML extension almost exclusively (as it tends to be the most useful for the majority of simulation cases), and this hints that a ~3x improvement could be achieved if the PML were somehow as fast as PEC/PMC (or even as efficient as MUR).

Obviously getting the PML to be as fast as the other ABCs is easier said than done, considering it inherently involves more computations, but I'm just pointing this out in case a quick look reveals anything interesting that could be improved from a data structures perspective.

(engine_ext_upml.cpp hasn't been updated in 12 years, so who knows.. :slightly_smiling_face:)

biergaizi commented 1 year ago

Yes, PML's low performance is a known problem. Mur ABC only needs to access 1 coefficient to compute the value in a grid cell, but PML ABC needs 3 of them, so the 3x performance penalty is completely unsurprising... The official wiki (unfortunately still offline..., archived version) even has a notice: "Info: This ABC is not optimally implemented regarding the simulation speed. Use the Mur-ABC for faster simulations."

But still, PML ABC is extremely useful and the slow speed limits its usefulness.

The way I see it is that openEMS's architecture is fundamentally disadvantaged in terms of performance. As an academic field solver, the idea is to keep the core engine as simple as possible (MIT's MEEP uses a similar design): a simple kernel with the standard FDTD equations, while everything else, like excitation signals, lossy metals, or PML/ABC, is implemented as plugins. To run a simulation, the flowchart is basically: run most points in 3D space through the plugins to do the pre-update, run all points in 3D space again through the main FDTD kernel, and finally run most points in 3D space again through the plugins to do the post-update. The extra redundant loads and stores create significant memory bandwidth overhead...

For PML, there's a 300% amplification effect (not the same thing as the three coefficients involved) for some memory accesses at the position of PML because the grid value is accessed repeatedly in pre-update, main kernel and post-update.

Now the problem is whether it can be optimized without sacrificing modularity. I'm now seeing two potential solutions.

The first is to use (single-threaded) domain decomposition. Instead of going through the entire 3D space at once, perhaps the update can be split into multiple blocks. That is, instead of running the pre-update, main update and post-update across the entire 3D space at once, as in:

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            pre_update(x, y, z);
        }
    }
}

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            fdtd_kernel(x, y, z);
        }
    }
}

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            post_update(x, y, z);
        }
    }
}

It can perhaps be performed in a piecewise manner, allowing the main FDTD kernel and the plugins to reuse the data in the CPU cache:

int *x_start, *x_end;
int *y_start, *y_end;
int *z_start, *z_end;
int domain_len;

for (int domain = 0; domain < domain_len; domain++) {
    for (int x = x_start[domain]; x < x_end[domain]; x++) {
        for (int y = y_start[domain]; y < y_end[domain]; y++) {
            for (int z = z_start[domain]; z < z_end[domain]; z++) {
                pre_update(x, y, z);
            }
        }
    }

    for (int x = x_start[domain]; x < x_end[domain]; x++) {
        for (int y = y_start[domain]; y < y_end[domain]; y++) {
            for (int z = z_start[domain]; z < z_end[domain]; z++) {
                fdtd_kernel(x, y, z);
            }
        }
    }

    for (int x = x_start[domain]; x < x_end[domain]; x++) {
        for (int y = y_start[domain]; y < y_end[domain]; y++) {
            for (int z = z_start[domain]; z < z_end[domain]; z++) {
                post_update(x, y, z);
            }
        }
    }
}

Unfortunately, if I read the code correctly, full 3D domain decomposition is not supported by openEMS. It only supports decomposing the domain along the X axis, so if there are many cells in the Y and Z directions, a block still won't fit in cache. Implementing this solution requires adding full 3D partitioning to all the update functions.

Another possible solution is to allow a plugin to insert itself dynamically into the main field update loop of the kernel, "fusing" the computation together rather than using its own redundant loop. This way a few memory accesses can be saved:

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            if (has_extension(x, y, z)) {
                pre_update(x, y, z);
                fdtd_kernel(x, y, z);
                post_update(x, y, z);
            }
            else {
                fdtd_kernel(x, y, z);
            }
        }
    }
}

Modularity can still be somewhat preserved by using function pointers instead of hardcoding extensions into the main engine.
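For example, the hook could look roughly like this (purely illustrative; these types and names do not exist in the actual openEMS code base):

// Purely illustrative sketch of extension hooks registered as function pointers.
typedef void (*cell_hook_t)(int x, int y, int z, void *ext_data);

struct engine_hooks
{
    cell_hook_t pre_update;   // NULL if the extension has no pre-update
    cell_hook_t post_update;  // NULL if the extension has no post-update
    void *ext_data;           // extension-private state, e.g. PML coefficients
};

// The fused loop body calls whatever hooks the extension registered for this
// cell, with the standard FDTD update in between.
static inline void update_cell(const engine_hooks *hooks, int x, int y, int z)
{
    if (hooks->pre_update)
        hooks->pre_update(x, y, z, hooks->ext_data);

    /* ... standard FDTD update for this cell ... */

    if (hooks->post_update)
        hooks->post_update(x, y, z, hooks->ext_data);
}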

Both solutions also have the problem of losing the ability to randomly access a coherent electromagnetic field at any point in space from a plugin, so it cannot be applied to all extensions.

However, it can perhaps at least be a fast path for plugins that don't need access to the whole field, like PML.

relative differences may differ for Graviton3's ARM architecture

I now have testing data from 8 different computers, including 6 AWS servers: Intel Sandybridge, Haswell, Skylake, Icelake, AMD Zen 2, Zen 3, and AWS ARM64 Graviton2 and Graviton3. It's safe to say that the same pattern exists regardless of CPU architecture. The openEMS simulation is almost entirely memory-bound; CPU core performance is irrelevant. Well, not entirely - the DDR generation and the number of channels supported by the CPU matter. Graviton3's DDR5 has great bandwidth and can reach 1000 MC/s with 4 threads in a pure-PEC simulation...

Another observation is that field dumps can be extremely slow for various reasons - they are single-threaded, they create redundant memory accesses, and they possibly pollute the cache. If the simulation domain is very small and fits in the last-level cache, enabling a near-field to far-field dump box can cause a more than twofold slowdown.

0xCoto commented 1 year ago

Graviton3's DDR5 has great bandwidth and can reach 1000 MC/s with 4 threads in a pure-PEC simulation...

When you say "pure-PEC" are you referring to the ABC, or the material content of the simulation domain? 1000 MC/s with PML would sound quite nice for an inexpensive(?)* instance.

*Would that be with e.g. c7g.xlarge (8.0 GiB, 4 vCPUs) or a different Graviton3 instance?

biergaizi commented 1 year ago

By pure-PEC, I meant that the boundary conditions of the entire simulation box were all set to PEC; no PML, Mur ABC, or any other extensions were used, and all field dumps were disabled... Using Helical_Antenna.py as a test, I reached 1000 MC/s (this was the case even when I increased the number of cells many times over, to rule out the effect of the last-level cache).

This is definitely not a practical setup, but it does show the effect of DRAM bandwidth.

Using realistic boundary conditions, FDTD.SetBoundaryCond( ['MUR', 'MUR', 'MUR', 'MUR', 'MUR', 'PML_8'] ), performance dropped to 600 MC/s. And after enabling field dump, it dropped further to 250 MC/s.

biergaizi commented 1 year ago

If you want to see more practical results, here are some preliminary ones from my tests. I have more results on different machines but I do not want to release my dataset until everything is ready.

Nickname    Type        CPU                             Cores   Memory
skylake     c5n.xlarge  Xeon Platinum 8124M @ 3.0 GHz   2C/4T   DDR4
icelake     c6i.xlarge  Xeon Platinum 8375C @ 2.9 GHz   2C/4T   DDR4
zen3        c6a.xlarge  AMD EPYC 7R13 @ 3.0 GHz         2C/4T   DDR4
graviton3   c7g.xlarge  ARM Neoverse-V1 @ 2.6 GHz       4C      DDR5

In total, 10 simulations were selected from the official openEMS demos as benchmarks to compare the performance of unmodified upstream code on different CPUs, including 7 Python scripts and 3 Octave scripts: Bent_Patch_Antenna.py, Helical_Antenna.py, RCS_Sphere.py, Simple_Patch_Antenna.py, CRLH_Extraction.py, MSL_NotchFilter.py, Rect_Waveguide.py, CRLH_LeakyWaveAnt.m, CylindricalWave_CC.m, and StripLine2MSL.m.

In order to test the FDTD engine, all post-processing steps, such as plotting or the near-field to far-field transformation, were removed from the scripts. Other elements, such as dump boxes, were kept as-is. The full benchmarking test suite can be found at this repository.

Each benchmark was executed repeatedly with 1 to 4 threads, and the entire benchmark process was repeated 3 times. The fastest speed recorded for each script is used for comparison, regardless of the number of threads, since different kinds of simulations have different memory access patterns and therefore different optimal thread counts.

This is how Graviton3 can completely beat older x86_64 chips in memory-bound workloads: not by faster cores (the code simply cannot fully utilize the cores), just by the sheer bandwidth of new DDR5 memory. The 4th-gen Intel Xeon Scalable (Sapphire Rapids) also has DDR5, but it's still in closed beta on AWS; I will test it too when it's publicly available.

Some tests showed no improvements, as the simulation domain is too small and almost all overhead came from the field dump.

Also, you may notice that AMD is slower than Intel, but it doesn't need to be. Using my experimental optimization patch, Intel Icelake and AMD Zen3 have similar performance.

[Benchmark charts: Skylake vs Graviton3, Icelake vs Graviton3, Zen3 vs Graviton3]

biergaizi commented 1 year ago

I just created Pull Request #105, please take a look and continue the discussion here.

KJ7LNW commented 8 months ago

Hi @biergaizi.

So #105 was closed in favor of #117, and then #117 was closed in favor of a "more powerful tiling engine."

Is the new tiling engine available somewhere for testing?

-Eric

biergaizi commented 8 months ago

@KJ7LNW Yes. A very early prototype is available here: https://github.com/thliebig/openEMS-Project/discussions/92 More improvements are still under development.