Okay, this is such an amount of work and information. This will take some time to process. Furthermore, I really should release a v0.0.36 prior to applying this, I guess, and have all these changes in the next version then...
Let me make some comments on your ideas for future work. I have some ideas on that topic too, which would attack the problem from yet another angle, e.g. by reducing the number of coefficients at the cost of some more calculations. This would ideally come with a better/new run-length encoding of the coefficients, and then possibly multiple different engines for different requirements (e.g. PML vs. lossless vs. lossy vs. ... spaces). But all of this would ideally be done in a "new" engine, maybe using AVX as well...
By the way, what you propose in the "Engine Architecture Improvement", calling (non-inline) functions and if clauses in the innermost loop, should be a performance no-go? Those permanent context switches should be very bad, I think...
> This would ideally come with a better/new run-length encoding of the coefficients, and then possibly multiple different engines for different requirements (e.g. PML vs. lossless vs. lossy vs. ... spaces).
I was also thinking about the idea of floating-point compression to reduce memory traffic. The so-called "compressed operators" are, for now, just "deduplicated operators" with no real compression. I've read some papers on this topic. A lot of work has been done in the AI/ML field with an emphasis on half-precision floating point, which is 100% unacceptable for scientific computing. The results of some papers can be applied to single- and double-precision numbers as well; however, another roadblock is that most compression schemes are lossy, so the result would still be worse than single-precision floating point. So I don't think it would be an acceptable solution. Furthermore, many are difficult to apply without custom hardware.
We must find a way to compress the coefficients losslessly, and the overhead of decoding/decompression must be low enough to get a net speedup.
Another quick and simple optimization I've seen is the vacuum optimization. In many simulations, empty space occupies a large portion of the simulation domain. These locations have fixed permeability and permittivity, so a special case can be added in the FDTD kernel to allow direct calculation without accessing the coefficient arrays. Though I'm still not sure whether the idea is applicable to openEMS.
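As a rough sketch of the idea (the names `is_vacuum`, `C_vac`, `volt`, `vv`, `vi` and `CalcCurlH` are all illustrative, not actual openEMS code), the per-cell E-field update could special-case vacuum cells:

```cpp
// Hypothetical fast path inside the per-cell E-field update loop:
// vacuum cells use one shared constant instead of per-cell coefficients.
for (unsigned int n = 0; n < 3; n++)
{
    float curl_h = CalcCurlH(n, x, y, z); // illustrative helper
    if (is_vacuum(x, y, z))
        // fixed permittivity/permeability: no coefficient load from memory
        volt(n, x, y, z) += C_vac * curl_h;
    else
        // general case: per-cell update coefficients fetched from memory
        volt(n, x, y, z) = vv(n, x, y, z) * volt(n, x, y, z)
                         + vi(n, x, y, z) * curl_h;
}
```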
> calling (non-inline) functions and if clauses in the innermost loop should be a performance no-go?
I'm aware of that; function calls and conditional branching can be expensive. My hope is that, despite these two sources of overhead, the final result would still be a net speedup due to less data movement and reduced back-end pressure on memory.
Now another practical problem I'm seeing is that the proposed change doesn't interact well with the priority system. In the existing engine, PML has the highest priority, and the cylindrical extension has the second-highest priority. But if the PML calculations are fused into the main kernel, its priority would necessarily be lower than the cylindrical extension's, making them incompatible.
> Okay, this is such an amount of work and information. This will take some time to process. Furthermore, I really should release a v0.0.36 prior to applying this, I guess, and have all these changes in the next version then...
Also, because this PR is built upon PR #101, which reformatted the coding style of all three FDTD engines, I strongly suggest reviewing and merging PR #101 ("FDTD: reformat code of update equations.") first, before reviewing and merging this main work. This way, the diff view will only show the real differences, instead of marking everything as a fresh change because of the code reformatting.
The actual changes involved in this Pull Request are in fact minimal. The only real change is introducing `flat_array_ops.h`; all the other changes simply replace `array[n][x][y][z]` with `array = *array_ptr; array(n, x, y, z)`.
By the way, which OS image did you try on `c7g.xlarge`? Do you think `Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume` would outperform Ubuntu, considering Amazon supposedly optimizes its OS for their hardware?
Any suggestions on storage? The "Configure Storage" menu offers:

- General purpose SSD (gp3)
- General purpose SSD (gp2) [default]
- Provisioned IOPS SSD (io1)
- Provisioned IOPS SSD (io2)
- Cold HDD (sc1)
- Throughput Optimized HDD (st1)
- Magnetic (standard)
(I assume no other "Advanced details" have to be adjusted in the EC2 launch instance menu for optimal performance.)
> By the way, which OS image did you try on `c7g.xlarge`? Do you think `Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume` would outperform Ubuntu, considering Amazon supposedly optimizes its OS for their hardware?
For consistency, all of these tests, with the exception of Zen 2 (my home server), used Ubuntu. Also, no CPU-specific optimization flags were used. If Amazon Linux uses a fork of GCC with better code generation and vectorization for Graviton, it could possibly outperform Ubuntu, but I haven't checked whether that's the case.
> Any suggestions on storage?
Storage in general is not the bottleneck. The default is okay.
> But I haven't checked whether that's the case.
I'm happy to do a quick comparison test once your PRs are merged (assuming I find a way to install all the dependencies with `yum`...). For now, I just tested on Ubuntu and I can reproduce your (upstream version) numbers on `Simple_Patch_Antenna.py` and `MSL_NotchFilter.py`.
> assuming I find a way to install all the dependencies with `yum`
Does Amazon Linux still use `yum` instead of `dnf`? Here's my guide for Fedora (I should submit it to the official documentation one day); hopefully the package names are mostly identical:

https://gist.github.com/biergaizi/7f45c243f6e17b509bd1ea785a6af9e7
> I'm happy to do a quick comparison test once your PRs are merged
BTW, you don't have to wait. There's a simple trick to generate a patch that can be applied directly to the upstream tree from any pending GitHub Pull Request: just append ".patch" to the URL of the PR. For example, the patch URL of this PR is https://github.com/thliebig/openEMS/pull/105.patch
> Does Amazon Linux still use `yum` instead of `dnf`?
Yes, Amazon Linux 2 uses `yum`. It seems that Amazon Linux 2023 supports `dnf`, so I tried on that AMI, but I'm probably doing something wrong (I'm not really familiar with the `dnf` package manager; might have to manually add some repositories?):
```
$ sudo dnf install cmake boost-devel tinyxml-devel vtk-devel hdf5-devel CGAL-devel python3-Cython python3-h5py python3-matplotlib octave-devel paraview
Last metadata expiration check: 0:04:06 ago on Tue Mar 14 23:38:47 2023.
No match for argument: tinyxml-devel
No match for argument: vtk-devel
No match for argument: hdf5-devel
No match for argument: CGAL-devel
No match for argument: python3-h5py
No match for argument: python3-matplotlib
No match for argument: octave-devel
No match for argument: paraview
Error: Unable to find a match: tinyxml-devel vtk-devel hdf5-devel CGAL-devel python3-h5py python3-matplotlib octave-devel paraview

$ yum repolist enabled
repo id              repo name
amazonlinux          Amazon Linux 2023 repository
kernel-livepatch     Amazon Linux 2023 Kernel Livepatch repository
```
> but I'm probably doing something wrong (I'm not really familiar with the `dnf` package manager; might have to manually add some repositories?)
Just try installing the same packages with `yum` instead of `dnf`. If it still doesn't work, it means Amazon Linux is a completely different system and the Fedora package names are not directly applicable.
In the original publication, the "NXYZ" layout should've been called "XYZN", and vice versa. Unfortunately, the names were accidentally swapped by mistake. Correcting the wrong branch names requires replacing branches, an operation that is not supported by GitHub. This forced the closure of the existing Pull Request and the creation of a new one. The new Pull Request can be found at #117.

Let's continue the discussion at #117.
Well that naming error wasn't that bad... I could have easily lived with that mix-up ;)
Update (2023/05/20):
Abstract
In this Pull Request, a new implementation of multi-dimensional arrays is introduced in the file `flat_array_ops.h`, with the advantage of contiguous memory, as originally proposed in Issue #100. This is represented by a new data type `Flat_N_3DArray<T>`, created and destroyed by `Create_Flat_N_3DArray<T>()` and `Delete_Flat_N_3DArray<T>()`. A template specialization is also provided to handle the special case of `Create_Flat_N_3DArray<f4vector>` without the need for a separate `_v4sf()` function.

The uses of `Create_N_3DArray()` and `Create_N_3DArray_v4sf()` in the existing FDTD kernel and the UPML extension are converted to the new `Create_Flat_N_3DArray<T>()` implementation.

The tradeoff between the memory layouts NXYZ and XYZN is discussed, and the XYZN layout was eventually selected.

Benchmarks have shown a speedup of around 20%-30% in many different simulation setups.
Programming
Disadvantages of Existing Multi-dimensional Arrays
Currently, the file `array_ops.h` provides helper functions to allocate 2D, 3D, and N_3D arrays by offering `Create2DArray()`, `Create3DArray()`, and `Create_N_3DArray()`. Further, the file `array_ops.cpp` provides the additional functions `Create3DArray_v4sf()` and `Create_N_3DArray_v4sf()` for allocating vectorized arrays. However, these functions have several disadvantages due to the use of a non-contiguous memory layout.

To allow access to an element of a multi-dimensional array from a pointer using the C syntax `array_ptr[x][y][z]`, the multi-dimensional array is allocated in the following way: first, a 1D array of pointers with size "x" is allocated; each pointer points to another dynamically-allocated 1D array of pointers with size "y"; each pointer within this array then points to another dynamically-allocated 1D array with size "z". This is essentially an unoptimized form of an "Iliffe vector", with a significant performance bottleneck (image taken from Batched Cholesky Factorization for Tiny Matrices by Florian Lemaitre and Lionel Lacassagne).
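For illustration, the allocation pattern looks roughly like this (a simplified sketch; the real code in `array_ops.h` differs in details):

```cpp
// Simplified sketch of the Iliffe-vector allocation: two levels of pointer
// arrays, so only the innermost z dimension is contiguous in memory.
float*** Create3DArraySketch(unsigned int nx, unsigned int ny, unsigned int nz)
{
    float*** arr = new float**[nx];      // 1D array of "x" pointers
    for (unsigned int x = 0; x < nx; x++)
    {
        arr[x] = new float*[ny];         // each points to "y" pointers
        for (unsigned int y = 0; y < ny; y++)
            arr[x][y] = new float[nz](); // each points to "z" values
    }
    return arr;                          // arr[x][y][z] now works, at the cost
                                         // of two extra dereferences per access
}
```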
First, poor spatial locality of memory. If two elements are stored next to each other in memory and share the same cache line, accessing the first element effectively fetches the next element at the same time, saving valuable memory bandwidth. This is advantageous since openEMS's FDTD kernel is a linear walk of all points in 3D space. Similarly, when elements in memory are accessed with a regular stride, the CPU can correctly predict and prefetch the next elements.

However, in the current implementation, only the last dimension of a multi-dimensional array is stored contiguously in memory; what's found in between is bookkeeping data used internally by `malloc()` (it can be worse if the system is running with fragmented memory due to other workloads). As a result, a linear walk of the array cannot take full advantage of memory locality. When we're accessing `array[x][y][Z_MAX]` in a linear walk, `array[x][y + 1][0]` should be prefetched, but this is not possible if the last dimension of the array is not contiguous in memory.

Second, using such an Iliffe vector means accessing each element in the array involves multiple levels of pointer dereferencing: to access a single element, multiple unrelated memory addresses must be read first. This further increases memory latency and reduces the effectiveness of the CPU prefetcher.
The most seriously affected arrays are the N_3D arrays. These are effectively 4D arrays, as N is currently hardcoded to 3. They are used to store the polarization of the electromagnetic field along the X, Y, Z axes at each point (A, B, C) in 3D space, and they're the most heavily used data structure.
New Arrays in `flat_array_ops.h`
The new template functions in `flat_array_ops.h` provide a solution to these problems. They create a new class named `Flat_N_3DArray<T>`. By calling the function `Create_Flat_N_3DArray<T>`, a contiguous 1D array is allocated in memory.

The `operator()` of `Flat_N_3DArray<T>` is then overloaded, allowing one to access elements transparently. This practice is consistent with other existing C++ libraries developed for scientific computing, and it will be easy to replace `Flat_N_3DArray<T>` with another professional multi-dimensional array library if there's ever a need in the future.

Furthermore, instead of creating an unnecessary equivalent of `Create_N_3DArray_v4sf()` to handle vector data types, a template specialization for the type `f4vector` is used. Thus, both scalar and vector arrays can be created using the same template syntax.

Since the most seriously affected arrays are the N_3D arrays, only the function `Create_Flat_N_3DArray<T>` is provided at this time, and the plan is to replace all occurrences of `Create_N_3DArray()` and `Create_N_3DArray_v4sf()` completely with `Create_Flat_N_3DArray()`. Currently, only the FDTD kernel and the UPML extension have been converted. In the future, the plan is to also replace `Create3DArray()` and `Create2DArray()` with their "flat" equivalents to ensure a consistent coding style, eventually deprecating the entire file `array_ops.h`.

Micro-optimizations
The `Flat_N_3DArray` also contains three micro-optimizations.
Flexible array member
The last member of the struct `Flat_N_3DArray` is an array with a size of 1; this size is just a placeholder, and the actual size is determined by how much memory is allocated beyond the end of the struct at runtime via `malloc()` or `posix_memalign()`. It's technically standard-nonconforming undefined behavior, but it's a well-known technique in systems programming; usage can even be found in Unix and Windows themselves. It has the important advantage of avoiding the cost of an extra dereference. On Unix, GCC/clang officially support the "size 0" extension, so we use this well-defined option when it's available. C99 standardized `array[]` for the same purpose, but it was never added to C++.
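A minimal sketch of the technique (member names are simplified and differ from the actual class in `flat_array_ops.h`):

```cpp
#include <cstdlib> // posix_memalign, free

template <typename T>
struct Flat_N_3DArraySketch
{
    unsigned int numLines[3]; // x, y, z extents; N is hardcoded to 3
    T data[1];                // flexible array member: placeholder size only

    // XYZN layout: the three N components of a cell are adjacent in memory
    T& operator()(unsigned int n, unsigned int x, unsigned int y, unsigned int z)
    {
        return data[((x * numLines[1] + y) * numLines[2] + z) * 3 + n];
    }
};

template <typename T>
Flat_N_3DArraySketch<T>* CreateFlatSketch(const unsigned int numLines[3])
{
    size_t count = (size_t)3 * numLines[0] * numLines[1] * numLines[2];
    size_t bytes = sizeof(Flat_N_3DArraySketch<T>) + (count - 1) * sizeof(T);
    void* mem = NULL;
    // one contiguous allocation, aligned to the 64-byte cacheline, holding
    // both the header and the payload beyond the struct's nominal end
    if (posix_memalign(&mem, 64, bytes) != 0)
        return NULL;
    Flat_N_3DArraySketch<T>* arr = (Flat_N_3DArraySketch<T>*)mem;
    for (int i = 0; i < 3; i++)
        arr->numLines[i] = numLines[i];
    return arr;
}
```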
XYZN instead of NXYZ layout

In openEMS, many memory accesses have the pattern `array(0, x, y, z)`, `array(1, x, y, z)`, `array(2, x, y, z)`. Experiments showed that storing the element N as the last dimension of the array yields slightly better performance on most platforms, likely because the cache line hit rate is increased. On the vast majority of the hardware platforms tested, the XYZN layout has a significant advantage. However, for some simulations on some platforms, it produces a performance degradation instead of an improvement. This point will be discussed further later.
Hardcoded N = 3
Currently, there's no need for a 4D array that stores more than 3 values at each point in 3D space. Thus, the N dimension is hardcoded to 3.
Padding or alignment

No special padding or alignment was attempted inside the array, except that every memory allocation starts at a multiple of 64 bytes, the cache line size. Because of the header, the array itself starts at a 16-byte offset. Experiments showed that aligning to 64 bytes produces more consistent performance. Quick experiments with other padding schemes, such as making the array itself start at a multiple of 64 bytes, or inserting padding at the end of dimension Z, showed no improvement or a significant slowdown.
Using the New Arrays

To convert existing code from `Create_N_3DArray()` and `Create_N_3DArray_v4sf()` to `Create_Flat_N_3DArray()`, the following change is needed.
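A minimal before/after sketch (the variable names and exact signatures are illustrative, not the exact code in this PR). If the original code is:

```cpp
// old: Iliffe-vector array from array_ops.h
FDTD_FLOAT**** volt = Create_N_3DArray<FDTD_FLOAT>(numLines);
volt[n][x][y][z] += 1.0;
Delete_N_3DArray(volt, numLines);
```

It should be replaced with:

```cpp
// new: contiguous flat array from flat_array_ops.h
Flat_N_3DArray<FDTD_FLOAT>* volt_ptr = Create_Flat_N_3DArray<FDTD_FLOAT>(numLines);
(*volt_ptr)(n, x, y, z) += 1.0;
Delete_Flat_N_3DArray(volt_ptr);
```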
However, since the syntax `(*array)(n, x, y, z)` is ugly and confusing, it's recommended to keep the pointer to the array in a data structure; when access to the array is needed, the pointer is converted to a reference at the beginning of the function, in a style similar to:
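(A sketch; the function and member names are illustrative.)

```cpp
void UpdateVoltagesSketch(Flat_N_3DArray<FDTD_FLOAT>* volt_ptr)
{
    // convert the pointer to a reference once, at the top of the function
    Flat_N_3DArray<FDTD_FLOAT>& volt = *volt_ptr;
    // all later accesses then use the clean call syntax
    for (unsigned int n = 0; n < 3; n++)
        volt(n, 0, 0, 0) += 1.0;
}
```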
Results

Hardware Configuration

To examine the effect of the optimization, benchmarking has been performed on 8 different CPU microarchitectures: six of them are virtual machines on Amazon Web Services, the other two are physical computers. The hardware configuration is summarized in the following table.
Test Method
In total, 14 simulations were selected to compare the performance of upstream code and patched code, including 7 Python scripts and 3 Octave scripts from the openEMS official demos, and 4 Python scripts from pyEMS. They are:

- openEMS Python scripts: These are the official Python tutorials from the openEMS repository, including `Bent_Patch_Antenna.py`, `Helical_Antenna.py`, `RCS_Sphere.py`, `Simple_Patch_Antenna.py`, `CRLH_Extraction.py`, `MSL_NotchFilter.py`, and `Rect_Waveguide.py`.

- openEMS Octave scripts: These are the official Octave tutorials from the openEMS repository, including `CRLH_LeakyWaveAnt.m`, `CylindricalWave_CC.m`, and `StripLine2MSL.m`.

- pyEMS Python scripts: From the unofficial pyEMS project. This project, developed by Matt Huszagh, aims to create a high-level framework for modeling circuit boards and components: instead of creating geometric shapes, it allows the creation of high-level objects like PCB layers, vias, and resistors. At present, the project is still in a highly preliminary and experimental stage. Four demo scripts were used: `coupler.py`, `gcpw.py`, `microstrip_sma_transition.py`, and `rf_via.py`. They represent practical problems encountered by PCB designers.

In order to test only the FDTD engine, all post-processing steps, such as plotting or near-field to far-field transformation, have been deleted from the scripts. Other elements, such as dumpboxes, were kept as-is. The full benchmarking test suite can be found at this repository.
Each benchmark was executed repeatedly, from 1 thread to 4 threads, and the entire benchmark process was repeated 3 times. The fastest speed record for each script, regardless of the number of threads, is used for comparison: since different kinds of simulations have different memory access patterns, the optimal number of threads differs.

Both memory layouts, NXYZ and XYZN, have been tested.

Most simulations have been programmed to end within 30 seconds, well before reaching a sufficient number of timesteps, to keep the runtime reasonable; otherwise, even with reduced timesteps or a restricted maximum runtime, the full benchmark would take multiple days to finish. My experience suggests that the speed of a simulation usually does not vary during its runtime.
Speed
Discussion and Conclusion
After running these benchmarks, it can be concluded that:
- A 20%-30% performance improvement is demonstrated for many simulation tasks.

- Some simulations showed little to no improvement because they're bottlenecked by the field dumps. This is mainly a problem for small simulation domains, but in some cases also for large ones. Experiments showed that a similar memory layout optimization allows a 5% to 10% speedup for field dumps, but that's beyond the scope of this Pull Request.

- Occasionally, for one or two benchmarks, huge performance increases of around 150% to 200% have been observed. These are not mistakes but real results. These situations occur when the Z dimension of the simulation domain is small: the performance penalty of non-contiguous memory allocation is particularly high in these cases, and it is easily avoided in both new layouts.
  For example, `MSL_NotchFilter.py` only has 14 cells on the Z axis, and one often sees a 150% speedup. Similarly, `CylindricalWave_CC.m` only has 5 cells on the Z axis; on Haswell, the new memory layout runs 200% as fast. Nevertheless, these pathological outliers are likely not relevant for practical applications.
- On the vast majority of the hardware platforms, the XYZN memory layout has a significant performance advantage over the NXYZ memory layout, especially on AMD Zen 3 - although both layouts are faster than the unoptimized layout.

- Unfortunately, on Graviton3, although the XYZN layout brings an improvement in some simulations, in other simulations (including ones running only the FDTD kernel without any plugin) there's a 20% performance hit, making it even slower than the unoptimized upstream code. On the other hand, the NXYZ memory layout doesn't have this performance penalty. However, I believe using XYZN is a worthwhile tradeoff. In the future, both memory layouts can be made available via a runtime option, though that's outside the scope of this Pull Request.
- The performance of physical and virtual machines is not directly comparable. Due to the limited number of cores and resource contention, a 4-core virtual machine is usually slower than even a desktop computer with a similar CPU microarchitecture, in spite of the increased memory bandwidth of servers.

- AMD Zen 2 performed much slower than anticipated. This is likely due to the use of a Ryzen 3 3100, the least expensive Zen 2 desktop processor: its 4 cores are distributed across two silicon dies, which likely reduced cache hits and increased memory access overhead.
Future Work
Deprecating `array_ops.h`

After this patch, there are three separate multi-dimensional array implementations in the codebase. This is redundant and confusing to new developers. In the future, I plan to convert all existing uses of `array_ops.h` to `flat_array_ops.h`, including adding flat 2D and 3D array implementations, so that all uses of the old `array_ops.h` can be eliminated.

Moving to std::mdspan
A standard multi-dimensional array implementation, called `std::mdspan`, will be added in C++23, with backports available for previous C++ versions. This array is specifically designed for the scientific computing community, with many advantages in the context of HPC; see the US government presentation and the C++ standard committee report for more information. It's intended to be the new universal standard, with serious support from US supercomputing centers like Sandia, Oak Ridge, and Argonne National Labs.

So far there's no particular performance reason to perform this conversion, but if there's a future need, the existing homegrown `flat_array_ops.h` code can be completely converted to use `std::mdspan` for the sake of aligning with current practice.
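For illustration, a minimal sketch of what such a conversion could look like (assuming a C++23 compiler, or the kokkos/mdspan reference implementation as a backport):

```cpp
#include <mdspan> // C++23
#include <vector>

void MdspanSketch(std::size_t nx, std::size_t ny, std::size_t nz)
{
    // one contiguous buffer, viewed as a non-owning 4D array in XYZN order
    std::vector<float> storage(nx * ny * nz * 3);
    auto field = std::mdspan(storage.data(), nx, ny, nz, 3);
    field[0, 0, 0, 1] = 1.0f; // C++23 multidimensional operator[]
}
```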
Engine Architecture Improvement

Despite the micro-optimizations, openEMS's architecture is still fundamentally disadvantaged in terms of performance. As an academic field solver, the idea is to keep the core engine as simple as possible (MIT's MEEP also uses a similar design): a simple kernel with the standard FDTD equations, while everything else, like excitation signals, lossy metals, or PML/ABC, is implemented as plugins. To run a simulation, the flowchart is basically: run most points in 3D space through the plugins for the pre-update, run all points in 3D space again through the main FDTD kernel, and finally run most points in 3D space again through the plugins for the post-update. The extra redundant loads and stores create significant memory bandwidth overhead...
Now the problem is whether it can be optimized without sacrificing modularity. I'm currently seeing two potential solutions.
The first is to use (single-thread) domain decomposition. Instead of going through the entire 3D space at a time, the update can perhaps be split into multiple blocks. Instead of running pre-update, main-update, and post-update across the entire 3D space at once, as in:
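(A simplified sketch, not the exact engine code; the real iteration also covers the current updates and excitations.)

```cpp
// current flow: each phase is its own full sweep over the whole domain
DoPreVoltageUpdates();   // all extensions, full sweep
UpdateVoltages();        // main FDTD kernel, full sweep
DoPostVoltageUpdates();  // all extensions, full sweep
```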
It can perhaps be performed in a piecewise manner, allowing the main FDTD kernel and the plugins to reuse the data in the CPU cache:
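(Again a sketch, assuming hypothetical block-ranged variants of the update calls.)

```cpp
// piecewise flow: each block passes through all three phases while its
// data is still resident in the CPU cache
for (unsigned int b = 0; b < numBlocks; b++)
{
    DoPreVoltageUpdates(block[b]);
    UpdateVoltages(block[b]);
    DoPostVoltageUpdates(block[b]);
}
```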
Another possible solution is to allow a plugin to insert itself dynamically into the main field update loop of the main kernel, rather than using its own redundant loop. Modularity can still be somewhat preserved by using function pointers instead of hardcoding extensions into the main engine.
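A sketch of this idea (hypothetical names; not an existing openEMS interface):

```cpp
// An extension registers a per-cell hook that the kernel calls inside its
// main loop, instead of the extension running its own redundant sweep.
typedef void (*CellPostUpdate)(unsigned int x, unsigned int y, unsigned int z);
static CellPostUpdate pml_post_update = NULL; // set by the PML extension at setup

void UpdateVoltagesFused(unsigned int nx, unsigned int ny, unsigned int nz)
{
    for (unsigned int x = 0; x < nx; x++)
        for (unsigned int y = 0; y < ny; y++)
            for (unsigned int z = 0; z < nz; z++)
            {
                // ... standard FDTD voltage update for this cell ...
                if (pml_post_update)
                    pml_post_update(x, y, z); // cell is still hot in cache
            }
}
```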
Both solutions also have the problem of losing the ability to randomly access a coherent electromagnetic field at any point in space from within a plugin, so they cannot be applied to all extensions. However, this can perhaps at least be a fast path for plugins that don't need that ability, like the PML.