thliebig / openEMS

openEMS is a free and open-source electromagnetic field solver using the EC-FDTD method.
http://openEMS.de
GNU General Public License v3.0

Speeding Up FDTD Engine By 20%-30% by Using Multi-dimensional Arrays with Contiguous Memory #105

Closed biergaizi closed 1 year ago

biergaizi commented 1 year ago

Update (2023/05/20):

Abstract

In this Pull Request, a new implementation of multi-dimensional arrays is introduced in the file flat_array_ops.h, with the advantage of contiguous memory, as originally proposed in Issue #100. It is represented by a new data type Flat_N_3DArray<T>, created and destroyed by Create_Flat_N_3DArray<T>() and Delete_Flat_N_3DArray<T>(). A template specialization is also provided to handle the special case of Create_Flat_N_3DArray<f4vector> without the need for a separate _v4sf() function.

The uses of Create_N_3DArray() and Create_N_3DArray_v4sf() in the existing FDTD kernel and the UPML extension are converted to the new Create_Flat_N_3DArray<T>() implementation.

The tradeoff between memory layout NXYZ and XYZN is discussed, and the XYZN layout was eventually selected.

Benchmarks have shown a speedup around 20%-30% in many different simulation setups.

Programming

Disadvantages of Existing Multi-dimensional Arrays

Currently, the file array_ops.h provides helper functions to allocate 2D, 3D, and N_3D arrays by offering Create2DArray(), Create3DArray(), and Create_N_3DArray(). Further, the file array_ops.cpp provides additional functions Create3DArray_v4sf() and Create_N_3DArray_v4sf() for allocating vectorized arrays.

However, these functions have several disadvantages due to the use of a non-contiguous memory layout.

To allow accessing an element of a multi-dimensional array from a pointer using the C syntax array_ptr[x][y][z], the multi-dimensional array is allocated in the following way: first, a 1D array of pointers with size "x" is allocated; each of its pointers points to another dynamically-allocated 1D array of pointers with size "y"; each pointer within that array in turn points to another dynamically-allocated 1D array with size "z".

(Figure: Iliffe vector example, an array of 3x3 matrices aligned to 64 bytes.)

This is essentially an unoptimized form of an "Iliffe vector", with a significant performance bottleneck (image taken from Batched Cholesky Factorization for Tiny Matrices by Florian Lemaitre and Lionel Lacassagne).
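
For illustration, here is a minimal sketch of this nested allocation scheme. The "_sketch" name is hypothetical, and this is not the exact code in array_ops.cpp:

#include <cstdlib>

template <typename T>
T*** Create3DArray_sketch(unsigned int nx, unsigned int ny, unsigned int nz)
{
    // a 1D array of pointers with size "x"
    T*** array = (T***)malloc(nx * sizeof(T**));
    for (unsigned int x = 0; x < nx; x++)
    {
        // each pointer points to another 1D array of pointers with size "y"
        array[x] = (T**)malloc(ny * sizeof(T*));
        for (unsigned int y = 0; y < ny; y++)
        {
            // only this innermost 1D array with size "z" is contiguous
            array[x][y] = (T*)calloc(nz, sizeof(T));
        }
    }
    return array;  // array[x][y][z] works, at the cost of two extra dereferences
}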

First, poor memory spatial locality. If two elements are stored next to each other in memory and share the same cacheline, accessing the first element effectively fetches the next element at the same time, saving valuable memory bandwidth - this is advantageous since openEMS's FDTD kernel is a linear walk over all points in 3D space. Similarly, when elements in memory are accessed with a regular stride, the CPU can correctly predict and prefetch the next elements.

However, in the current implementation, only the last dimension of a multi-dimensional array is stored contiguously in memory; what's found in between is bookkeeping data used internally by malloc() (it can be worse if the system is running with fragmented memory due to other workloads). As a result, a linear walk of the array cannot take full advantage of memory locality. When we're accessing array[x][y][Z_MAX] in a linear walk, array[x][y + 1][0] should be prefetched, but this is not possible if the last dimension of the array is not contiguous in memory.

Second, multiple levels of pointer dereferencing. Using such an Iliffe vector means that accessing a single element involves reading multiple unrelated memory addresses first; this further increases memory latency and reduces the effectiveness of the CPU prefetcher.

The most seriously affected arrays are the N_3D arrays. These arrays are effectively 4D arrays, as N is currently hardcoded to 3. This kind of array is used to store the polarization of the electromagnetic field along the X, Y, Z axes at each point (A, B, C) in 3D space, and they're the most heavily used data structure.

New Arrays in flat_array_ops.h

The new template functions in flat_array_ops.h provide a solution to these problems. They define a new class named Flat_N_3DArray<T>; calling the function Create_Flat_N_3DArray<T>() allocates it as a single contiguous 1D array in memory.

The operator() of Flat_N_3DArray<T> is then overloaded, allowing one to access elements transparently. This practice is consistent with other existing C++ libraries developed for scientific computing. It will be easy to replace Flat_N_3DArray<T> with another professional multi-dimensional array library if the need ever arises in the future.

Furthermore, instead of introducing an unnecessary separate _v4sf() creation function to handle vector data types, a template specialization for the type f4vector is used. Thus, both scalar and vector arrays can be created using the same template syntax.

Since the most seriously affected arrays are the N_3D arrays, only the function Create_Flat_N_3DArray<T> is provided at this time, and the plan is to replace all occurrences of Create_N_3DArray() and Create_N_3DArray_v4sf() completely with Create_Flat_N_3DArray(). Currently, only the FDTD kernel and the UPML extension have been converted. In the future, the plan is to also replace Create3DArray() and Create2DArray() with their "flat" equivalents to ensure a consistent coding style, eventually deprecating the entire file array_ops.h.

Micro-optimizations

The Flat_N_3DArray also incorporates several micro-optimizations and design choices.

  1. Flexible array member

    The last member of the struct Flat_N_3DArray is an array with a size of 1; the size of 1 is just a placeholder, and its actual size is determined by how much memory is allocated beyond the end of the struct at runtime via malloc() or posix_memalign(). It's technically standard-nonconforming undefined behavior, but it is a well-known technique in systems programming; usage can even be found in Unix and Windows themselves. It has the important advantage of avoiding the cost of an extra dereference. On Unix, GCC/clang officially support the "size 0" extension, so we use this well-defined option when it's available. C99 standardized array[] for the same purpose, but it was never added to C++. (See the sketch after this list.)

  2. XYZN instead of NXYZ layout

    In openEMS, many memory accesses have the pattern array(0, x, y, z), array(1, x, y, z), array(2, x, y, z). Experiments showed that storing the element index N as the last dimension of the array yields slightly better performance on most platforms, likely because the cacheline hit rate is increased. On the vast majority of hardware platforms tested, the XYZN layout has a significant advantage.

    However, for some simulations on some platforms, this produces performance degradation instead of improvement. This point will be further discussed later.

  3. Hardcoded N = 3

    Currently, there's no need for a 4D array that stores more than 3 values at each point in 3D space. Thus, the dimension N is hardcoded to 3.

  4. Padding and alignment

    No special padding or alignment was attempted inside the array, except that all memory allocations start at a multiple of 64 bytes, the cacheline size. Because of the header, the array itself starts at a 16-byte offset. Experiments showed that aligning to 64 bytes produces more consistent performance. Quick experiments with other padding schemes, such as making the array itself start at a multiple of 64 bytes, or inserting padding at the end of dimension Z, showed no improvement or a significant slowdown.
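
To illustrate these points, here is a minimal sketch of the idea behind Flat_N_3DArray<T>. It is a simplification, not the actual flat_array_ops.h code; the "_sketch" names are hypothetical, and it assumes a trivially-constructible element type such as FDTD_FLOAT:

#include <cstdlib>
#include <new>

template <typename T>
struct Flat_N_3DArray_sketch
{
    unsigned int ny, nz;  // extents needed by the index computation
    T data[1];            // flexible array member ("size 0" with GCC/clang)

    // XYZN layout: the component index n is the fastest-varying one,
    // with N hardcoded to 3
    T& operator()(unsigned int n, unsigned int x, unsigned int y, unsigned int z)
    {
        return data[((x * ny + y) * nz + z) * 3 + n];
    }
};

template <typename T>
Flat_N_3DArray_sketch<T>* Create_Flat_N_3DArray_sketch(unsigned int nx, unsigned int ny, unsigned int nz)
{
    // header plus 3 * nx * ny * nz elements in one contiguous block
    size_t bytes = sizeof(Flat_N_3DArray_sketch<T>)
                 + ((size_t)3 * nx * ny * nz - 1) * sizeof(T);

    // every allocation starts at a 64-byte (cacheline) boundary
    void* buf = NULL;
    if (posix_memalign(&buf, 64, bytes) != 0)
        throw std::bad_alloc();

    Flat_N_3DArray_sketch<T>* array = (Flat_N_3DArray_sketch<T>*)buf;
    array->ny = ny;
    array->nz = nz;
    return array;  // release with free()
}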

Using the New Arrays

To move existing code from Create_N_3DArray() and Create_N_3DArray_v4sf() to Create_Flat_N_3DArray(), the following change is needed. If the original code is:

// initialization
FDTD_FLOAT**** array;
array = Create_N_3DArray<FDTD_FLOAT>(...);

// use
array[n][x][y][z] = 42;

// release
Delete_N_3DArray(array, ...);

It should be replaced with:

// initialization
Flat_N_3DArray<FDTD_FLOAT>* array;
array = Create_Flat_N_3DArray<FDTD_FLOAT>(...);

// use
(*array)(n, x, y, z) = 42;

// release
Delete_Flat_N_3DArray(array);

However, since the syntax (*array)(n, x, y, z) is ugly and confusing, it's recommended to keep the pointer to the array in a data structure, but when access to the array is needed, convert the pointer to a reference at the beginning of the function, in a style similar to:

// initialization
Flat_N_3DArray<FDTD_FLOAT>* array_ptr;
array_ptr = Create_Flat_N_3DArray<FDTD_FLOAT>(...);

// use
Flat_N_3DArray<FDTD_FLOAT>& array = *array_ptr;
array(n, x, y, z) = 42;

// release
Delete_Flat_N_3DArray(array_ptr);

Results

Hardware Configuration

To examine the effect of the optimization, benchmarking has been performed on 9 different CPU microarchitectures: six of them are virtual machines on Amazon Web Services, and the other three are physical computers. The hardware configuration is summarized in the following table.

Nickname     Type        CPU                              Cores     Memory
sandybridge  Physical    Core i7-2600K @ 3.4 GHz          4C/8T     DDR3-1600 Dual-Channel
zen2         Physical    Ryzen 3 3100 @ 3.6 GHz           4C/8T     DDR4-3200 Single-Channel
haswell      c4.xlarge   Xeon E5-2666 v3 @ 2.9 GHz        2C/4T     DDR4
skylake      c5n.xlarge  Xeon Platinum 8124M @ 3.0 GHz    2C/4T     DDR4
icelake      c6i.xlarge  Xeon Platinum 8375C @ 2.9 GHz    2C/4T     DDR4
zen3         c6a.xlarge  AMD EPYC 7R13 @ 3.0 GHz          2C/4T     DDR4
graviton2    c6g.xlarge  ARM Neoverse-N1 @ 2.5 GHz        4C        DDR4
graviton3    c7g.xlarge  ARM Neoverse-V1 @ 2.6 GHz        4C        DDR5
zen4-7950x   Physical    Ryzen 9 7950X @ 4.5 GHz          16C/32T   DDR5-4800 Dual-Channel

Test Method

In total, 14 simulations were selected to compare the performance of the upstream code and the patched code, including 7 Python scripts and 3 Octave scripts from the openEMS official demos, and 4 Python scripts from pyEMS. They're:

In order to test the FDTD engine, all post-processing steps, such as plotting or near-field to far-field transformation, have been deleted from the scripts. Other elements, such as dumpboxes, were kept as-is. The full benchmarking test suite can be found at this repository.

Each benchmark was executed repeatedly from 1 thread to 4 threads, and the entire benchmark process was repeated 3 times. The fastest speed recorded for each script, regardless of the number of threads, is used for comparison: since different kinds of simulations have different memory access patterns, the optimal number of threads differs.

Both memory layouts NXYZ and XYZN have been tested.

Most simulations have been programmed to end within 30 seconds, well before reaching a sufficient number of timesteps, to allow a reasonable benchmark runtime. Otherwise, even with reduced timesteps or a restricted maximum runtime, the full benchmark may take multiple days to finish. My experience suggests that the speed of a simulation usually does not vary during its runtime.

Speed

(Per-platform benchmark charts omitted; results were collected for sandybridge, haswell, zen2, skylake, icelake, zen3, graviton2, graviton3, and zen4-7950x.)

Discussion and Conclusion

After running these benchmarks, it can be concluded that:

  1. A 20%-30% performance improvement is demonstrated for many simulation tasks.

  2. Some simulations showed little to no improvement because they're bottlenecked by the field dump; this is mainly a problem for small simulation domains, but in some cases also for large ones. Experiments showed that a similar memory layout optimization allows a 5% to 10% speedup for field dumps, but that's beyond the scope of this Pull Request.

  3. Occasionally, for one or two benchmarks, huge performance increases around 150% to 200% have been observed. These are not mistakes but real results. These situations occur when the Z dimension of the simulation domain is small: the performance penalty of non-contiguous memory allocation is particularly high in these cases, and it is easily avoided by both new layouts.

    For example, MSL_NotchFilter.py only has 14 cells on the Z axis, and one often sees a 150% speedup. Similarly, CylindricalWave_CC.m only has 5 cells on the Z axis; on Haswell, the new memory layout runs 200% as fast.

    Nevertheless, these pathological outliers are likely not relevant for practical applications.

  4. On the vast majority of the hardware platforms, the XYZN memory layout has a significant performance advantage over the NXYZ memory layout, especially on AMD Zen 3 - although both layouts are faster than the unoptimized layout.

    Unfortunately, on Graviton3, although the XYZN layout brings improvement in some simulations, in other simulations (including simulations with only the FDTD kernel and no plugins) there's a 20% performance hit, making it even slower than the unoptimized upstream code. On the other hand, the NXYZ memory layout doesn't have this performance penalty.

    However, I believe using XYZN is a worthwhile tradeoff. In the future, both memory layouts can be made available via a runtime option, though that's outside the scope of this Pull Request.

  5. The performance of physical and virtual machines is not directly comparable. Due to the limited number of cores and resource contention, a 4-core virtual machine is usually slower than even a desktop computer with a similar CPU microarchitecture, in spite of the increased memory bandwidth of servers.

  6. AMD Zen 2 performed much slower than anticipated. This is likely due to the use of the Ryzen 3 3100, the least expensive Zen 2 desktop processor: its 4 cores are distributed across two silicon dies, which likely reduced cache hits and increased memory access overhead.

Future Work

Deprecating array_ops.h

After this patch, there are three separate multi-dimensional array implementations in the codebase, which is redundant and confusing to new developers. In the future, I plan to convert all existing uses of array_ops.h to flat_array_ops.h, including adding flat 2D and 3D array implementations. Thus, all uses of the old array_ops.h can be eliminated.

Moving to std::mdspan

A standard multi-dimensional array implementation, called std::mdspan, will be added in C++23, with backports available for previous C++ versions. This array type is specifically designed for the scientific computing community, with many advantages in the context of HPC. See the US government presentation and the C++ standard committee report for more information. It's intended to be the new universal standard, with serious support from US supercomputing centers like Sandia, Oak Ridge, and Argonne National Labs.

So far there's no particular performance reason to perform this conversion, but if a future need arises, the existing homegrown flat_array_ops.h code can be completely converted to std::mdspan for the sake of aligning with current practice.
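
As a rough illustration, here is a minimal sketch of what an XYZN-style N_3D array could look like with std::mdspan, assuming a C++23 compiler; the sizes are hypothetical:

#include <mdspan>   // C++23; backports exist as std::experimental::mdspan
#include <vector>

int main()
{
    unsigned int nx = 64, ny = 64, nz = 16;  // hypothetical grid size

    // one contiguous allocation, viewed as a 4D array; with the default
    // layout_right, the last extent (N) is the fastest-varying index,
    // i.e. the XYZN layout
    std::vector<float> storage((size_t)nx * ny * nz * 3);
    std::mdspan field(storage.data(), nx, ny, nz, 3);

    field[3, 4, 5, 0] = 42.0f;  // C++23 multidimensional subscript
    return 0;
}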

Engine Architecture Improvement

Despite these micro-optimizations, openEMS's architecture is still fundamentally disadvantaged in terms of performance. As an academic field solver, the idea is to keep the core engine as simple as possible (MIT's MEEP uses a similar design): a simple kernel implements the standard FDTD equations, and everything else, like excitation signals, lossy metals, or PML/ABC, is implemented as a plugin. To run a simulation, the flowchart is basically: run most points in 3D space through the plugins for the pre-update, run all points in 3D space again through the main FDTD kernel, and finally run most points in 3D space again through the plugins for the post-update. The extra redundant loads and stores create significant memory bandwidth overhead...

Now the question is whether it can be optimized without sacrificing modularity. I currently see two potential solutions.

First is to use (single-threaded) domain decomposition: instead of going through the entire 3D space at a time, perhaps the update can be split into multiple blocks. Rather than running pre-update, main-update, and post-update across the entire 3D space at once, as in:

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            pre_update(x, y, z);
        }
    }
}

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            fdtd_kernel(x, y, z);
        }
    }
}

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            post_update(x, y, z);
        }
    }
}

the update can perhaps be performed in a piecewise manner, allowing the main FDTD kernel and the plugins to reuse the data in the CPU cache:

// per-domain block boundaries (pseudocode declarations)
int x_start[], x_end[];
int y_start[], y_end[];
int z_start[], z_end[];
int domain_len;

for (int domain = 0; domain < domain_len; domain++) {
    for (int x = x_start[domain]; x < x_end[domain]; x++) {
        for (int y = y_start[domain]; y < y_end[domain]; y++) {
            for (int z = z_start[domain]; z < z_end[domain]; z++) {
                pre_update(x, y, z);
            }
        }
    }

    for (int x = x_start[domain]; x < x_end[domain]; x++) {
        for (int y = y_start[domain]; y < y_end[domain]; y++) {
            for (int z = z_start[domain]; z < z_end[domain]; z++) {
                fdtd_kernel(x, y, z);
            }
        }
    }

    for (int x = x_start[domain]; x < x_end[domain]; x++) {
        for (int y = y_start[domain]; y < y_end[domain]; y++) {
            for (int z = z_start[domain]; z < z_end[domain]; z++) {
                post_update(x, y, z);
            }
        }
    }
}

Another possible solution is to allow a plugin to insert itself dynamically into the main field update loop of the kernel, rather than using its own redundant loop:

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            if (has_extension(x, y, z)) {
                pre_update(x, y, z);
                fdtd_kernel(x, y, z);
                post_update(x, y, z);
            }
            else {
                fdtd_kernel(x, y, z);
            }
        }
    }
}

Modularity can still be somewhat preserved by using function pointers instead of hardcoding extensions into the main engine.
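
A minimal sketch of that idea follows; the names are hypothetical, not an existing openEMS interface:

// per-cell callbacks that a plugin registers with the engine
typedef void (*cell_update_fn)(int x, int y, int z);

struct extension_hooks
{
    cell_update_fn pre_update;   // NULL if the plugin needs no pre-step
    cell_update_fn post_update;  // NULL if the plugin needs no post-step
};

void fdtd_kernel(int x, int y, int z);  // the main update, defined elsewhere

// fused per-cell update: the engine loop stays generic, the plugin stays separate
void fused_update_cell(const extension_hooks* hooks, int x, int y, int z)
{
    if (hooks && hooks->pre_update)
        hooks->pre_update(x, y, z);
    fdtd_kernel(x, y, z);
    if (hooks && hooks->post_update)
        hooks->post_update(x, y, z);
}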

Both solutions also have the problem of losing the ability to randomly access a coherent electromagnetic field at any point in space from a plugin, so they cannot be applied to all extensions. However, they can perhaps at least serve as a fast path for plugins that don't need that ability, like PML.

thliebig commented 1 year ago

Okay, this is such an amount of work and information. This will take some time to process. Furthermore, I really should release a v0.0.36 prior to applying this, I guess, and have all these changes in the next version then...

Let me make some comments on your ideas for future work. I have some ideas on that topic too, which would attack the problem from yet another angle, e.g. by reducing the amount of coefficients by spending some more calculations. This would ideally come with a better/new run-length encoding of the coefficients, and then having possibly multiple different engines for different requirements (e.g. PML vs lossless vs lossy vs ... spaces). But all of this would ideally be done in a "new" engine, maybe using AVX as well...

By the way, what you propose in the "Engine Architecture Improvement" section, calling (non-inline) functions and if clauses in the innermost loop, should be a performance no-go? Those permanent context switches should be very bad, I think...

biergaizi commented 1 year ago

This would ideally come with a better/new run-length encoding of the coefficients, and then having possibly multiple different engines for different requirements (e.g. PML vs lossless vs lossy vs ... spaces).

I was also thinking about the idea of floating-point compression to reduce memory traffic. The so-called "compressed operators" are for now just "deduplicated operators" with no real compression. I've read some papers on this topic. A lot of work has been done in the AI/ML field with an emphasis on half-precision floating point, which is 100% unacceptable for scientific computing. The results of some papers can be applied to single- and double-precision numbers as well; however, another roadblock is that most compression schemes are lossy, so the result would still be worse than single-precision floating point. So I don't think it would be an acceptable solution. Furthermore, many are difficult to apply without custom hardware.

We must find a way to compress the coefficients losslessly, and the overhead of decoding/decompression must be low enough to get a net speedup.

Another quick and simple optimization I've seen is the vacuum optimization. In many simulations, empty space occupies a large portion of the simulation domain. These locations have a fixed permeability and permittivity, so a special case can be added to the FDTD kernel to allow direct calculation without accessing the coefficient arrays. Though I'm still not sure whether the idea is applicable in openEMS.
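
For illustration, a minimal sketch of the idea (hypothetical names, not openEMS code):

// update one Z-line of the E-field: cells flagged as vacuum share a single
// precomputed free-space coefficient and skip the per-cell coefficient load
void update_e_line(float* e, const float* curl_h, const float* coeff,
                   const bool* is_vacuum, int nz, float vacuum_coeff)
{
    for (int z = 0; z < nz; z++)
    {
        if (is_vacuum[z])
            e[z] += vacuum_coeff * curl_h[z];  // fixed vacuum coefficient
        else
            e[z] += coeff[z] * curl_h[z];      // per-cell material coefficient
    }
}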

calling (non inline) functions and if clauses in the inner most loop should be a performance nogo?

I'm aware of that; function calls and conditional branching can be expensive. My hope is that, despite these two sources of overhead, the final result would still be a net speedup due to less data movement and reduced back-end pressure on memory.

Now another practical problem I'm seeing is that the proposed change doesn't interact well with the priority system. In the existing engine, PML has the highest priority, and the cylindrical extension has the second highest priority. But if PML calculations are fused into the main kernel, its priority would necessarily be lower than the cylindrical extension's, making them incompatible.

biergaizi commented 1 year ago

Okay, this is such an amount of work and information. This will take some time to process. Furthermore, I really should release a v0.0.36 prior to applying this, I guess, and have all these changes in the next version then...

Also, because this PR is built upon PR #101, which reformatted the coding style of all three FDTD engines, I strongly suggest reviewing and merging PR #101 ("FDTD: reformat code of update equations.") first, before reviewing and merging this main work. This way, the diff view will only show the real differences, instead of marking everything as fresh changes because of the code reformatting.

The actual changes involved in this Pull Request are in fact minimal. The only real change is introducing flat_array_ops.h; all the other changes simply replace array[n][x][y][z] with array = *array_ptr; array(n, x, y, z).

0xCoto commented 1 year ago

By the way, which OS image did you try on c7g.xlarge? Do you think Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume would outperform Ubuntu, considering Amazon supposedly optimizes its OS for their hardware?

Any suggestions on storage? The EC2 "Configure Storage" options are:
General purpose SSD (gp3)
General purpose SSD (gp2) [default]
Provisioned IOPS SSD (io1)
Provisioned IOPS SSD (io2)
Cold HDD (sc1)
Throughput Optimized HDD (st1)
Magnetic (standard)

(I assume no other "Advanced details" have to be adjusted from the EC2 launch instance menu for optimal performance)

biergaizi commented 1 year ago

By the way, which OS image did you try on c7g.xlarge? Do you think Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume would outperform Ubuntu, considering Amazon supposedly optimizes its OS for their hardware?

For consistency, all of these tests, with the exception of Zen 2 (my home server), used Ubuntu. Also, no CPU-specific optimization flags were used. If Amazon Linux uses a fork of GCC with better code generation and vectorization for Graviton, it can possibly outperform Ubuntu. But I haven't checked whether that's the case.

Any suggestions on storage?

Storage in general is not the bottleneck. The default is okay.

0xCoto commented 1 year ago

But I haven't checked whether that's the case.

I'm happy to do a quick comparison test once your PRs are merged (assuming I find a way to install all the dependencies with yum...). For now, I just tested on Ubuntu and I can reproduce your (upstream version) numbers on Simple_Patch_Antenna.py and MSL_NotchFilter.py.

biergaizi commented 1 year ago

assuming I find a way to install all the dependencies with yum

Does Amazon Linux still use yum instead of dnf? Here's my guide for Fedora (I should submit it to the official documentation one day); hopefully the package names are mostly identical:

https://gist.github.com/biergaizi/7f45c243f6e17b509bd1ea785a6af9e7

biergaizi commented 1 year ago

I'm happy to do a quick comparison test once your PRs are merged

BTW, you don't have to wait. There's a simple trick to generate a patch that can be applied directly to the upstream from any pending GitHub pull request: just add ".patch" to the URL of the PR. For example, the patch URL of this PR is https://github.com/thliebig/openEMS/pull/105.patch

0xCoto commented 1 year ago

Does Amazon Linux still use yum instead of dnf?

Yes, Amazon Linux 2 uses yum. It seems that Amazon Linux 2023 supports dnf, so I tried on that AMI, but I'm probably doing something wrong (not really familiar with the dnf package manager, might have to manually add some repositories?):

$ sudo dnf install cmake boost-devel tinyxml-devel vtk-devel hdf5-devel CGAL-devel python3-Cython python3-h5py python3-matplotlib octave-devel paraview

Last metadata expiration check: 0:04:06 ago on Tue Mar 14 23:38:47 2023.
No match for argument: tinyxml-devel
No match for argument: vtk-devel
No match for argument: hdf5-devel
No match for argument: CGAL-devel
No match for argument: python3-h5py
No match for argument: python3-matplotlib
No match for argument: octave-devel
No match for argument: paraview
Error: Unable to find a match: tinyxml-devel vtk-devel hdf5-devel CGAL-devel python3-h5py python3-matplotlib octave-devel paraview

$ yum repolist enabled

repo id                                       repo name
amazonlinux                                   Amazon Linux 2023 repository
kernel-livepatch                              Amazon Linux 2023 Kernel Livepatch repository
biergaizi commented 1 year ago

but I'm probably doing something wrong (not really familiar with the dnf package manager, might have to manually add some repositories?)

Just try installing the same packages with yum instead of dnf.

If it still doesn't work, it means Amazon Linux is a completely different system and the Fedora package names are not directly applicable.

biergaizi commented 1 year ago

In this original publication, "NXYZ" should have been called "XYZN", and vice versa; unfortunately, the names were accidentally swapped. Correcting the wrong branch names requires replacing branches, an operation not supported by GitHub, which forced the closure of this Pull Request and the creation of a new one. The new Pull Request can be found at #117.

Let's continue further discussion at #117.

thliebig commented 1 year ago

Well, that naming error wasn't that bad... I could have easily lived with that mix-up ;)