thliebig / openEMS

openEMS is a free and open-source electromagnetic field solver using the EC-FDTD method.
http://openEMS.de
GNU General Public License v3.0

[corrected version] Speeding Up FDTD Engine By 20%-30% by Using Multi-dimensional Arrays with Contiguous Memory #117

Closed — biergaizi closed this 9 months ago

biergaizi commented 1 year ago

Update (2023/05/20)

Note

This PR is built upon PR #101, which reformatted the coding style of all three FDTD engines. I strongly suggest reviewing and merging PR #101 ("FDTD: reformat code of update equations.") first, before reviewing and merging this main work. This way, the diff view will only show the real differences, instead of marking everything as a fresh change because of the code reformatting.

The actual changes involved in this Pull Request are in fact minimal. The only real change is the introduction of flat_array_ops.h; all the other changes simply replace array[n][x][y][z] with array = *array_ptr; array(n, x, y, z).

Abstract

In this Pull Request, a new implementation of multi-dimensional arrays is introduced in the file flat_array_ops.h, with the advantage of contiguous memory, as originally proposed in Issue #100. This is represented by a new data type Flat_N_3DArray<T>, created and destroyed by Create_Flat_N_3DArray<T>() and Delete_Flat_N_3DArray<T>(). A template specialization is also provided to handle the special case of Create_Flat_N_3DArray<f4vector> without the need for a separate _v4sf() function.

The uses of Create_N_3DArray() and Create_N_3DArray_v4sf() in the existing FDTD kernel and the UPML extension are converted to the new Create_Flat_N_3DArray<T>() implementation.

The tradeoff between the NXYZ and XYZN memory layouts is discussed; the XYZN layout was eventually selected.

Benchmarks have shown a speedup of around 20%-30% in many different simulation setups.

Programming

Disadvantages of Existing Multi-dimensional Arrays

Currently, the file array_ops.h provides the helper functions Create2DArray(), Create3DArray(), and Create_N_3DArray() to allocate 2D, 3D, and N_3D arrays. Further, the file array_ops.cpp provides the additional functions Create3DArray_v4sf() and Create_N_3DArray_v4sf() for allocating vectorized arrays.

However, these functions have several disadvantages due to the use of a non-contiguous memory layout.

To allow accessing an element of a multi-dimensional array from a pointer using the C syntax array_ptr[x][y][z], the multi-dimensional array is allocated in the following way: first, a 1D array of pointers with size "x" is allocated; each of these pointers points to another dynamically-allocated 1D array of pointers with size "y"; each pointer within that array in turn points to another dynamically-allocated 1D array with size "z".
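
For illustration, the nested allocation looks roughly like the following sketch (simplified; this is not the actual Create_N_3DArray() code in array_ops.cpp):

// Simplified illustration of the nested pointer allocation described above;
// the real Create_N_3DArray() in array_ops.cpp differs in details.
template <typename T>
T**** Create_N_3DArray_Nested(unsigned int nx, unsigned int ny, unsigned int nz)
{
    T**** array = new T***[3];                  // N dimension: array of pointers
    for (unsigned int n = 0; n < 3; n++)
    {
        array[n] = new T**[nx];                 // X dimension: array of pointers
        for (unsigned int x = 0; x < nx; x++)
        {
            array[n][x] = new T*[ny];           // Y dimension: array of pointers
            for (unsigned int y = 0; y < ny; y++)
                array[n][x][y] = new T[nz]();   // Z dimension: the actual data
        }
    }
    return array;   // only each innermost Z row is contiguous in memory
}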

Iliffe vector example: an array of 3x3 matrices aligned to 64 bytes

This is essentially an unoptimized form of an "Iliffe vector", with a significant performance bottleneck (image taken from Batched Cholesky factorization for tiny matrices by Florian Lemaitre and Lionel Lacassagne).

First, poor spatial memory locality. If two elements are stored next to each other in memory and share the same cacheline, accessing the first element effectively fetches the next element at the same time, saving valuable memory bandwidth - this is advantageous since openEMS's FDTD kernel is a linear walk of all points in 3D space. Similarly, when elements in memory are accessed with a regular stride size, the CPU can correctly predict and prefetch the next elements.

However, in the current implementation, only the last dimension of a multi-dimensional array is stored contiguously in memory; what's found in between is bookkeeping data used internally by malloc() (it can be worse if the system is running with fragmented memory due to other workloads). As a result, a linear walk of the array cannot take full advantage of memory locality. When we're accessing array[x][y][Z_MAX] in a linear walk, array[x][y + 1][0] should be prefetched, but this is not possible if the last dimension of the array is not contiguous in memory.

Second, using such an Iliffe vector means accessing each element in the array involves multiple levels of pointer dereferencing: to access a single element, multiple unrelated memory addresses must be read first. This further increases memory latency and reduces the effectiveness of the CPU prefetcher.

The most seriously affected arrays are the N_3D arrays. These arrays are effectively 4D arrays, as N is currently hardcoded to 3. This kind of array is used to store the polarization of the electromagnetic field along the X, Y, Z axes at each point (A, B, C) in 3D space, and they're the most heavily used data structure.
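
As a simplified illustration of the access pattern only (the hypothetical arrays volt and coeff stand in for the real field and operator arrays, and the real update equations are more involved), a typical kernel loop touches all three components at every cell:

// Simplified sketch of the access pattern; not the actual openEMS update equations.
for (unsigned int x = 0; x < x_max; x++)
    for (unsigned int y = 0; y < y_max; y++)
        for (unsigned int z = 0; z < z_max; z++)
            for (unsigned int n = 0; n < 3; n++)
                volt[n][x][y][z] *= coeff[n][x][y][z];   // three pointer loads before each data element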

New Arrays in flat_array_ops.h

The new template functions in flat_array_ops.h provide a solution to these problems. They define a new class named Flat_N_3DArray<T>; calling the function Create_Flat_N_3DArray<T>() allocates it as a single contiguous 1D block of memory.

The operator() of Flat_N_3DArray<T> is then overloaded, allowing one to access elements transparently. This practice is consistent with other existing C++ libraries developed for scientific computing. It will be easy to replace Flat_N_3DArray<T> with another professional multi-dimensional array library if there's ever a need in the future.

Furthermore, instead of creating an unnecessary Create_N_3DArray_v4sf() variant to handle vector data types, a template specialization is used for the type f4vector. Thus, both scalar and vector arrays can be created using the same template syntax.

Since the most seriously affected arrays are the N_3D arrays, only the function Create_Flat_N_3DArray<T>() is provided at this time, and the plan is to replace all occurrences of Create_N_3DArray() and Create_N_3DArray_v4sf() completely with Create_Flat_N_3DArray(). Currently, only the FDTD kernel and the UPML extension have been converted. In the future, the plan is to also replace Create3DArray() and Create2DArray() with their "flat" equivalents to ensure a consistent coding style, eventually deprecating the entire file array_ops.h.

Micro-optimizations

The Flat_N_3DArray also contains four micro-optimizations; a minimal sketch combining them is shown after this list.

  1. Flexible array member

    The last member of the struct Flat_N_3DArray is an array with a size of 1. The size of 1 is just a placeholder; the actual size is determined by how much memory is allocated beyond the end of the struct at runtime via malloc() or posix_memalign(). This is technically standard-nonconforming undefined behavior, but it is a well-known technique in systems programming; usage can even be found in Unix and Windows themselves. It has the important advantage of avoiding the cost of an extra dereference. With GCC/clang, the "size 0" extension is officially supported, so we use this well-defined option when it's available. C99 standardized array[] (the flexible array member) for the same purpose, but it was never added to C++.

  2. XYZN instead of NXYZ layout

    In openEMS, many memory accesses have the pattern of array(0, x, y, z), array(1, x, y, z), array(2, x, y, z). Experiments showed that storing the component index N as the last dimension of the array yields slightly better performance on most platforms, likely because the cacheline hit rate is increased. On the vast majority of hardware platforms tested, the XYZN layout has a significant advantage.

    However, for some simulations on some platforms, this produces performance degradation instead of improvement. This point will be further discussed later.

  3. Hardcoded N = 3

    Currently, there's no need for a 4D array that stores more than 3 values at each point in 3D space. Thus, the N dimension is hardcoded to 3.

  4. Padding and alignment

    No special padding or alignment was attempted inside the array, except that all memory allocations start at a multiple of 64 bytes, the cacheline size. Because of the header, the array data itself starts at a 16-byte offset. Experiments showed that aligning allocations to 64 bytes produces more consistent performance. Quick experiments with other padding schemes, such as making the array data itself start at a multiple of 64 bytes, or inserting padding at the end of dimension Z, showed no improvement or a significant slowdown.
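
The following is a minimal sketch combining these four points, assuming the XYZN index formula and POSIX posix_memalign() for the 64-byte aligned allocation; the actual flat_array_ops.h implementation differs in details (e.g. the GCC/clang "size 0" extension and the f4vector specialization):

// Minimal sketch only -- not the actual flat_array_ops.h code.
#include <cstdlib>
#include <new>

template <typename T>
struct Flat_N_3DArray_Sketch
{
    unsigned int x_max, y_max, z_max;
    T data[1];   // flexible-array-member idiom: real storage extends past the struct

    // XYZN layout: the three components of one cell sit next to each other.
    T& operator()(unsigned int n, unsigned int x, unsigned int y, unsigned int z)
    {
        return data[((x * (size_t)y_max + y) * z_max + z) * 3 + n];
    }
};

template <typename T>
Flat_N_3DArray_Sketch<T>* Create_Flat_N_3DArray_Sketch(unsigned int x, unsigned int y, unsigned int z)
{
    // Allocate the header plus 3 * x * y * z elements (N hardcoded to 3),
    // starting on a 64-byte (cacheline) boundary.
    size_t bytes = sizeof(Flat_N_3DArray_Sketch<T>) + sizeof(T) * ((size_t)3 * x * y * z - 1);
    void* mem = nullptr;
    if (posix_memalign(&mem, 64, bytes) != 0)
        throw std::bad_alloc();
    Flat_N_3DArray_Sketch<T>* arr = static_cast<Flat_N_3DArray_Sketch<T>*>(mem);
    arr->x_max = x; arr->y_max = y; arr->z_max = z;
    return arr;   // elements are left uninitialized in this sketch
}

template <typename T>
void Delete_Flat_N_3DArray_Sketch(Flat_N_3DArray_Sketch<T>* arr)
{
    free(arr);   // memory from posix_memalign() is released with free()
}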

Using the New Arrays

To move existing code from Create_N_3DArray() and Create_N_3DArray_v4sf() to Create_Flat_N_3DArray(), the following change is needed. If the original code is:

// initialization
FDTD_FLOAT**** array;
array = Create_N_3DArray<FDTD_FLOAT>(...);

// use
array[n][x][y][z] = 42;

// release
Delete_N_3DArray(array, ...);

It should be replaced with:

// initialization
Flat_N_3DArray<FDTD_FLOAT>* array;
array = Create_Flat_N_3DArray<FDTD_FLOAT>(...);

// use
(*array)(n, x, y, z) = 42;

// release
Delete_Flat_N_3DArray(array);

However, since the syntax (*array)(n, x, y, z) is ugly and confusing, it's recommended to keep the pointer to the array in a data structure, and, whenever access to the array is needed, convert the pointer to a reference at the beginning of the function, in a style similar to:

// initialization
Flat_N_3DArray<FDTD_FLOAT>* array_ptr;
array_ptr = Create_Flat_N_3DArray<FDTD_FLOAT>(...);

// use
Flat_N_3DArray<FDTD_FLOAT>& array = *array_ptr;
array(n, x, y, z) = 42;

// release
Delete_Flat_N_3DArray(array_ptr);

Results

Hardware Configuration

To examine the effect of the optimization, benchmarking has been performed on 12 different machines: six of them are virtual machines on Amazon Web Services, the other six are physical computers. The hardware configuration is summarized in the following table.

| Nickname | Type | CPU | Cores | Memory |
|---|---|---|---|---|
| sandybridge | Physical | Core i7-2600K @ 3.4 GHz | 4C/8T | DDR3-1600 Dual-Channel |
| zen2 | Physical | Ryzen 3 3100 @ 3.6 GHz | 4C/8T | DDR4-3200 Single-Channel |
| haswell | c4.xlarge | Xeon E5-2666 v3 @ 2.9 GHz | 2C/4T | DDR4 |
| skylake | c5n.xlarge | Xeon Platinum 8124M @ 3.0 GHz | 2C/4T | DDR4 |
| icelake | c6i.xlarge | Xeon Platinum 8375C @ 2.9 GHz | 2C/4T | DDR4 |
| zen3 | c6a.xlarge | AMD EPYC 7R13 @ 3.0 GHz | 2C/4T | DDR4 |
| graviton2 | c6g.xlarge | ARM Neoverse-N1 @ 2.5 GHz | 4C | DDR4 |
| graviton3 | c7g.xlarge | ARM Neoverse-V1 @ 2.6 GHz | 4C | DDR5 |
| zen4-7950x | Physical | Ryzen 9 7950X @ 4.5 GHz | 16C/32T | DDR5-4800 Dual-Channel |
| zen4-7950x3d | Physical | Ryzen 9 7950X3D @ 4.2 GHz (CCD1/CCD2 tested separately) | 16C/32T | DDR5-4800 Dual-Channel |
| apple-m1 | Physical | Apple M1 @ 3.2 GHz | 4P+4E/8T | LPDDR4X-4266 Quad-Channel (half as wide as DDR4) |
| m1-ultra | Physical | Apple M1 Ultra @ 3.2 GHz | 16P+4E/20T | LPDDR5 32-Channel (up to ~400 GB/s from the CPU) |

Test Method

In total, 14 simulations were selected to compare the performance of the upstream code and the patched code, including 7 Python scripts and 3 Octave scripts from the openEMS official demos, and 4 Python scripts from pyEMS. They're:

In order to test the FDTD engine itself, all post-processing steps, such as plotting or near-field to far-field transformation, have been deleted from the scripts. Other elements, such as dump boxes, were kept as-is. The full benchmarking test suite can be found at this repository.

Each benchmark was executed with 1 to 4 threads, and the entire benchmark process was repeated 3 times. The fastest speed record for each script is used for comparison, regardless of the number of threads, since different kinds of simulations have different memory access patterns and thus different optimal thread counts.

Both memory layouts NXYZ and XYZN have been tested.

Most simulations have been programmed to end within 30 seconds, well before reaching the number of timesteps a complete simulation would need, in order to keep the runtime reasonable. Otherwise, even with reduced timesteps or a restricted maximum runtime, the full benchmark may take multiple days to finish. My experience suggests that the speed of a simulation usually does not vary much during its runtime.

Speed

Per-platform benchmark charts (images omitted): sandybridge, haswell, zen2, skylake, icelake, zen3, graviton2, graviton3, zen4-7950x, zen4-7950x3d CCD1, zen4-7950x3d CCD2, apple-m1, apple-m1-ultra.

Discussion and Conclusion

After running these benchmarks, the conclusions are as follows.

Hardware

  1. openEMS failed to show the expected performance improvement when running on Apple's ultra-high memory bandwidth M1 Ultra system (with 800 GB/s DRAM bandwidth, of which ~400 GB/s is achievable with a CPU-only workload). The chart below is a performance comparison between the AMD Ryzen 9 7950X and the Apple M1 Ultra.

Although 2x speedups exist in a few simulations, these speedups are not as great as one would expect from the increase in memory bandwidth. Combined with the fact that the Apple M1 Ultra is known to have low memory bandwidth per core (many cores are required to saturate it), this suggests that openEMS's multi-threading implementation is the bottleneck. Using a spatial tiling algorithm may help solve this problem, and I'm working on it. But for now, this remains the bottleneck, and openEMS is unable to fully utilize the memory bandwidth, even with the optimized array layout.

In the rest of the simulations, there are no speedups, and sometimes even slowdowns. Possibly this is because these simulations are small and already fit in cache, so the bottleneck becomes the CPU clock frequency again.

AMD 7950X vs. Apple M1 Ultra

  2. openEMS failed to show a performance improvement when running on AMD's 3D V-Cache CPU, the Zen 4 7950X3D (with 96 MiB of last-level cache). There is little difference between pinning openEMS to CCD1 (with 3D V-Cache) or CCD2 (without 3D V-Cache). In fact, the V-Cache-less CCD2 is slightly faster than CCD1.

Again, my hypothesis is that a small-to-medium sized run (<= 1 million cells) already more or less fits in a moderately-sized cache, and what remains is mostly software overhead, the multi-threading bottleneck, and CPU clock frequency. For large runs, the high cache replacement rate and software overhead basically negate the improvement, so the added cache has no effect.

Using spatial and temporal tiling algorithms may help solve this problem, and I'm working on it. But for now, openEMS cannot utilize the benefits of the 3D V-Cache.

There are also some anomalies: even in a 7950X3D CCD2 (non-V-Cache) run, a few tests are way faster than on the 7950X, by 50%. Because the 7950X test didn't use core affinity, this is possibly a factor behind the discrepancy; the distro / library version may be another. A retest on the 7950X under similar conditions is needed to explain the result.

7950X3D, CCD1 vs. CCD2

  3. AMD Zen 2 performed much slower than anticipated. This is likely due to the use of the Ryzen 3 3100, the least expensive Zen 2 desktop processor: its 4 cores are partitioned into 2 separate regions on the die, which likely reduced cache hits and increased memory access overhead.

Software

  1. A 20%-30% performance improvement is demonstrated for many simulation tasks.

  2. Some simulations showed little to no improvement, likely because the simulation domains are too small and memory bandwidth is not a bottleneck. In these cases, the field dump can become the bottleneck; in some other cases, field dumps are a problem even for large simulation domains. Experiments showed that a similar memory layout optimization allows a 5% to 10% speedup for field dumps, but that is beyond the scope of this Pull Request.

  3. Occasionally, for one or two benchmarks, huge performance increases of around 150% to 200% have been observed. These are not mistakes but real results. These situations occur when the Z dimension of the simulation domain is small. The performance penalty of the non-contiguous memory allocation is particularly high in these cases, and it is easily avoided by both new layouts.

    For example, MSL_NotchFilter.py only has 14 cells on the Z axis, and one often sees a 150% speedup. Similarly, CylindricalWave_CC.m only has 5 cells on the Z axis; on Haswell, the new memory layout runs twice as fast.

    Nevertheless, these pathological outliers are likely not relevant for practical applications.

  4. On the vast majority of the hardware platforms, the XYZN memory layout has a significant performance advantage over the NXYZ memory layout, especially on AMD Zen 3 - although both layouts are faster than the unoptimized layout.

    Unfortunately, on Graviton3, although the NXYZ layout brings an improvement in some simulations, in other simulations (including simulations with only the FDTD kernel and no plugins) there's a 20% performance hit, making it even slower than the unoptimized upstream code. On the other hand, the XYZN memory layout doesn't have this performance penalty.

    Overall, I believe using XYZN is a worthwhile tradeoff. In the future, both memory layouts could be made available via a runtime option, though that is outside the scope of this pull request.

  5. The performance of physical and virtual machines is not directly comparable. Due to the limited number of cores and resource contention, a 4-core virtual machine is usually slower than even a desktop computer with a similar CPU microarchitecture, in spite of the increased memory bandwidth of servers.

Future Work

Deprecating array_ops.h

After this patch, there are currently three separate multi-dimensional array implementations in the codebase. This is redundant and confusing to new developers. In the future, I plan to convert all existing uses of array_ops.h to flat_array_ops.h, including adding flat 2D and 3D array implementations. Thus, all uses of the old array_ops.h can be eliminated.

Moving to std::mdspan

A standard multi-dimensional array implementation, called std::mdspan, will be added in C++23, with backports available for previous C++ versions. This array type is specifically designed for the scientific computing community, with many advantages in the context of HPC. See the US government presentation and the C++ standard committee report for more information. It's intended to be the new universal standard, with serious support from US supercomputing centers like Sandia, Oak Ridge, and Argonne National Lab.

So far there's no particular performance reason to perform this conversion, but if there's a future need, the existing homegrown flat_array_ops.h code can be completely converted to use std::mdspan for the sake of aligning with current practice.
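
As a rough illustration only (assuming a C++23 compiler with the <mdspan> header, or the Kokkos reference backport with minor syntax changes; this is not part of this PR), the same XYZN array could be expressed as a non-owning view over one contiguous buffer:

// Rough illustration only; requires C++23 <mdspan>.
#include <mdspan>
#include <vector>

int main()
{
    const std::size_t x_max = 10, y_max = 10, z_max = 10;

    // One contiguous buffer, viewed as an x * y * z * 3 array (XYZN, N innermost).
    std::vector<float> storage(x_max * y_max * z_max * 3);
    std::mdspan field(storage.data(), x_max, y_max, z_max, std::size_t(3));

    field[1, 2, 3, 0] = 42.0f;   // x=1, y=2, z=3, component n=0
    return 0;
}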

Engine Architecture Improvement

Despite these micro-optimizations, openEMS's architecture is still fundamentally disadvantaged in terms of performance. As an academic field solver, the idea is to keep the core engine as simple as possible (MIT's MEEP also uses a similar design): a simple kernel with the standard FDTD equations, while everything else, like excitation signals, lossy metals, or PML/ABC, is implemented as a plugin. To run a simulation, the flowchart is basically: run most points in 3D space through the plugins to do the pre-update, run all points in 3D space again through the main FDTD kernel, and finally run most points in 3D space again through the plugins to do the post-update. The extra redundant loads and stores create significant memory bandwidth overhead. There's also a need to synchronize all threads between each extension, as any thread can modify the global state of the simulation domain; I later found this creates a significant overhead and is one cause of the poor multi-thread scaling.

A further possible improvement is temporal tiling (time skewing): instead of calculating one timestep at a time, multiple timesteps are calculated at once. This greatly reduces the cache replacement rate, boosting performance. It would also allow us to utilize the benefits of AMD's 3D V-Cache CPUs.

Now the problem is whether the engine can be optimized without sacrificing modularity. I currently see two potential solutions.

First is to use (single-thread) domain decomposition. Instead of going through the entire 3D space at a time, perhaps the update can be split into multiple blocks. Instead of running pre-update, main-update, and post-update across the entire 3D space at once, as in:

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            pre_update(x, y, z);
        }
    }
}

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            fdtd_kernel(x, y, z);
        }
    }
}

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            post_update(x, y, z);
        }
    }
}

it can perhaps be performed in a piecewise manner, allowing the main FDTD kernel and the plugins to reuse the data in the CPU cache:

int *x_start, *x_end;   // per-domain block bounds, filled in elsewhere
int *y_start, *y_end;
int *z_start, *z_end;
int domain_len;

for (int domain = 0; domain < domain_len; domain++) {
    for (int x = x_start[domain]; x < x_end[domain]; x++) {
        for (int y = y_start[domain]; y < y_end[domain]; y++) {
            for (int z = z_start[domain]; z < z_end[domain]; z++) {
                pre_update(x, y, z);
            }
        }
    }

    for (int x = x_start[domain]; x < x_end[domain]; x++) {
        for (int y = y_start[domain]; y < y_end[domain]; y++) {
            for (int z = z_start[domain]; z < z_end[domain]; z++) {
                fdtd_kernel(x, y, z);
            }
        }
    }

    for (int x = x_start[domain]; x < x_end[domain]; x++) {
        for (int y = y_start[domain]; y < y_end[domain]; y++) {
            for (int z = z_start[domain]; z < z_end[domain]; z++) {
                post_update(x, y, z);
            }
        }
    }
}

Another possible solution is allowing a plugin to insert itself dynamically into the main field update loop of the kernel, rather than using its own redundant loop:

for (int x = 0; x < x_max; x++) {
    for (int y = 0; y < y_max; y++) {
        for (int z = 0; z < z_max; z++) {
            if (has_extension(x, y, z)) {
                pre_update(x, y, z);
                fdtd_kernel(x, y, z);
                post_update(x, y, z);
            }
            else {
                fdtd_kernel(x, y, z);
            }
        }
    }
}

Modularity can still be somewhat preserved by using function pointers instead of hardcoding extensions into the main engine.
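
As a purely hypothetical sketch of what such function-pointer hooks could look like (none of these names exist in openEMS today):

// Hypothetical per-cell extension hooks; purely illustrative.
typedef void (*cell_update_fn)(int x, int y, int z, void* ctx);

struct CellHooks
{
    cell_update_fn pre_update;    // null if the extension has no pre-update at this cell
    cell_update_fn post_update;   // null if the extension has no post-update at this cell
    void* ctx;                    // extension-specific state
};

// Inside the fused main loop, the kernel would consult a per-cell hook table
// instead of hardcoding any particular extension:
//
//   const CellHooks* h = hook_table(x, y, z);    // nullptr for plain cells
//   if (h && h->pre_update)  h->pre_update(x, y, z, h->ctx);
//   fdtd_kernel(x, y, z);
//   if (h && h->post_update) h->post_update(x, y, z, h->ctx);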

Both solutions also have the problem of losing the ability to randomly access a coherent electromagnetic field at any point in space from a plugin, so they cannot be applied to all extensions. However, they can perhaps at least serve as a fast path for plugins that don't need that, like the PML.

Another potential problem is introducing either extra pointer chasing or branching into the engine kernel, which may incur a performance hit. However, since the kernel has a fairly consistent access pattern during execution, which should benefit from the prefetching and branch prediction of modern CPUs, it's hoped that the ultimate performance gain is still a net positive due to reduced memory traffic, in spite of the slight penalty from pointer chasing or branching.

biergaizi commented 1 year ago

I just updated the text with additional benchmark results on the AMD Ryzen 9 7950X3D (with 3D V-Cache) and on the Apple M1 Ultra (with 800 GB/s memory bandwidth, of which ~400 GB/s is usable from the CPU). The tests were finished a month ago, but I only just remembered to publish them. The conclusion is disappointing: both machines failed to provide any real performance benefits, likely due to bottlenecks within openEMS. This was my motivation to start working on spatial and temporal tiling.

In the future, with loop tiling implemented, I expect much greater benefits on these platforms.

CC @0xCoto

thliebig commented 9 months ago

@biergaizi can you please confirm that this can/should be merged? After #101 I guess? Thanks again for your work on this!

biergaizi commented 9 months ago

I have changed my mind: my new plan is to stop working on this minor change, which brings merely a 20% speedup, in favor of merging the more powerful tiling engine with a larger speedup into upstream in the future.

thliebig commented 9 months ago

ah ok, good that I asked...