Hybrid OpenMP-MPI implementation Strategy + Vectorization (SIMD)

Preamble

I am moving the discussion about SIMD that started in #716 here and adding hybrid parallelization. The two topics go hand in hand since both (SPMD and SIMD) consist of processing multiple data (MD) elements simultaneous, either by a single program (SP) that is run by multiple threads (generally with shared view of memory), or by a single instruction (SI) run by a single core. The reason SIMD came up in #716 is that, as I will demonstrate, vectorization needs to be supported by data structures. On the other hand SPMD needs to be supported by algorithms designed to avoid race conditions, two or more threads modifying the same memory location.

Instead of continuing #716 I think it is better to let that become documentation for #753. I did not add without loss of readability to the title because it is long as is, that requirement is present nonetheless. I open these issues in the hope that people participate (I am not a fast writer so this is actually a lot of work) and so far great comments and insights have come from those with experience in these topics (kudos to @economon and @vdweide). But please participate even if you never heard of these topics, your opinion about readability and "developability" of the code is important! I think the code-style should be accessible to people starting a PhD (after they read a bit about C++...).

My (ambitious) goal with this work is to lay down an architecture for performance, i.e. not just to improve the performance of a few key numerical schemes but to create mechanisms applicable to all existing and future ones. Moreover I want that to be possible with minimal changes to the way those bits of code are currently written.

Intro to SIMD

The ALU of modern CPU are capable of processing multiple elements of built-in types simultaneously by applying one instruction (e.g. add) to a register of those elements. Registers are at the very top of the memory hierarchy, for any computation to be performed data needs to be in registers. An AVX register is 256 bits wide, that means 4 lanes of doubles or 8 of floats, AVX-512 (available in Xeon-Phi and SkylakeX processors) doubles the size. By GPU standards these are rookie numbers.

Why should we care about SIMD? Because it is the only way to use the whole silicon, by and large vector instructions have the same latency and throughput of their scalar versions, therefore speedups proportional to the number of SIMD lanes are possible in compute-bound code. As we saw in #716 there is some of that in the numerics, do not expect 4x speed-ups though, low order unstructured FVM is known to be bandwidth-bound, vectorization helps a bit there too (instructions are also data that needs to travel to the CPU) (maybe for explicit schemes and 8 SIMD lanes, maybe).

Relation with data structures There is only one efficient way to move data between memory and registers, via load and store instructions (they do come in multiple flavors). That is, pointing to a memory location and reading or writing N elements of contiguous data. It is not the only way, it is also possible to gather and scatter data. That is populating the register from non-contiguous locations and vice versa. This is about one order of magnitude slower, to the point where if the computations are very simple it may not pay-off to vectorize.

Relation with algorithms Some form of gather and scatter is required in unstructured CFD, which means SIMD has a price of admission. Some thought needs to go into designing algorithms that amortize that cost by maximizing the so called FLOP/Byte ratio, and mask the latency of those operations by being able to start computing as soon as the first element of data is available.

What elements should we try to process simultaneously? The choice is between multiple geometric primitives (edges/points) or multiple solution primitives (variables). The latter sounds like a sensible idea until we get to areas of the code where different primitives require different treatment, that and the fact that the number of variables might not fit evenly in the number of lanes can lead to very tricky and non-generic code. Nevertheless if the same code were to be applied to e.g. 4 solution variables, this strategy would likely perform better as it avoids the pesky gather/scatter operations. Processing multiple geometric primitives can make full utilization of whatever register size (important on GPU's), the code is just as readable (as I hope to show), but gather/scatter cannot be avoided.

Intro to SPMD

This one is simpler, in a nutshell multiple threads operate on the sub domain of an MPI rank. The typical implementation has each thread executing a chunk of an edge or cell loop.

Why should we care about SPMD? Reduce the communication overhead resulting from domain decomposition and improve load balancing, important for strong scaling. Some algorithms are more efficient that way, e.g. the ADT (as mentioned by Edwin), the current MG also seems to work better on fewer partitions, and additive versions of preconditioners like the ILU or LU-SGS lose effectiveness with number of partitions. Optimum hardware utilization, for routines that are bandwidth-bound it may be beneficial to use all threads available, while for compute-bound or "algorithm-bound" ones this may not be the case.

Relation with algorithms A typical edge loop reads from 2 locations and writes to 2 locations (gather / scatter access pattern, not to be confused with the instructions) processing multiple edges at the same time can therefore result in race conditions where multiple threads try to update the data of the same point. There are 3 ways to address this:

Coloring: Edges are colored (grouped) such that edges of the same color have no risk of race conditions, i.e. each endpoint is referenced only once per color (this definition gives you the basis of a greedy algorithm to color edges).
Scatter to gather transformations: Edge quantities (e.g. fluxes) are computed and stored on one pass (i.e. we read from 2 locations and write to 1), on a second pass, over points, we reduce (e.g. sum) the edge quantities for each point, again a gather access pattern. It may also be possible to convert the entire algorithm to a loop over points instead of edges.
Atomic operations or locks: Here when a thread wants to write to a memory location it either needs to do so atomically (this is essentially an operations that always goes through main memory and forces cache coherency) or it needs to acquire a lock for the point it is writing to, if it fails to acquire the lock (because another thread has it) it needs to wait.

None of these is without drawbacks.

Coloring reduces temporal locality, edges are sorted in increasing order of the point indices to reduce cache misses, this means small groups of contiguous edges will share the same "iPoint", coloring single edges destroys this. Furthermore coloring either requires edges to be re-sorted by color, or if the edge indices of each color are instead kept in arrays, performance will suffer due to increased indirection which confounds the hardware pre-fetcher. This can be mitigated to some extent by coloring groups of edges, groups of edges of the same color can be processed simultaneously, but within each group edges need to be processed serially. However grouping will reduce how much parallelism can be exploited within each color.
Gather to scatter will in general use more memory due to the intermediate variables and extra adjacency information needed. If the entire algorithm is transformed performance may suffer as some computations may have to be repeated. However, some reductions are possible without intermediate variables, for example when assembling the system matrix for implicit schemes only the diagonal entries can result in race conditions, now it just so happens that each diagonal entry is equal to the negated corresponding column sum.
Atomics are terrible for the performance of code that writes frequently to memory (i.e. bandwidth-bound code), they do not increase the memory footprint and so make sense for compute-bound code. Bugs due to a missing atomic can be very hard to debug (but any race condition is).

Coloring is what one sees most in the literature, and yet I lean towards gather-to-scatter. Fewer things can go wrong with it as it is easy to understand, one gets the maximum amount of parallelism.

I will now take two familiar routines, computing gradients (Green-Gauss) and limiters, vectorize / parallelize them in different ways, and measure relative performance to illustrate some of these key points introduced here. There will be C++ snipets and there will be some x86 assembly too :)

Disclaimer The performance numbers that follow are based on simple implementations of the methods, I do not claim any of my implementations or choice of methods to be optimal. If you know better speak up. The data is from the case used to benchmark #753 (see #716), it is by no means an extensive collection of different grid types. I will share code and data with anyone who wants to repeat the tests on the condition they post detailed results.

With that out of the way :) ...

Green-Gauss Gradients

This is the plain edge-loop version of the code with boundary contributions omitted for simplicity:

void computeGradients(size_t nEdge,
                      size_t nPoint,
                      size_t nVar,
                      size_t nDim,
                      const vector<pair<size_t,size_t> >& connectivity,
                      const Matrix& area,
                      const vector<double>& volume,
                      const Matrix& phi,
                      VectorOfMatrix& grad)
{
    grad.setZero();

    for(size_t iEdge=0; iEdge<nEdge; ++iEdge)
    {
        size_t iPoint = connectivity[iEdge].first;
        size_t jPoint = connectivity[iEdge].second;

        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            double phi_ave = 0.5*(phi(iPoint,iVar)+phi(jPoint,iVar));

            for(size_t iDim=0; iDim<nDim; ++iDim)
            {
                double flux = phi_ave*area(iEdge,iDim);

                grad(iPoint,iVar,iDim) += flux;
                grad(jPoint,iVar,iDim) -= flux;
            }
        }
    }

    for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
      for(size_t iVar=0; iVar<nVar; ++iVar)
        for(size_t iDim=0; iDim<nDim; ++iDim)
          grad(iPoint,iVar,iDim) /= volume[iPoint];
}

This is more or less what SU2 does with minor differences on how the edges (connectivity) and area are stored, there is no vectorization nor easy way to make the loop parallel, this will be the reference for execution times.

Suppose now that due to a perfect storm the number of variables is 4, here is how with a few pragmas we get gcc to vectorize:

template<size_t nVar>
void computeGradients_impl(size_t nEdge,
                           size_t nPoint,
                           size_t nDim,
                           const vector<pair<size_t,size_t> >& connectivity,
                           const Matrix& area,
                           const vector<double>& volume,
                           const Matrix& phi,
                           VectorOfMatrix& grad)
{
    grad.setZero();

    for(size_t iEdge=0; iEdge<nEdge; ++iEdge)
    {
        size_t iPoint = connectivity[iEdge].first;
        size_t jPoint = connectivity[iEdge].second;

        double phi_ave[nVar];

        #pragma omp simd
        for(size_t iVar=0; iVar<nVar; ++iVar)
          phi_ave[iVar] = 0.5*(phi(iPoint,iVar)+phi(jPoint,iVar));

        for(size_t iDim=0; iDim<nDim; ++iDim)
        {
            #pragma omp simd
            for(size_t iVar=0; iVar<nVar; ++iVar)
            {
                double flux = phi_ave[iVar]*area(iEdge,iDim);

                grad(iPoint,iVar,iDim) += flux;
                grad(jPoint,iVar,iDim) -= flux;
            }
        }
    }

    for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
      for(size_t iDim=0; iDim<nDim; ++iDim)
        #pragma omp simd
        for(size_t iVar=0; iVar<nVar; ++iVar)
          grad(iPoint,iVar,iDim) /= volume[iPoint];
}

Well it is not just a few pragmas, we need to make the number of variables known at compile time (via a template parameter) and we need to transpose how the gradient is stored, i.e. instead of {xyz, xyz, xyz, xyz} we need {xxxx, yyyy, zzzz}. This code gets a speed-up of 2.2.

This code is generic but the template needs to be instantiated for every possible number of variables and we need a switch to call the right version at runtime, not very friendly. Processing multiple edges at the same time is not worth the effort, for one we need gather/scatter on a very light routine, and on top of that we need to sort the edges such that we do not attempt to scatter to the same point when updating the gradient (a problem similar to the race condition described for SPMD).

We can switch to a point-based loop and process multiple points in a SIMD way, that avoids the scatter problem but gathers will still be required. Here is what the scalar version of the point-based loop looks like:

void computeGradients(size_t nPoint,
                      size_t nVar,
                      size_t nDim,
                      const Adjacency& adj,
                      const Matrix& area,
                      const vector<double>& volume,
                      const Matrix& phi,
                      VectorOfMatrix& grad)
{
    for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
    {
        for(size_t iVar=0; iVar<nVar; ++iVar)
          for(size_t iDim=0; iDim<nDim; ++iDim)
            grad(iPoint,iVar,iDim) = 0.0;

        for(size_t iNeigh=0; iNeigh<adj.nNeighbor(iPoint); ++iNeigh)
        {
            size_t jPoint = adj.jPoint(iPoint,iNeigh);
            size_t iEdge = adj.iEdge(iPoint,iNeigh);
            double dir = adj.dir(iPoint,iNeigh);

            for(size_t iVar=0; iVar<nVar; ++iVar)
            {
                double phi_ave = 0.5*(phi(iPoint,iVar)+phi(jPoint,iVar));

                for(size_t iDim=0; iDim<nDim; ++iDim)
                  grad(iPoint,iVar,iDim) += phi_ave*dir*area(iEdge,iDim);
            }
        }

        for(size_t iVar=0; iVar<nVar; ++iVar)
          for(size_t iDim=0; iDim<nDim; ++iDim)
            grad(iPoint,iVar,iDim) /= volume[iPoint];
    }
}

The Adjacency class stores for each point: the surrounding neighbor points (this is available in SU2), the neighbor edges, and the direction (in or out, -1 or 1) of the area vector relative to the point. The speedup is 0.83 (i.e. not a speedup), that is actually not that bad considering the same computation is repeated for each edge, the reason it is not that bad is the sequential access to the gradient. Note that this loop is one #pragma away from parallelization.

The SIMD version of this code is:

void computeGradients(size_t nPoint,
                      size_t nVar,
                      size_t nDim,
                      const Adjacency<4>& adj,
                      const Matrix& area,
                      const Vector& volume,
                      const Matrix& phi,
                      VectorOfMatrix& grad)
{
    const size_t SIMDLEN = 4;

    for(size_t iPoint=0; iPoint<nPoint; iPoint+=SIMDLEN)
    {
        for(size_t iVar=0; iVar<nVar; ++iVar)
          for(size_t iDim=0; iDim<nDim; ++iDim)
            grad.setVec(iPoint,iVar,iDim,Array<double,SIMDLEN>(0.0));

        for(size_t iNeigh=0; iNeigh<adj.nNeighbor_vec(iPoint); ++iNeigh)
        {
            auto jPoint = adj.jPoint_vec(iPoint,iNeigh);
            auto iEdge = adj.iEdge_vec(iPoint,iNeigh);
            auto dir = adj.dir_vec(iPoint,iNeigh);

            for(size_t iVar=0; iVar<nVar; ++iVar)
            {
                auto phi_ave = (phi.getVec(iPoint,iVar)+
                                phi.getVec(jPoint,iVar))*0.5;

                for(size_t iDim=0; iDim<nDim; ++iDim)
                  grad.addVec(iPoint,iVar,iDim,
                              phi_ave*dir*area.getVec(iEdge,iDim));
            }
        }

        for(size_t iVar=0; iVar<nVar; ++iVar)
          for(size_t iDim=0; iDim<nDim; ++iDim)
            grad.setVec(iPoint,iVar,iDim,
                        grad.getVec(iPoint,iVar,iDim)/volume.getVec(iPoint));
    }
}

I think this is just as readable especially considering that in SU2 we always need to use some Set/Get/Add/Sub method to update a variable, the difference is that here those methods have overloads to operate on small fixed size vectors. The speedup is 1.35 (i.e. 35% faster than edge-based reference) note that the improvement relative to scalar-point-based is only 1.6, those pesky gathers...

The loop advances SIMDLEN points on each iteration, yet there are no pragmas and small simd-loops in sight, in good C++ fashion that trickery has been encapsulated in a "simd-friendly" class. Such a class can look something like this:

template<class T, size_t N>
class Array
{
#define FOREACH for(size_t k=0; k<N; ++k)
  public:
    enum : size_t {Size = N};
    enum : size_t {Align = N*sizeof(T)};
  private:
    // fixed size and aligned array of internal data, naturally maps to a SIMD register
    alignas(Align) T vals_[N];
    /*
     * Some helper methods go here
     */
  public:
    // **** CONSTRUCTORS **** //
    // We want to be able to construct this type from single scalars,
    // a memory location from which we LOAD data,
    // or a memory location and some offsets from which we GATHER data.
    // In addition to the "normal" constructors.

    // scalar broadcasting ctor
    STRONGINLINE Array(T x) {bcast(x);}

    // loading ctor
    STRONGINLINE Array(const T* ptr)
    {
        #pragma omp simd aligned(ptr:Align)
        FOREACH vals_[k] = ptr[k];
    }
    // gathering ctor
    template<class U>
    STRONGINLINE Array(const T* base_ptr, const U& offsets)
    {
        #pragma omp simd
        FOREACH vals_[k] = base_ptr[offsets[k]];
    }
    /*
     * Other traditional constructors (default, copy-ctor, move-ctor, etc) go here
     */

    // **** ACCESSORS **** //
    STRONGINLINE T& operator[] (size_t k) {return vals_[k];}
    STRONGINLINE T operator[] (size_t k) const {return vals_[k];}

    // **** MATH OPERATORS **** //
    STRONGINLINE Array& operator= (const Array& rhs)
    {
        #pragma omp simd
        FOREACH vals_[k] = rhs.vals_[k];
        return *this;
    }

    STRONGINLINE Array& operator+= (const Array& rhs)
    {
        #pragma omp simd
        FOREACH vals_[k] += rhs.vals_[k];
        return *this;
    }
    STRONGINLINE Array operator+ (const Array& rhs) const { return Array(*this)+=rhs; }

    /*
     * Many other operators go here.
     */
};

// Common math function overloads
template<class T>
STRONGINLINE T vmax(const T& a, const T& b)
{
    T res;
    #pragma omp simd
    for(size_t k=0; k<T::Size; ++k)
        res[k] = (a[k]>b[k])? a[k] : b[k];
    return res;
}

#undef FOREACH

There are other (better) ways to do this, for example using x86 intrinsics (in header <x86intrin.h>), register types instead of arrays, and a boat load of template meta-programming (I'm guessing) there are professional libraries for this. This quickly-hacked-together code is compatible with custom types, portable, and seems to do the trick.

To pull this off we do not need to have Vector or Matrix of this class, the underlying type for those data structures is still double, only the getVec type methods need to convert on the fly to the SIMD type, for example:

// use the "pointer ctor" to return an array starting at "row0"
Array<double,4> Matrix<double>::getVec(size_t row0, size_t col) const {
    return Array<double,4>(&data_[row0+col*rows_]);
}

// use the "gather ctor" to return an array with the indices in "rows"
template<class U>
Array<double,4> Matrix<double>::getVec(const U& rows, size_t col) const {
    return Array<double,4>(&data_[col*rows_], rows);
}

After inlining those copies get optimized away. Although the stored type, and "scalar interface" of the containers do not need to change, the storage order of the data does. Notice that in the above data is stored by columns instead of rows (something that @vdweide mentioned in #716) this has greater implications for gradients as instead of the familiar "vector of matrices" we would need a "matrix of vectors", i.e. the derivative of variable i w.r.t. coordinate j stored as a vector for all points.

The Adjacency also needs to be stored in a funny way. For the scalar version of the code it was stored as a CSR sparse matrix (one array of indices into the arrays of data for each point, the rows). For the vectorized version we want to load (small) arrays of jPoint's, arrays of iEdge's, and arrays of directions, and as we know either those are contiguous or we take a huge performance hit. If all points had the same number of neighbors we could store the adjacency in LIL (list of lists) format, essentially a column-major matrix, but that is not true for hybrid meshes and so we would possibly waste a lot of memory. The solution is to use a Block-CSR format (like in CSysMatrix) where the blocks are the vectors we want and instead of one row per point we have one row per SIMD group.

But even within a SIMD-sized group points can have different number of neighbors... The solution for that is padding, within each group the number of neighbors is rounded up, shorter rows are then padded with valid data, e.g. jPoint=iPoint, direction=0, and iEdge repeated.

This concept of padding is important for something else, you may have noticed that the SIMD point-loops I showed make no provisions for values of nPoint that are not multiples of SIMDLEN, that is because the containers already took care of that by rounding up the number of columns, and so that seemingly out-of-bounds access is safe (ain't encapsulation great). Padding also aligns the start of each column, thus it is a generally good thing to have (on large dimensions) whether used or not.

Here is a relative performance recap before we talk bout parallelization

Code	Edge	Edge, SIMD on vars	Point	Point, SIMD on points
Speed	1.0	2.2	0.83	1.35

Parallel execution

I will start at the end for this, all it takes to parallellize the points loops with OpenMP is to take this:

for(size_t iPoint=0; iPoint<nPoint; iPoint+=SIMDLEN)

And add some pixie dust

#pragma omp parallel for schedule(dynamic,128)
for(size_t iPoint=0; iPoint<nPoint; iPoint+=SIMDLEN)

This means each thread gets chunks of 128 loop iterations (512 points) to work on, assigned in a dynamic way, the 4 core speedup (still relative to our reference) is 3.8 for the SIMD code and 2.8 for the scalar code.

Parallelizing the edge loops is a bit more intricate, as this:

    for(size_t iEdge=0; iEdge<nEdge; ++iEdge)
    {
        size_t iPoint = connectivity[iEdge].first;
        size_t jPoint = connectivity[iEdge].second;

Becomes:

    for(size_t color=0; color<colorStart.size()-1; ++color)
    #pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
    for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
    {
        #if SORT_BY_COLOR==1
            size_t iEdge = k;
        #else
            size_t iEdge = edgeIdx[k];
        #endif

        size_t iPoint = connectivity[iEdge].first;
        size_t jPoint = connectivity[iEdge].second;

Apologies for the macro but it is just to illustrate that if we re-sort edge data after coloring the edge index is the loop index, otherwise the edge indices for each color need to be stored in a separate array. Note that for each edge loop we first loop over colors, then over same-color edges, it is this inner loop that can run in parallel in chunk sizes that are multiple of the group size considered during coloring. There is some runtime cost on entry to every #omp parallel section, with coloring we enter one such section once by color.

I mentioned in the introduction coloring reduces locality and therefore performance, here is the effect of color group size on the execution time of the scalar code on one thread: The hassle-free option of not sorting by color "never" recovers the performance of the base algorithm, things are even worse for the SIMD version where even at group size of 8192 with re-sorting the slowdown is 14%.

Running the edge-loop version on 4 cores (8192 group + sorting) we get speedups (relative to reference) of 1.98 and 2.04 for the scalar and SIMD versions respectively (yes I quadruple checked). If you are keeping track of the number two things should surprise you, the first is that there is no difference between scalar and SIMD now (the vector instruction are still there though), the second is that 4 cores give only a 2x speedup. The reason for both is: the implementation is very memory-bound, and so throwing more compute at it, either in the form of more cores or more lanes, does not help much.

This is the 4 core summary:

Code	Edge	Edge, SIMD on vars	Point	Point, SIMD on points
Speed	2.0	2.0	3.8	2.8

I think the point-based versions scale better because they are a bit less memory-bound as they write to the gradient sequentially and they have a bit more compute due to the duplicated computations.

Conclusion Computing gradients via point-loops allows simpler and more generic SIMD and SPMD strategies, the resulting implementation seems to do better in the bandwidth-starved conditions typical of modern hardware (3 or more cores per memory channel). However, additional adjacency information is required to support point-based loops.

Next I will talk about limiters, almost all concepts are introduced so it will be shorter (promise). As a little appetizer let me tell you we can recover the extra memory and we could be looking at a 2.7x speedup for gradients+limiters.

Nice progress @pcarruscag!

I like the concept of your SIMD-friendly class that will take care of the data structure under the hood coupled with a standard type of loop statement (w/ +SIMDLEN). This should make it pretty easy for folks to still modify the kernels without having to worry about the data alignment, and they can reuse the same simple 'for' construct repeatedly.

Another reason to have our own lightweight class for this is that you can avoid dependence on OpenMP for SIMD (although that feature looks to have potential and wasn't available until somewhat recently) as well as the intrinsics. In my experience, the latter is especially bad for portability and readability (part of why we left the CaF work in a separate repo). It starts to become so specialized that compiling and modifying become difficult. W.r.t. OpenMP, another roadblock there a few years ago was making sure it is interoperable with CoDi for the adjoint, but I know this has been worked on and may be available by now.

Might keep an open mind about point vs. edge. In some places, we may be able to pump up the compute in our loops by fusing kernels, as previously discussed (and I am guessing you are working on this already with gradients/limiters). Could change the final performance numbers significantly.

Lastly, I know you are not there yet, but it is worth considering whether you can reuse anything you are developing in the kernels here for the linear solver routines. At some point, you will successfully reduce the cost of the residual kernels (RHS) to the bandwidth limit, and the majority of the iteration cost will be in the linear solver (it is already about 50% of the iteration cost before optimization, if I recall). Before making final decisions on strategy, you should consider if it will help in any of the linear solver routines too.

Thanks @economon! I don't know what is the current situation with OpenMP and CoDi but any eventual change will have to be compatible with CoDi. The worst case would be disabling OpenMP for the discrete adjoint, the parallel clause supports an "if" modifier so that would not be too hard. But I hope to at least be able to lower the memory footprint by fusing some loops or make pre-accumulation more effective by using point loops. The linear solvers will indeed become the bottleneck, they already are for JST, the good thing is matrix multiplication is easier to vectorize, not sure the best strategy will be similar though.

Limiters

Scalar (reference) version of the code:

void computeLimiters(size_t nPoint,
                     size_t nVar,
                     size_t nDim,
                     const vector<pair<size_t,size_t> >& connectivity,
                     const Matrix& coords,
                     const Matrix& phi,
                     const VectorOfMatrix& grad,
                     Matrix& phiMax,
                     Matrix& phiMin,
                     Matrix& limiter)
{
    for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
    {
        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            phiMax(iPoint,iVar) = phi(iPoint,iVar);
            phiMin(iPoint,iVar) = phi(iPoint,iVar);
            limiter(iPoint,iVar) = 2.0;
        }
    }

    for(auto edge : connectivity)
    {
        size_t iPoint = edge.first;
        size_t jPoint = edge.second;

        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            phiMax(iPoint,iVar) = max(phiMax(iPoint,iVar), phi(jPoint,iVar));
            phiMin(iPoint,iVar) = min(phiMin(iPoint,iVar), phi(jPoint,iVar));

            phiMax(jPoint,iVar) = max(phiMax(jPoint,iVar), phi(iPoint,iVar));
            phiMin(jPoint,iVar) = min(phiMin(jPoint,iVar), phi(iPoint,iVar));
        }
    }

    for(auto edge : connectivity)
    {
        size_t iPoint = edge.first;
        size_t jPoint = edge.second;

        double d_ij[3] = {0.0, 0.0, 0.0};

        for(size_t iDim=0; iDim<nDim; ++iDim)
          d_ij[iDim] = 0.5*(coords(jPoint,iDim)-coords(iPoint,iDim));

        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            double proj_i = 0.0, proj_j = 0.0;

            for(size_t iDim=0; iDim<nDim; ++iDim)
            {
                proj_i += d_ij[iDim]*grad(iPoint,iVar,iDim);
                proj_j -= d_ij[iDim]*grad(jPoint,iVar,iDim);
            }

            double lim_i = phiMax(iPoint,iVar);
            double lim_j = phiMax(jPoint,iVar);

            const double eps = numeric_limits<double>::epsilon();

            if(proj_i <= 0.0)
            {
                lim_i = phiMin(iPoint,iVar);
                proj_i = min(proj_i, -eps);
            }

            if(proj_j <= 0.0)
            {
                lim_j = phiMin(jPoint,iVar);
                proj_j = min(proj_j, -eps);
            }

            lim_i = (lim_i-phi(iPoint,iVar))/proj_i;
            limiter(iPoint,iVar) = min(limiter(iPoint,iVar), lim_i);

            lim_j = (lim_j-phi(jPoint,iVar))/proj_j;
            limiter(jPoint,iVar) = min(limiter(jPoint,iVar), lim_j);
        }
    }

    for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
    {
        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            double lim = limiter(iPoint,iVar);
            limiter(iPoint,iVar) = lim*(lim+2)/(lim*lim+lim+2);
        }
    }
}

Something in the code above is a bit different from the implementation in SU2, namely:

double lim_i = phiMax(iPoint,iVar);
if(proj_i <= 0.0) {
    lim_i = phiMin(iPoint,iVar);
    proj_i = min(proj_i, -eps);
}

This is the bit of code that selects the right delta based on the sign of the projection and avoids division by zero, this less readable version does the same with one branch instead of three, simplifying "if" statements is essential for vectorization, so to make the comparison fair I used the same strategy in the scalar code.

To make this post shorter I will show the SIMD and parallel version of the code right away. Trying to process multiple edges instead of multiple variables has all the problems I mentioned for the gradients, so again we use the trick of templating on the number of variables.

template<size_t nVar>
void computeLimiters_impl(size_t nPoint,
                          size_t nDim,
                          const vector<size_t>& colorStart,
                          const vector<size_t>& edgeIdx,
                          const vector<pair<size_t,size_t> >& connectivity,
                          const Matrix& coords,
                          const Matrix& phi,
                          const VectorOfMatrix& grad,
                          Matrix& phiMax,
                          Matrix& phiMin,
                          Matrix& limiter)
{
    // initialize
    #pragma omp parallel for schedule(dynamic,TARGET_CHUNK_SIZE)
    for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
    {
        #pragma omp simd
        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            phiMax(iPoint,iVar) = phi(iPoint,iVar);
            phiMin(iPoint,iVar) = phi(iPoint,iVar);
            limiter(iPoint,iVar) = 2.0;
        }
    }

    // find min and max neighbor
    for(size_t color=0; color<colorStart.size()-1; ++color)
    #pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
    for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
    {
        #if SORT_BY_COLOR==1
            size_t iEdge = k;
        #else
            size_t iEdge = edgeIdx[k];
        #endif

        size_t iPoint = connectivity[iEdge].first;
        size_t jPoint = connectivity[iEdge].second;

        // some hand-holding needed for simd min/max with gcc,
        // one of the min/max operands needs to be on the stack
        // (so the compiler knows the two do not overlap?)
        double phi_i[nVar], phi_j[nVar];

        #pragma omp simd
        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            phi_i[iVar] = phi(iPoint,iVar);
            phi_j[iVar] = phi(jPoint,iVar);
        }

        #pragma omp simd
        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            phiMax(iPoint,iVar) = max(phiMax(iPoint,iVar), phi_j[iVar]);
            phiMin(iPoint,iVar) = min(phiMin(iPoint,iVar), phi_j[iVar]);
            phiMax(jPoint,iVar) = max(phiMax(jPoint,iVar), phi_i[iVar]);
            phiMin(jPoint,iVar) = min(phiMin(jPoint,iVar), phi_i[iVar]);
        }
    }

    for(size_t color=0; color<colorStart.size()-1; ++color)
    #pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
    for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
    {
        #if SORT_BY_COLOR==1
            size_t iEdge = k;
        #else
            size_t iEdge = edgeIdx[k];
        #endif

        size_t iPoint = connectivity[iEdge].first;
        size_t jPoint = connectivity[iEdge].second;

        // i to j vector
        double d_ij[3] = {0.0, 0.0, 0.0};

        for(size_t iDim=0; iDim<nDim; ++iDim)
          d_ij[iDim] = 0.5*(coords(jPoint,iDim)-coords(iPoint,iDim));

        // projections
        double proj_i[nVar], proj_j[nVar];

        #pragma omp simd
        for(size_t iVar=0; iVar<nVar; ++iVar)
          proj_i[iVar] = proj_j[iVar] = 0.0;

        for(size_t iDim=0; iDim<nDim; ++iDim)
        {
            #pragma omp simd
            for(size_t iVar=0; iVar<nVar; ++iVar)
            {
                proj_i[iVar] += d_ij[iDim]*grad(iPoint,iVar,iDim);
                proj_j[iVar] -= d_ij[iDim]*grad(jPoint,iVar,iDim);
            }
        }

        // choose the "right" delta based on sign of projection
        // and avoid division by zero
        double lim_i[nVar], lim_j[nVar];

        #pragma omp simd
        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            lim_i[iVar] = phiMax(iPoint,iVar);
            lim_j[iVar] = phiMax(jPoint,iVar);
        }

        const double eps = numeric_limits<double>::epsilon();

        // very simple if's are required to get vectorization
        // trough vector comparisons and masked blends
        #pragma omp simd
        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            if(proj_i[iVar] <= 0.0)
            {
                lim_i[iVar] = phiMin(iPoint,iVar);
                proj_i[iVar] = min(proj_i[iVar], -eps);
            }

            if(proj_j[iVar] <= 0.0)
            {
                lim_j[iVar] = phiMin(jPoint,iVar);
                proj_j[iVar] = min(proj_j[iVar], -eps);
            }
        }

        #pragma omp simd
        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            lim_i[iVar] = (lim_i[iVar]-phi(iPoint,iVar))/proj_i[iVar];
            limiter(iPoint,iVar) = min(limiter(iPoint,iVar), lim_i[iVar]);

            lim_j[iVar] = (lim_j[iVar]-phi(jPoint,iVar))/proj_j[iVar];
            limiter(jPoint,iVar) = min(limiter(jPoint,iVar), lim_j[iVar]);
        }
    }

    #pragma omp parallel for schedule(dynamic,TARGET_CHUNK_SIZE)
    for(size_t iPoint=0; iPoint<nPoint; ++iPoint)
    {
        #pragma omp simd
        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            double lim = limiter(iPoint,iVar);
            limiter(iPoint,iVar) = lim*(lim+2)/(lim*lim+lim+2);
        }
    }
}

Again to keep things short here is the parallel and SIMD point-loop version (like for gradients it is very similar to the scalar and sequential version).

void computeLimiters(size_t nPoint,
                     size_t nVar,
                     size_t nDim,
                     const Adjacency<4>& adj,
                     const Matrix& coords,
                     const Matrix& phi,
                     const VectorOfMatrix& grad,
                     Matrix& limiter)
{
    const size_t SIMDLEN = 4;
    using FltVec = Array<double,SIMDLEN>;

    // working variables
    FltVec phiMax[MAXNVAR], phiMin[MAXNVAR], prjMax[MAXNVAR], prjMin[MAXNVAR];

    const double eps = numeric_limits<double>::epsilon();

    #pragma omp parallel for schedule(dynamic,128) private(phiMax,phiMin,prjMax,prjMin)
    for(size_t iPoint=0; iPoint<nPoint; iPoint+=SIMDLEN)
    {
        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            phiMin[iVar] = phiMax[iVar] = phi.getVec(iPoint,iVar);
            prjMax[iVar] = eps;
            prjMin[iVar] = -eps;
        }

        for(size_t iNeigh=0; iNeigh<adj.nNeighbor(iPoint); ++iNeigh)
        {
            auto jPoint = adj.jPoint_vec(iPoint,iNeigh);

            FltVec d_ij[3] = {FltVec(0.0), FltVec(0.0), FltVec(0.0)};

            for(size_t iDim=0; iDim<nDim; ++iDim)
              d_ij[iDim] = (coords.getVec(jPoint,iDim)-
                            coords.getVec(iPoint,iDim))*0.5;

            for(size_t iVar=0; iVar<nVar; ++iVar)
            {
                FltVec prj = 0.0;

                for(size_t iDim=0; iDim<nDim; ++iDim)
                  prj += d_ij[iDim]*grad.getVec(iPoint,iVar,iDim);

                prjMax[iVar] = vmax(prjMax[iVar], prj);
                prjMin[iVar] = vmin(prjMin[iVar], prj);

                phiMax[iVar] = vmax(phiMax[iVar], phi.getVec(jPoint,iVar));
                phiMin[iVar] = vmin(phiMin[iVar], phi.getVec(jPoint,iVar));
            }
        }

        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            FltVec lim = vmin(FltVec(2.0), vmin(
                         (phiMax[iVar]-phi.getVec(iPoint,iVar))/prjMax[iVar],
                         (phiMin[iVar]-phi.getVec(iPoint,iVar))/prjMin[iVar]));

            limiter.setVec(iPoint,iVar, lim*(lim+2.0)/(lim*lim+lim+2.0));
        }
    }
}

In terms of algorithm, for each point we find the min and max neighbor values and the min (negative) and max (positive) projections, those are then combined in a final min(2, max/max, min/min) to which the limiter function is applied (this would also be applicable to Venkatakrishnan-[Wang] limiters). This is equivalent to the edge-loop, if statements are not required as due to cells being closed, if the positive projection is not zero, the negative one will also not be zero, therefore it is correct to always evaluate both ratios. This algorithm only needs min and max neighbors as small local variables instead of large global ones due to the way those values are determined. This is where the memory from the extra adjacency information is recovered.

Like @economon said, fusing the gradient kernel with the limiter kernel is trivial with these point loops, and I do not think it affects readability much since one can clearly tell "what is what" (I will not put it here but it really is a matter of copy paste), including the boundaries could be a bit more challenging, but I will give performance number nevertheless.

Performance summary

Code	Edge	Edge, SIMD on vars	Point	Point, SIMD on points
Speed 1 core	1.0	1.75	1.25	2.0
Speed 4 cores	2.45	2.7	4.5	7.0

The basic point version does not lose to edge based because, contrary to gradients, it does not require duplication of computations while benefiting from sequential access to gradients. Again the point-based implementation does really well in parallel, limiters are more compute intensive and so the scaling is almost perfect. For reference, limiters are 1.9 times more expensive to compute than gradients with the reference edge version. With point loops, SIMD, and in parallel, gradients and limiters cost the same.

If we consider the combined cost of gradients and limiters, and compare the scalar "edge+edge" with the SIMD "point+point" and "fused point" we get:

G+L Approach	Edge+Edge	Point+Point	Fused Point
Speed 1 core	1.0	1.75	1.85
Speed 4 cores	2.3	5.35	6.1

Fusing point loops only gives a 14% improvement vs separate loops due to the difference in gathered data, only one gather is amortized and the remaining memory accesses are very efficient. Nevertheless if it can be done nicely while accounting for boundaries (which may have to be handled outside the loop) it could allow some memory savings for the discrete adjoint.

Thank you for the very rigorous work. May be it would be nice to keep the possibility to do async MPI-I/O later. For reference, people from intel explain a clever strategy in a recently published paper "Asynchronous and multithreaded communications on irregular applications using vectorized divide and conquer approach,". They were using this library https://github.com/EXAPARS/DC-lib to achieve at least a 3% gain.

Thanks @MicK7 I will have a look, my initial thought was to have a simple strategy where within each MPI rank parallelism is extracted via colouring or scatter-to-gather transformations and only one thread per rank participates in the message passing, I have no experience here though so this might be a bad strategy, idk.

Back to business: I went silent for a bit because in prototyping a typical residual computation and matrix update loop I made some realisations that made me go back to the drawing board regarding data structures, and eventually back to square 0.

Parallel strategy for flux computation

Because significant computation is required to obtain each edge's flux, it does not make sense to attempt a "point-loop" strategy (which would double the effort). However, one can either use colouring to avoid the race conditions that would result from updating the residual of cells i and j, or store the edge fluxes and then, on a second point-loop perform the summation of fluxes for each cell, with the direction being accounted by the same adjacency information used in the point-loop GG gradient computation. If we consider only the update of residuals the two strategies are fairly equivalent performance wise, the tie breaker is the matrix updates.

Matrix Updates

By this I mean the addBlock, subBlock we do (two times each) to update diagonal and off-diagonal blocks for each edge. Here is a dummy numerics loop that does nothing else but setting blocks in the matrix (with colouring).

void testLoop1(const vector<size_t>& colorStart,
               const vector<size_t>& edgeIdx,
               const vector<pair<size_t,size_t> >& connectivity,
               double** blk_i, double** blk_j,
               SparseMatrix& matrix)
{
    matrix.setZero();

    for(size_t color=0; color<colorStart.size()-1; ++color)
    #pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
    for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
    {
        size_t iEdge = edgeIdx[k];
        size_t iPoint = connectivity[iEdge].first;
        size_t jPoint = connectivity[iEdge].second;

        matrix.addBlock(iPoint, iPoint, blk_i);
        matrix.addBlock(iPoint, jPoint, blk_j);

        matrix.subBlock(jPoint, jPoint, blk_j);
        matrix.subBlock(jPoint, iPoint, blk_i);
    }
}

This and a few more memory reads is why we can't have nice things, i.e. massive speedups with vectorization. Believe it or not this loop sets ~75% of the maximum speed at which the residual edge loop can run (bandwidth bottleneck). Don't be sad though, we can make a few things about it better:

We can store the blocks we insert contiguously so the writes can be vectorized (this would be done using a container so that we still have (i,j) access syntax);
On each insertion we have to first look for the block by traversing the colInd (column index) array, we can instead map the diagonal blocks to the corresponding points and the off-diagonal blocks to the edge (remember we insert "by the edge").
We can fuse numerics (possibly using the decorator pattern) so that we write to the matrix only once per iteration, which means we only need to clear the diagonal blocks and not the entire matrix because we can set the off-diagonals instead of updating them.

Assuming these modification our dummy loop becomes

void testLoop2(const vector<size_t>& colorStart,
               const vector<size_t>& edgeIdx,
               const vector<pair<size_t,size_t> >& connectivity,
               const double* blk_i, const double* blk_j,
               SparseMatrix& matrix)
{
    matrix.setDiagZero();

    for(size_t color=0; color<colorStart.size()-1; ++color)
    #pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
    for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
    {
        size_t iEdge = edgeIdx[k];
        size_t iPoint = connectivity[iEdge].first;
        size_t jPoint = connectivity[iEdge].second;

        matrix.updateBlocks(iEdge, iPoint, jPoint, blk_i, blk_j);
    }
}

where

STRONGINLINE void SparseMatrix::updateBlocks(size_t edge,
    size_t row, size_t col, const double* blk_i, const double* blk_j)
{
    size_t bii = diagMap[row],  bij = edgeMap[edge].first,
           bjj = diagMap[col],  bji = edgeMap[edge].second;

    #pragma omp simd
    for(size_t k=0; k<blkSz; ++k)
    {
        coeffs[bii+k] += blk_i[k];  coeffs[bij+k] = +blk_j[k];
        coeffs[bji+k] = -blk_i[k];  coeffs[bjj+k] -= blk_j[k];
    }
}

This is 47% faster, which for a memory bound task is massive! Yes, this does increase the memory footprint a bit (makes CSysMatrix 4% larger for a 3D problem) but I can get that back by sharing sparsity patterns and maps across turbulence and bulk flow (I think @talbring was already working on this in the template linear solver branch he had started).

We could also parallelize the matrix updates without colouring by setting only the off-diagonal coefficients and then setting the diagonal entries to the column sum. It turns out that this is worse (by about 10%), maybe if the matrix were symmetric (row sum) but a column sum accesses blocks very far apart. Also we want to interleave compute and load/stores as much as possible to allow the CPU pipelining magic to mask the latency of the latter (even if it looks like you can only write the block after it is computed, CPU's have all kinds of buffers that allow the next loop iteration to begin while data is in flight).

Therefore colouring is the way to go.

Note: With vectorized numerics we insert blocks for 4 or 8 edges into the matrix at a time, the data for those inserts will be in a slightly weird format, which will make SparseMatrix::updateBlocks a bit harder on the eye, more on that later.

MUSCL Reconstruction

The MUSCL reconstruction, characteristic of upwind schemes, is the simplest building block to show the (negative) implications of storing the data as structures of arrays (SoA) on the performance of some operations. Here is the most basic numerics you can think of, reconstruct and average (the dummy matrix loop was to benchmark the writes this is to benchmark the reads)

void computeResidual(size_t nVar,
                     size_t nDim,
                     const vector<size_t>& colorStart,
                     const vector<size_t>& edgeIdx,
                     const vector<pair<size_t,size_t> >& connectivity,
                     const Matrix& coords,
                     const Matrix& phi,
                     const VectorOfMatrix& grad,
                     const Matrix& limiter,
                     Matrix& residual)
{
    residual.setZero();

    for(size_t color=0; color<colorStart.size()-1; ++color)
    #pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
    for(size_t k=colorStart[color]; k<colorStart[color+1]; ++k)
    {
        size_t iEdge = edgeIdx[k];
        size_t iPoint = connectivity[iEdge].first;
        size_t jPoint = connectivity[iEdge].second;

        double d_ij[MAXNDIM];
        for(size_t iDim=0; iDim<nDim; ++iDim)
            d_ij[iDim] = 0.5*(coords(jPoint,iDim)-coords(iPoint,iDim));

        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            double phiL = phi(iPoint,iVar);
            double phiR = phi(jPoint,iVar);

            for(size_t iDim=0; iDim<nDim; ++iDim)
            {
                phiL += limiter(iPoint,iVar)*grad(iPoint,iVar,iDim)*d_ij[iDim];
                phiR -= limiter(jPoint,iVar)*grad(jPoint,iVar,iDim)*d_ij[iDim];
            }

            double flux = 0.5*(phiL+phiR);

            residual(iPoint,iVar) += flux;
            residual(jPoint,iVar) -= flux;
        }
    }
}

after vectorizing this to handle multiple edges simultaneously with the SIMD-friendly type the core of the loop becomes

        using FltVec = Array<double,SIMDLEN>;
        ...

        FltVec d_ij[MAXNDIM];
        for(size_t iDim=0; iDim<nDim; ++iDim)
            d_ij[iDim] = (coords.getVec(jPoint,iDim)-coords.getVec(iPoint,iDim))*0.5;

        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            FltVec phiL = 0.0;
            FltVec phiR = 0.0;

            for(size_t iDim=0; iDim<nDim; ++iDim)
            {
                phiL += grad.getVec(iPoint,iVar,iDim)*d_ij[iDim];
                phiR -= grad.getVec(jPoint,iVar,iDim)*d_ij[iDim];
            }

            phiL = phi.getVec(iPoint,iVar) + limiter.getVec(iPoint,iVar)*phiL;
            phiR = phi.getVec(jPoint,iVar) + limiter.getVec(jPoint,iVar)*phiR;

            FltVec flux = (phiL+phiR)*0.5;

            for(size_t k=0; k<SIMDLEN; ++k) {
                residual(iPoint[k],iVar) += flux[k];
                residual(jPoint[k],iVar) -= flux[k];
            }
        }

Note that at the end of the loop we need to de-swizzle the flux to update the multiple indexes references by iPoint and jPoint, which are now short arrays of integers (this operation can be moved to the container, akin to getVec but I show it here for clarity).

With SoA (aka column major storage) this code is 1.5 times slower than the scalar version.

The reason for that is poor locality (of the spacial variety), as we loop through the number of variables and dimensions we are accessing the data in strides of nPoint, as the contiguous index is the first one so that we can perform vector read/writes when computing gradients and limiters. With the scalar version the data for each point is contiguous which means on the first access we get whatever extra data is on the same cache line for free and subsequent accesses will be hardware prefetched since the stride is small (1 in this case). We lose all this with SoA storage.

If we go back to arrays of structures (AoS, aka row major storage, basically what we have in #753) performance is only 9% worse (the code is identical). Those 9% are mostly due to increased integer arithmetic in the accesses to the data, on each call to getVec we resolve 4/8 row/column pairs into 1D indexes, while this calculation is vectorized, it seems to be less optimizable by compilers, for example this

for(size_t iDim=0; iDim<nDim; ++iDim)
    phiL += grad.getVec(iPoint,iVar,iDim)*d_ij[iDim];

gets compiled into this monstrosity

.L13:
        vpmuludq        ymm0, ymm4, ymm1
        vmovq   xmm15, rax
        vmovapd ymm6, ymm11
        mov     rdx, rax
        vpbroadcastq    ymm15, xmm15
        sal     rdx, 5
        add     rax, 1
        vpaddq  ymm0, ymm0, ymm2
        vpsllq  ymm0, ymm0, 32
        vpaddq  ymm0, ymm5, ymm0
        vmovdqa YMMWORD PTR [rbp-240], ymm0
        vpaddq  ymm0, ymm3, ymm0
        vmovdqa YMMWORD PTR [rbp-208], ymm0
        vpaddq  ymm0, ymm15, ymm0
        vmovdqa YMMWORD PTR [rbp-176], ymm0
        vgatherqpd      ymm15, QWORD PTR [rdi+ymm0*8], ymm6
        vmovapd ymm0, YMMWORD PTR [rsi+rdx]
        vfmadd213pd     ymm0, ymm15, YMMWORD PTR [rbp-336]
        vmovapd YMMWORD PTR [rbp-336], ymm0
        cmp     rbx, rax
        jne     .L13

the meat of which is vgatherqpd (getVec) and vfmadd213pd fused-multiply-add to update phiL, everything else is integer arithmetic which in the scalar version gets factored out of the inner loop so that the resulting assembly looks much simpler:

.L15:
        vmovsd  xmm5, QWORD PTR [rsp-40+rax*8]
        vfmadd231sd     xmm0, xmm5, QWORD PTR [r15+rax*8]
        add     rax, 1
        cmp     rcx, rax
        jne     .L15

I think the reason for this is that there are plenty of integer registers (64bit) to keep memory locations (rsp, rax, r15 in the above) but there are only 16 ymm registers (256bit). In any case we need to give the compiler a hand, the calculation we need is index = iPoint*nVar*nDim + iVar*nDim + iDim where iPoint is an array of ints Note that as we loop by nDim and then by nVar all we need is to compute iPoint*nVar*nDim outside the loops and then add 1 on each access (which is more or less what the compiler does for the scalar code), in other words we need an iterator, something silly like

template<size_t VecLen, size_t Incr = VecLen>
class GatherIterator
{
  private:
    using IntVec = Array<size_t,VecLen>;
    using FltVec = Array<double,VecLen>;

    IntVec offsets_;
    const double* data_;
  public:
    GatherIterator() = delete;
    GatherIterator(const double* data, IntVec offsets) : offsets_(offsets), data_(data) {}

    STRONGINLINE FltVec operator()() const { return FltVec(data_,offsets_); }

    STRONGINLINE FltVec operator++(int) {
        auto ret = (*this)();  offsets_ += Incr;  return ret;
    }
};

so silly in fact, it only moves forward, we use it in our loop like so

...
        auto gradI = grad.getColIterator(iPoint);
        auto gradJ = grad.getColIterator(jPoint);

        for(size_t iVar=0; iVar<nVar; ++iVar)
        {
            FltVec phiL = 0.0;
            FltVec phiR = 0.0;

            for(size_t iDim=0; iDim<nDim; ++iDim)
            {
                phiL += (gradI++)*d_ij[iDim];
                phiR -= (gradJ++)*d_ij[iDim];
            }
...

to get better assembly

.L7:
        vmovapd ymm3, ymm13
        vmovapd ymm2, YMMWORD PTR [rbp-400]
        add     rax, 32
        vgatherqpd      ymm0, QWORD PTR [rcx+ymm1*8], ymm3
        vpaddq  ymm1, ymm1, ymm11
        vmovapd YMMWORD PTR [rbp-272], ymm0
        vmovapd YMMWORD PTR [rbp-240], ymm0
        vfmadd132pd     ymm0, ymm2, YMMWORD PTR [rax-32]
        vmovdqa YMMWORD PTR [rbp-208], ymm1
        vmovapd YMMWORD PTR [rbp-400], ymm0
        cmp     rax, rbx
        jne     .L7

which makes the vectorized code perform just as well as the scalar code, iterators could also be used for the other variables but that would start to hurt readability without improving the performance much.

Note: There is also a chance the compiler (gcc) is not doing this kind of optimization because of the way I wrote the code...

So we need AoS to avoid losing performance in lightweight numerics classes.

Before we look into the impact of not using SoA in the gradient and limiters routines let me tell you there is a way to have the best of both worlds, enter the array of structures of arrays or as I like to call it zig zag storage, aka a right mess. Imagine an AoS of short arrays of SIMD length, e.g. { {u0 u1 u2 u3} {v0 ... v3} {w0 ... w3} {u4 u5 u6 u7} ... } with that it is possible to fully vectorize point loops as the first index (iPoint) is contiguous in groups of SIMD length and when looping along variables and dimensions in edge loops the stride is small enough (equal to SIMD length) to trigger hardware prefetching. The catch is that we need even more integer arithmetic and so we really need iterators to amortise that cost, there is also the drawback that scalar usage of such a container would be terrible.

For these reasons I think we should sacrifice ultimate performance and keep node data in AoS storage.

The major impact on gradients and limiters is the way the code is written, to vectorize the computation we need to compute the gradient into a local variable and then "transpose" it when storing it, i.e.

        FltVec phiI[MAXNVAR], gradI[MAXNVAR][MAXNDIM];
        ...
            for(size_t iVar=0; iVar<nVar; ++iVar)
            {
                auto flux = weight*(phiI[iVar]+phi.getVec(jPoint,iVar));

                for(size_t iDim=0; iDim<nDim; ++iDim)
                   gradI[iVar][iDim] += a_ij[iDim]*flux;
            }
        }

        for(size_t iVar=0; iVar<nVar; ++iVar)
          for(size_t iDim=0; iDim<nDim; ++iDim)
            for(size_t k=0; k<SIMDLEN; ++k)
              grad(iPoint+k,iVar,iDim) = gradI[iVar][iDim][k];
        ...

Similarly when computing the gradient we need to first fetch/transpose it to be able to vectorize subsequent computations

        FltVec gradI[MAXNVAR][MAXNDIM];

        for(size_t iVar=0; iVar<nVar; ++iVar)
          for(size_t iDim=0; iDim<nDim; ++iDim)
            for(size_t k=0; k<SIMDLEN; ++k)
              gradI[iVar][iDim][k] = grad(iPoint+k,iVar,iDim);
        ...

Performance wise this is actually better than the SoA version (4% on gradients, 35% on limiters) as it also benefits from better locality, and it is only slightly (3%) worse than zig zag storage, especially when fusing limiters and gradients as the transposition of the gradient into storage is greatly amortised. Regarding readability, the 3 nested loops can be moved to methods of the container, but we cannot get rid off the local variable (if we want vectorization that is).

We lose the ability to vectorize primitive variable updates efficiently with AoS but currently that only accounts for 3% of the runtime and it is a memory bound operation therefore it would not gain much from vectorization anyway.

On the subject of de-swizzling data remember I said the writes into CSysMatrix would be a bit weird, that is because each Jacobian contribution will be a "matrix of short arrays" that needs to be transformed into a short array of matrices, the result of that is code like the above that explicitly manipulates the lanes of our SIMD type, such code can be completely hidden inside CSysMatrix which is good because a 4x4 vectorized transpose and matrix update looks like this

// block j, subs from jj and goes to ij
T0 = blk_j[ k ].unpackLo(blk_j[k+1]);    T1 = blk_j[ k ].unpackHi(blk_j[k+1]);
T2 = blk_j[k+2].unpackLo(blk_j[k+3]);    T3 = blk_j[k+2].unpackHi(blk_j[k+3]);

C0 = T0.widePermuteLo(T2);    C1 = T1.widePermuteLo(T3);
C2 = T0.widePermuteHi(T2);    C3 = T1.widePermuteHi(T3);

(Array4d(&bjj[0][k])-C0).store(&bjj[0][k]);
(Array4d(&bjj[1][k])-C1).store(&bjj[1][k]);
(Array4d(&bjj[2][k])-C2).store(&bjj[2][k]);
(Array4d(&bjj[3][k])-C3).store(&bjj[3][k]);

C0.store(&bij[0][k]);    C1.store(&bij[1][k]);
C2.store(&bij[2][k]);    C3.store(&bij[3][k]);

I am showing this because it represents a readability worst case in terms of manipulating SIMD types, we might end up with one or two of these to get the best performance possible but they will always be encapsulated and deep in kernel-type areas of SU2 that are almost never touched.

Conclusions

Over 45% faster CSysMatrix updates by mapping off-diagonal blocks to edges and diagonal blocks to points.
Colouring is the best strategy for hybrid parallelism of compute-heavy edge loops and matrix updates as it interleaves compute and memory operations.
AoS storage should be kept to avoid significant loss of performance in compute-light edge loops due to poor locality of SoA storage.
Major implication of AoS is on point loops where some data needs to be fetched (transposed) into local variables for effective vectorization.
An intermediate storage scheme, AoSoA, can provide both good locality and vectorization of point loops, however it requires that data be accessed via special iterators and scalar code based on it would have poor performance.

Next I will try to estimate how much we can gain for a "realistic" numerics class.

This has been a long long exposition (nerd joke) but bear with me I am almost done, and I will summarise the results in the form of a proposal (I'll probably put that at the top of the first post).

"Real" numerics

Real in the sense that the flop to byte ratio (amount of computation per amount of data) is comparable to a real numerics scheme, say Roe for example. The simplest way to do this is to combine the example code for MUSCL reconstruction with the matrix updates code and add something compute heavy between input and output, e.g. a number of matrix-matrix multiplications, here is some pseudo code for what I did:

void computeResidual(size_t nVar,
                     size_t nDim,
                     const vector<Connectivity<SIMDLEN> >& connectivities,
                     const Matrix& coords,
                     const Matrix& phi,
                     const VectorOfMatrix& grad,
                     const Matrix& limiter,
                     RowMajorMatrix& residual,
                     SparseMatrix& matrix)
{
    using FltVec = Array<double,SIMDLEN>;

    residual.setZero();
    matrix.setDiagZero();

    size_t color = 0;
    for(const auto& connectivity : connectivities)
    {
      #pragma omp parallel for schedule(dynamic,CHUNK_SIZE)
      for(size_t iEdge=0; iEdge<connectivity.size(); iEdge+=SIMDLEN)
      {
          auto iPoint = connectivity.first_vec(iEdge);
          auto jPoint = connectivity.second_vec(iEdge);

          FltVec d_ij[MAXNDIM];
          for(size_t iDim=0; iDim<nDim; ++iDim)
            d_ij[iDim] = (coords.getVec(jPoint,iDim)-coords.getVec(iPoint,iDim))*0.5;

          FltVec phiL[MAXNVAR], phiR[MAXNVAR], flux[MAXNVAR],
                 blk_i[MAXNVAR*MAXNVAR],
                 blk_j[MAXNVAR*MAXNVAR];

          for(size_t iVar=0; iVar<nVar; ++iVar)
          {
              // Reconstruction goes here

              flux[iVar] = (phiL[iVar]+phiR[iVar])*0.5;
          }

          // some silly way to make the Jacobians depend on the reconstruction
          for(size_t iVar=0; iVar<nVar; ++iVar)
            for(size_t jVar=0; jVar<nVar; ++jVar)
              blk_j[iVar*nVar+jVar] = (phiL[iVar]*phiR[jVar]-phiL[jVar]*phiR[iVar])*0.5;

          // the matrix-matrix multiplications
          for(size_t i=0; i<WORKITERS; ++i) {
            // blk_i = blk_j * blk_j
            for(size_t k=0; k<nVar*nVar; ++k) blk_j[k] = blk_i[k];
          }

          // something akin to a dissipation term
          for(size_t iVar=0; iVar<nVar; ++iVar) {
            FltVec sum = flux[iVar];
            for(size_t kVar=0; kVar<nVar; ++kVar)
              sum += blk_j[iVar*nVar+kVar]*(phiL[kVar]-phiR[kVar])*0.5;

          // residuals for iPoint and jPoint updated here

          matrix.updateBlocks_v(color, iEdge, iPoint, jPoint, blk_i, blk_j);
      }
      ++color;
    }
}

The more WORKITERS we have the better the vectorized code is going to look, I used a conservative number based on: For the Roe scheme 4 matrices are generated (Jacobian i, Jacobian j, P tensor, P^-1 tensor), each coefficient of those matrices requires a reasonable number of floating point ops, and two of those matrices are indeed multiplied by each other. So lets say 5 matrix-matrix multiplications are representative, this should be a conservative estimate as I am not considering the eventual fusion of convective and diffusive discretizations.

The vectorized code is 1.5 times faster. This is a fair 1.5 as the code is running on 4 fast cores (parallel via colouring for the reasons I explained previously) and 2 memory channels (scalar code can eventually saturate the memory bandwidth too, but it would take an unreasonable ratio of cores to channels to do so). Furthermore the scalar code I am considering is writing to CSysMatrix with all the mapping and vectorized writes I mentioned before, before you get all compound interest and take this 1.5 with the 1.47 from CSysMatrix, the speedup relative to code without mapping and vector writes is 1.85. I restate that this does not require changes to the data layout, again for reasons previously mentioned.

SpMv - Sparse matrix-vector multiplication

With all these speedups the linear solvers will start taking well over 50% of the time, and so it is desirable to make some improvements there too. Sadly SpMv is as bandwidth bound as it gets, 1 FMA per 8 bytes, nonetheless I implemented some number-of-variable-specific kernels (for nVar=4 and nVar=5) and I can get about 1.12 speedup (same realistic core to channel conditions). I am not going to dump that code here because it is not too nice to look at (it uses intrinsics) but again that would be something hidden away in CSysMatrix that most people would not need to look at, and there would be a safe generic fall-back for arbitrary number of variables.

I think I will do the estimated global speedup together with the summary/proposal.

I'll have the summary at the bottom, you guys read papers you know to look for the conclusions.

Global speedup (Conclusions)

I took the expected speedups for each prototyped area of the code and applied them to the profile I measured for the benchmark case from #716 / #753 (which is not very linear solver intensive). It is hard to estimate the effect of fusing the convective and viscous edge loops, I assumed the cheaper of the two becomes "free" and the other gets a speedup of 1.5 (speedup should be 1.85 accounting for the matrix update optimizations). I assumed that gradients and limiters are not fused and that minor areas of the code get no speedup, which is not necessarily true as this work would require contiguous storage of geometric properties (geometry->node and geometry->edge) and so some speedup is expected due to that, but in the absence of evidence I prefer to be conservative. Here are the numbers: The take home number is 1.7. Cumulative with the 1.4 from contiguous storage, so 2.4 total.

Despite most of my posts being focused on SIMD my main motivation is the hybrid parallelization which will allow important algorithmic improvements when running on hundreds of cores, namely the multigrid and additive linear preconditioners will retain their effectiveness at much larger core counts. I will not hazard an estimate for this.

Proposed changes (Summary)

Hybrid parallel

Gradients (GG) and limiters converted to point loops (embarrassing parallel, no edge colouring needed).
Residual loops parallelized with edge colouring, if anyone has suggestions regarding algorithms send them my way, so far I only tested greedy colouring on edge groups to preserve spatial locality (this aspect is absolutely fundamental).
Matrix multiplication, BCSR format is also embarrassing parallel.
Linear preconditioners, additive decomposition within each MPI rank over a number of threads that need not be the same as that used for other areas.

SIMD

For gradients, limiters, and numerics a SIMD-friendly type will be used, by and large the high level code will look the same (see my examples above) but multiple points/edges will be processed simultaneously.
No changes will be required to the data layout of CVariable, which means there will be no penalty to scalar code, new overloads for accessor methods will be required to set/get SIMD-arrays of data instead of scalars.
The geometric properties and graph information associated with CGeometry will require the same contiguous storage treatment received by CVariable.
Linear algebra, dimension-specific multiplication kernels with manual low level optimisations (unrolling, vectorization through intrinsics, and software prefetching), complex code but encapsulated in a low maintenance area of the code.

General improvements

Fusing convective and viscous residual loops, this will NOT be done by super gluing convective and viscous numerics. I do not have a concrete design in mind yet but I hope to be able to do something akin to the decorator pattern I mentioned before. I also plan to make the numerics a full-fledged visitor to reduce compilation dependencies, currently a change to numerics may affect CVariable CSolver and CNumerics (as mentioned by @clarkpede), CSolver is involved because it is responsible for fetching data, I may move that to CNumerics, which would let them fetch data as needed.
Lookup accesses to CSysMatrix to avoid linear searches, the small increase in memory usage can be mitigated by sharing these lookup tables/arrays and the sparsity pattern between matrices associated with the same geometry (on the subject of these lookup tables, knowing where the diagonal of a row is should enable more efficient upper/lower multiplications).

Work items

This is a lot of work and some changes will be significant, I will divide the work in steps, off the top of my head this order seems ok:

CSysMatrix lookups and any required parallel pragmas put in place.
Point-loop gradients and limiters, no SIMD (at which point I hope #777 to be finished so the next item does not disrupt it).
Split CSolvers over multiple files, colour-based parallel residual loops (at this point we have hybrid parallel!).
Contiguous storage of CGeometry members (to eventually enable vectorization).
Split CNumerics over files, architecture for fusing convective and viscous loops (the most disruptive change).
SIMD, of gradients, limiters, and numerics. SIMD in CSysMatrix can be started anywhere after item 1.

If you foresee conflicts with your ongoing work let's start talking sooner rather than later, I tried to order items to delay major disruptions as much as possible. Also I would prefer not to take the silence of the community as acceptance, when you have a minute to spare (after the 7.0 release) please leave your opinion.

Easter Egg - Mixed Precision

The work proposed above should have the solver run at the speed dictated by the available memory bandwidth (for implicit applications). Keeping CSysMatrix in single precision, and solving the linear systems also in single precision, should therefore provide a good speedup (probably around 1.5 extra speedup). Since all the flux computations would still be done in double precision no loss of accuracy would be incurred. However some stability can be lost on meshes with very high aspect ratios, therefore there would be a compile-time switch for this mode and the default would be all doubles (hence the Easter egg designation).

This should be relatively easy to do cleanly since the relevant classes are already templated and have "mixing-type logic" for when they are used with AD. Except for central schemes, currently we would not gain much by doing it as residual loops are compute-bound and the linear solvers do not use that much time.

Ok, SIMD update, with #753, #959, and #966 we now have a unified storage type for the data we need in CNumerics. This means that we (I) only need to implement "SIMD accessor methods" (i.e. that return a SIMD type instead of a su2double) for one class (C2DContainer and co.).

I think to do SIMD right we need a new way of going about CNumerics, these are my design requirements for "CNewNumerics":

Thread-safe (consequently const-correct), a single object must be safe to use by multiple threads.
Minimal indirection, the polymorphic overhead will be reduced to a single call (ComputeResidual), all auxiliary functions (like the ones that live in CNumerics) will be forcefully inlined, no useless copies of data (no more numerics->SetBlaBla).
Compile time nDim/nVar, which will: a) avoid heap allocations; b) allow effective loop-unrolling (also during the writes to CSysMatrix).
Allow low-overhead composition, e.g. have one CRoe for Euler and one CRoeVisc for NS to fuse convective and viscous residuals (without copy pasting code).
A lot more readable, those cryptic V_i[iDim+1] need to go (somehow...).

To achieve all this, the "CNewNumerics" will work as a template (obvs) decorator/visitor. A visitor in the sense that the solver calls the numerics and gives it (read-only) access to all its data, the object pulls whatever it needs directly and there is no need for numerics->SetBlaBla. A template decorator in the sense that the class can be augmented simply by inheriting from another, along the lines of class CRoeVisc : public CRoe, public CVisc (to allow fusing residual and Jacobian contributions). All this needs to be done with templates for the "minimal indirection" requirement. Which means for each numerical method we will have 4 explicit template instantiations (Euler2D, Euler3D, (RA)NS2D, (RA)NS3D) but in the end these are still polymorphic objects that will be instantiated by some factory function (i.e. it will look clean, especially because I will not port all methods in one go xD).

The template machinery to support this is actually not too crazy:

#include <array>
#include <cmath>

// An example type to use instead of the container that stores solution data for all vertices.
struct SolutionContainer
{
    std::array<double,3> velocity;
    std::array<double,3> areaVector;
};

using ResultType = double;

// We want classes with this interface.
class VirtualInterface
{
  public:
    virtual ResultType Compute(const SolutionContainer&) const = 0;
};

// The Compute method is to be composed via an inheritance chain, to do this
// we allow each building block to inherit from any class. These classes should
// be function objects that have no member variables, all data used in the
// resulting Compute method will be on the stack.
template<typename Base>
class ComputeArea : Base
{
  protected:
    // Different template instantiations will be made for
    // 2D/3D to allow perfect loop unrolling.
    enum : int {nDim = Base::nDim};

    // To share variables between building blocks we will pass
    // down a struct which is also composed by inheritance
    struct WorkVarsType : Base::WorkVarsType
    {
        double area; // add "area" to the variables of Base
    };

    // The final implementation of Compute will be a call down the chain.
    // The final constructed WorkVarsType is not known at this stage,
    // hence we also template the method.
    template<typename WV>
    void Compute(WV& wv, const SolutionContainer& sol) const
    {
        // Boilerplate, call base first. This is akin to the decorator design pattern
        // without polymorphism. The working variables resemble Python's "self" which
        // makes this solution reasonably idiomatic.
        Base::Compute(wv, sol);

        // Then do our specific job.
        wv.area = 0.0;
        for(int i=0; i<nDim; ++i)
            wv.area += pow(sol.areaVector[i],2);
        wv.area = sqrt(wv.area);
    }
};

// Same mechanics as above
template<typename Base>
class ComputeFlux : Base
{
  protected:
    enum : int {nDim = Base::nDim};

    struct WorkVarsType : Base::WorkVarsType 
    {
        double flux; // ...add new member
    };

    template<typename WV>
    void Compute(WV& wv, const SolutionContainer& sol) const
    {
        // ...call base
        Base::Compute(wv,sol);

        // ...do aditional work
        wv.flux = 0.0;
        for(int i=0; i<nDim; ++i)
            wv.flux += sol.velocity[i]*sol.areaVector[i];
    }
};

// This class is used to terminate the chain, it makes the link
// with the interface and it is used to specify any fixed sizes.
template<int NDIM>
class Terminator : private VirtualInterface
{
  protected:
    enum : int {nDim = NDIM};

    struct WorkVarsType {};

    template<typename... Ts>
    void Compute(Ts&...) const {}
};

// Finally we use the building blocks to implement Compute.
// The blocks can be reordered depending on application to
// help the compiler fuse loops or minimize register spillage,
// the resulting WorkVarsType definition will be equivalent.
class ComposedClass: public
    ComputeFlux< ComputeArea< Terminator<3> > >
{
  public:
    ResultType Compute(const SolutionContainer& sol) const;
};

ResultType ComposedClass::Compute(const SolutionContainer& sol) const
{
    // Create the working variables on the stack.
    ComputeFlux::WorkVarsType wv;

    // Pass down the working variables and whatever other arguments.
    // If the convention was followed, all building blocks will run.
    // Recall that all Compute's were templates, they will be
    // instantiated here and we can force them to be inlined.
    ComputeFlux::Compute(wv, sol);

    // Do some additional work if needed and return result.
    return wv.flux / wv.area;
}

Care for some assembly?

Done, but no one wants to review #1022

su2code / SU2