Performance issue with tight loops over slices

dextorious commented 7 years ago

Consider the following simplified collision kernel taken from a lattice Boltzmann simulation:

import std.datetime;
import std.stdio;
import std.random;
import mir.ndslice;

static void collide_kernel(Tensor, Matrix)
(Tensor n, Matrix ux, Matrix uy, Matrix rho,
  immutable double[] ex, immutable double[] ey, immutable double[] w, 
  immutable int Q, immutable int NX, immutable int NY, immutable double omega) 
{
  foreach (i; 0..NX) {
    foreach (j; 0..NY) {
      double ux0 = 0.0;
      double uy0 = 0.0;
      double rho0 = 0.0;
      foreach (q; 0..Q) {
        double nq = n[i][j][q];
        rho0 += nq;
        ux0 += ex[q] * nq;
        uy0 += ey[q] * nq;
      }
      double rho_inv = 1. / rho0;
      ux0 *= rho_inv;
      uy0 *= rho_inv;
      double usqr = ux0 * ux0 + uy0 * uy0;
      foreach (q; 0..Q) {
        double eu = 3 * (ex[q] * ux0 + ey[q] * uy0);
        double neq = rho0 * w[q] * (1.0 + eu + 0.5*eu*eu - 1.5*usqr);
        n[i][j][q] = (1 - omega)*n[i][j][q] + omega*neq;
      }
      ux[i][j] = ux0;
      uy[i][j] = uy0;
      rho[i][j] = rho0;
    }
  }
}

int main(string[] args) {
  immutable uint NX = 1000;
  immutable uint NY = 1000;
  immutable uint Q  = 9;
  auto n = new double[NX*NY*Q].sliced(NX,NY,Q);
  auto ux = new double[NX*NY].sliced(NX,NY);
  auto uy = new double[NX*NY].sliced(NX,NY);
  auto rho = new double[NX*NY].sliced(NX,NY);
  immutable double[Q] ex = [0., 1., 0., -1., 0., 1., -1., -1., 1.];
  immutable double[Q] ey = [0., 0., 1., 0., -1., 1., 1., -1., -1.];
  immutable double[Q] w = [4. / 9., 
    1. / 9., 1. / 9., 1. / 9., 1. / 9., 
    1. / 36., 1. / 36., 1. / 36., 1. / 36.];

  Random gen;
  foreach (i; 0..NX)
    foreach (j; 0..NY) {
      ux[i][j] = 0;
      uy[i][j] = 0;
      rho[i][j] = 1;
      foreach (q; 0..Q)
        n[i][j][q] = uniform(0.0, 1.0, gen);
    }

  uint benchmarkCount = 100u;
  StopWatch sw;
  alias Tensor = typeof(n);
  alias Matrix = typeof(ux);
  sw.start();
  foreach (t; 0..benchmarkCount)
    collide_kernel!(Tensor, Matrix)(n, ux, uy, rho, ex, ey, w, Q, NX, NY, 0.6);
  sw.stop();
  writeln(n[0][0][0]);
  double t1 = sw.peek().usecs;
  writeln(t1 / 1000 / benchmarkCount);
  return 0;
}

when compiled with LDC 1.2.0 (via dub, using --build=release --compiler=ldc and dflags-ldc":["-mcpu=native"] in dub.json and using mir-algorithm v0.6.3) on a Haswell CPU the code yields a consistent timing of 55 ms for the collide_kernel call. The equivalent code in C++ (given in https://gist.github.com/dextorious/9a65a20e353542d6fb3a8d45c515bc18 ), compiled with clang v4.0.0 using -std=c++14 -O3 -march=native yields a consistent timing of 31 ms on the same CPU.

Inspecting the generated assembly reveals the following issues:

Introducing the temporaries rho0, ux0, uy0 is necessary because, unlike clang, LDC doesn't seem convinced that the arrays do not alias. Notice that the C++ version does not use explicit __restrict__ annotations, so there might be room for improvement here. Without the temporaries, the D code runs in 90 ms.
Despite the introduction of the temporaries, the inner loops are not fully unrolled and some vectorization opportunities (particularly on loads/stores) are missed. This is the major outstanding issue, because I currently don't know of a way to get to performance parity with C++.

Please let me know if I can supply any more useful information or test possible workarounds.

9il commented 7 years ago

Try this one

dub build --build=release-nobounds --single collide_bench.d --compiler=ldmd2

/+dub.json:
{
    "name": "collide_bench",
    "dependencies": {
        "mir-algorithm": "~>0.6.4",
        "mir-random": "~>0.2.5"
    },
    "dflags-ldc": ["-mcpu=native"]
}
+/
import std.datetime;
import std.stdio;
import mir.random;
import mir.random.algorithm;
import mir.random.variable;
import mir.ndslice;
import mir.functional;

import ldc.attributes;

@fastmath:

static void collide_kernel
(
    ContiguousTensor!(3, double) n,
    ContiguousMatrix!double ux,
    ContiguousMatrix!double uy,
    ContiguousMatrix!double rho,
    ContiguousVector!(const double) ex,
    ContiguousVector!(const double) ey,
    ContiguousVector!(const double) w,
    in double omega) 
{
    if (n.anyEmpty)
        return;
    auto u = zip(ux, uy, rho);
    do
    {
        auto ni = n.front;
        auto ui = u.front;
        do
        {
            double ux0 = 0;
            double uy0 = 0;
            double rho0 = 0;
            auto nij = ni.front;
            foreach(q; 0 .. nij.length)
            {
                double nq = nij[q];
                rho0 += nq;
                ux0 += ex[q] * nq;
                uy0 += ey[q] * nq;
            }
            ux0 /= rho0;
            uy0 /= rho0;
            double usqr = ux0 * ux0 + uy0 * uy0;
            nij = ni.front;
            foreach(q; 0 .. nij.length)
            {
                double eu = 3 * (ex[q] * ux0 + ey[q] * uy0);
                double neq = rho0 * w[q] * (1.0 + eu + 0.5 * eu * eu - 1.5 * usqr);
                nij[q] = (1 - omega) * nij[q] + omega * neq;
            }
            ui.front.a = ux0;
            ui.front.b = uy0;
            ui.front.c = rho0;
            ni.popFront;
            ui.popFront;
        }
        while(ui.length);
        n.popFront;
        u.popFront;
    }
    while(u.length);
}

int main(string[] args)
{
    immutable size_t NX = 1000;
    immutable size_t NY = 1000;
    immutable size_t Q  = 9;

    auto gen = Random(unpredictableSeed);
    auto var = UniformVariable!double(0, 1);

    auto n = gen.field(var).slicedField([NX,NY,Q]).slice;
    auto ux = slice!double([NX,NY], 0);
    auto uy = slice!double([NX,NY], 0);
    auto rho = slice!double([NX,NY], 0);

    auto ex = [0.0, 1, 0, -1, 0, 1, -1, -1, 1.].sliced;
    auto ey = [0.0, 0, 1, 0, -1, 1, 1, -1, -1].sliced;
    auto w = [4 / 9., 1 / 9., 1 / 9., 1 / 9., 1 / 9.,  1 / 36., 1 / 36., 1 / 36., 1 / 36.].sliced;

    size_t benchmarkCount = 100;
    StopWatch sw;
    sw.start();
    foreach (t; 0 .. benchmarkCount)
        collide_kernel(n, ux, uy, rho, ex, ey, w, 0.6);
    sw.stop();
    writeln(n.first);
    double t1 = sw.peek().usecs;
    writeln(t1 / 1000 / benchmarkCount);
    return 0;
}

dextorious commented 7 years ago

Your version runs in 35.5 ms without @fastmath (to compare with the original C++, which was compiled without it) and 27.5 ms with it, effectively solving the performance issue. It also looks like I have much to learn about idiomatic use of D in general and mir in particular.

With respect to the (vast) differences between the two approaches, is there a single primary cause for the performance differential or is it a matter of stacking small incremental improvements? Is there an underlying reason why the original indexing-based approach cannot match the C++ code? I have many questions about the particular details of the improved code, but I think most of them are better suited to the Gitter chat, so I won't clog up this issue further with those.

The bottom line is: clearly, idiomatically written code can absolutely match and perhaps even surpass the original C++ code. A naive port can't. If there's any interest in identifying the specific reasons for the latter, I'm happy to assist in that endeavour as much as I can, because I think most people coming to D from other languages will only pick up the idiomatic style gradually and if they start out, as I did, with relatively naive ports of existing code, it's easy to get discouraged. I am, however, quite impressed with both the elegance and performance of the optimized version you suggested.

9il commented 7 years ago

With respect to the (vast) differences between the two approaches, is there a single primary cause for the performance differential or is it a matter of stacking small incremental improvements? Is there an underlying reason why the original indexing-based approach cannot match the C++ code?

A compiler get single lengths pack for all pointers. In Mir (and any other matrix library in C++) compiler get length pack for each matrix. The number of general purpose registers is only 16. I have used mir.ndslice.topology: zip on contiguous matrixes. zip fuses shapes and strides, it gives assumption for a compiler that they are the same.
3D Contiguous tensors always recompute their _stride!0 each time because they store only length. So front/popFront usage is preferable (it always preferable for ndslices if number of dimensions is >=2 because it is allow iterate without indexing multiplication).

So the general idea is to reduce number of used registers and number of required index multiplications.

PetarKirov commented 7 years ago

The bottom line is: clearly, idiomatically written code can absolutely match and perhaps even surpass the original C++ code. A naive port can't. If there's any interest in identifying the specific reasons for the latter, I'm happy to assist in that endeavour as much as I can, because I think most people coming to D from other languages will only pick up the idiomatic style gradually and if they start out, as I did, with relatively naive ports of existing code, it's easy to get discouraged. I am, however, quite impressed with both the elegance and performance of the optimized version you suggested.

@dextorious It would be great if you could document your findings. The community needs in general needs more experience with good idioms w.r.t. performance. See also Johan's talk at DConf 2017.

9il commented 7 years ago

The bottom line is: clearly, idiomatically written code can absolutely match and perhaps even surpass the original C++ code. A naive port can't. If there's any interest in identifying the specific reasons for the latter, I'm happy to assist in that endeavour as much as I can, because I think most people coming to D from other languages will only pick up the idiomatic style gradually and if they start out, as I did, with relatively naive ports of existing code, it's easy to get discouraged. I am, however, quite impressed with both the elegance and performance of the optimized version you suggested.

@dextorious I can answer questionы and suggest some idioms, but I am not good in writing docs and tutorials. It would be awesome if you can do something in this area. I will be happy to assist you.

dextorious commented 7 years ago

@9il @ZombineDev I'm not sure how qualified I am to write any tutorials after just a few days of poking around (rather ineptly) at the mir ecosystem, but I'm happy to contribute if I can. One obvious thing that comes to mind would be simply to document and describe my process of porting this simple LB proof of concept code from C++ to D/mir as a learning experience, perhaps in the form of a blog post or something similar? I'm open to suggestions on this front.

wilzbach commented 7 years ago

@dextorious that's an excellent idea. It doesn't need to be formal and I think this is quite a good fit for our blog: http://blog.mir.dlang.io

Also don't worry about making mistakes, we will definitely help you as much as we can on your journey!

9il commented 7 years ago

@dextorious that's an excellent idea. It doesn't need to be formal and I think this is quite a good fit for our blog: http://blog.mir.dlang.io

Also don't worry about making mistakes, we will definitely help you as much as we can on your journey!

+1

dextorious commented 7 years ago

@wilzbach Okay, I'll finish up my code and start writing a post for the mir blog by the end of the week. Lots of questions incoming on Gitter! :)

libmir / mir-algorithm

Performance issue with tight loops over slices #42