cpviolator closed this issue 3 years ago.
Interesting. Note that the QUDA code is also buggy, as temp is not thread-local. As for performance, this has a huge number of implicit barriers.
What's the usual size of n_kr here? If it is too small, then the static work assignment with a chunk size of 32 will lead to the loop running essentially single-threaded (or with few threads).
Also, as i increases, the amount of work in the inner loop over j will inevitably become small, compounding the issue in the above comment.
Thanks for pointing out the thread-local bug. I had it coded properly, then deleted it all, but wanted to add the quarantined code back in, so I copied it sloppily :(
I did have the following booleans in there at one point:
// Continue for the other columns
if((n_kr - (i+1)) > 64) {
#ifdef _OPENMP
#pragma omp parallel for schedule(static,32)
#endif
  for(int j=i+1; j < n_kr; j++) {
    Complex tempp = R[i][j];
    R[i][j] -= (T11 * tempp + T12 * R[i+1][j]);
    R[i+1][j] -= (T21 * tempp + T22 * R[i+1][j]);
  }
} else {
  for(int j=i+1; j < n_kr; j++) {
    temp = R[i][j];
    R[i][j] -= (T11 * temp + T12 * R[i+1][j]);
    R[i+1][j] -= (T21 * temp + T22 * R[i+1][j]);
  }
}
(and similar for the other loops) as a heuristic way of ensuring OpenMP offload only when it might be worth it. It seemed to work well on my laptop, but here in QUDA, any OMP pragmas destroy the loop speed.
BTW, for a decent calculation size, n_kr would be around 1024.
I did wonder if it was data ordering. On my laptop, I use Eigen objects, which are row-major, and the matrices Q and R are updated row by row. However, when I refactored the above to use row-major objects, the same slowdown was seen.
A possibility might be that your Eigen-based code is simply very slow, so it scales well with the number of threads.
Which architecture did you observe the slowdown in QUDA on, by the way? If this was on Power9, then I wouldn't be too surprised. Look at the behaviour of the stream triad on a dual-Power9 with 2x16 cores:
Also on EPYC, threading overheads for small working sets clearly dominate multi-threaded bandwidth:
@kostrzewa These plots are gorgeous!
I tested on two architectures: Haswell and Power9. I actually resolved the issue on Haswell: there was a proliferation of background jobs running during the OMP tests. Once they had all calmed down (and I used the all-important export MV2_ENABLE_AFFINITY=0), I saw a healthy 2x speed-up with 6 OMP cores vs. a LAPACKE-accelerated Eigen. A lot of that speed-up comes from the fact that we can control the tolerance on the QR iteration. We find that for a tolerance of 1e-n on the eigenpair residua, the QR iteration is stable at a tolerance of 1e-(n+1) and smaller. We can set the maximum number of iterations used for Eigen's Schur decomposition, but this does not have a reliable mapping to tolerance, so it introduces instability. The final version of the OMP code I settled on is:
// Continue for the other columns
#ifdef _OPENMP
#pragma omp parallel for schedule(static,32)
#endif
  for(int j=i+1; j < n_kr; j++) {
    Complex tempp = R[i][j];
    R[i][j] -= (T11 * tempp + T12 * R[i+1][j]);
    R[i+1][j] -= (T21 * tempp + T22 * R[i+1][j]);
  }
}
// Rotate R and V, i.e. H->RQ. V->VQ
// Loop over columns of upper Hessenberg
for(int j = 0; j < n_kr - 1; j++) {
  if(abs(R11[j]) > tol) {
    // Loop over the rows, up to the sub-diagonal element i=j+1
#ifdef _OPENMP
#pragma omp parallel
    {
#pragma omp for schedule(static,32) nowait
#endif
    for(int i = 0; i < j+2; i++) {
      Complex tempp = R[i][j];
      R[i][j] -= (R11[j] * tempp + R12[j] * R[i][j+1]);
      R[i][j+1] -= (R21[j] * tempp + R22[j] * R[i][j+1]);
    }
#ifdef _OPENMP
#pragma omp for schedule(static,32) nowait
#endif
    for(int i = 0; i < n_kr; i++) {
      Complex tempp = Q[i][j];
      Q[i][j] -= (R11[j] * tempp + R12[j] * Q[i][j+1]);
      Q[i][j+1] -= (R21[j] * tempp + R22[j] * Q[i][j+1]);
    }
#ifdef _OPENMP
    } // end parallel region; guarded so the braces balance without OpenMP
#endif
  }
}
I've yet to observe this speed-up on Power9. I'm wondering if there is some equivalent of MV2_ENABLE_AFFINITY on Summit I need to set.
So, on Summit's Power9, the command to use is:
jsrun --nrs 1 -a1 -g1 -c${CPU} -EOMP_NUM_THREADS=${CPU} -brs -l gpu-gpu
I swept the thread count with for CPU in 1 2 4 8 16 32 ; do, using the flag --eig-use-eigen-qr false to ensure the OMP code path, with schedule(static,32) for all loops. The results (relevant timers for the dense algebra):
N_CPU=1: ~90 secs
eigenEV = 12.020191 secs ( 5.15%), with 6 calls at 2.003365e+06 us per call
host compute = 78.627692 secs ( 33.7%), with 10 calls at 7.862769e+06 us per call
N_CPU=2: ~56 secs
eigenEV = 7.007859 secs ( 3.56%), with 6 calls at 1.167976e+06 us per call
host compute = 49.272701 secs ( 25.1%), with 10 calls at 4.927270e+06 us per call
N_CPU=4: ~44 secs
eigenEV = 4.429335 secs ( 2.45%), with 6 calls at 7.382225e+05 us per call
host compute = 36.023938 secs ( 19.9%), with 10 calls at 3.602394e+06 us per call
N_CPU=8: ~42 secs
eigenEV = 3.145898 secs ( 1.73%), with 6 calls at 5.243163e+05 us per call
host compute = 38.750711 secs ( 21.3%), with 10 calls at 3.875071e+06 us per call
N_CPU=16: ~35 secs
eigenEV = 2.503287 secs ( 1.43%), with 6 calls at 4.172145e+05 us per call
host compute = 32.357466 secs ( 18.5%), with 10 calls at 3.235747e+06 us per call
N_CPU=32: ~75 secs
eigenEV = 2.193851 secs ( 1.02%), with 6 calls at 3.656418e+05 us per call
host compute = 72.706578 secs ( 33.8%), with 10 calls at 7.270658e+06 us per call
So using N_CPU=16 gives the best speed-up. In contrast, the baseline Eigen implementation with LAPACKE acceleration, using 1 CPU and 16 CPUs, gives:
EIGEN, N_CPU=1: ~81 secs
eigenEV = 10.896648 secs ( 4.91%), with 6 calls at 1.816108e+06 us per call
eigenQR = 42.431922 secs ( 19.1%), with 6 calls at 7.071987e+06 us per call
host compute = 28.158312 secs ( 12.7%), with 4 calls at 7.039578e+06 us per call
EIGEN, N_CPU=16: ~66 secs
eigenEV = 2.430413 secs ( 1.22%), with 6 calls at 4.050688e+05 us per call
eigenQR = 42.681786 secs ( 21.4%), with 6 calls at 7.113631e+06 us per call
host compute = 11.136929 secs ( 5.59%), with 4 calls at 2.784232e+06 us per call
So it looks like the best speed-up comes from using the host OMP-accelerated code, 16 CPUs, and a QR tolerance set one order of magnitude smaller than the eigensolver tolerance.
> @kostrzewa These plots are gorgeous!
Cheers, these are essentially one-liners in ggplot2 (https://ggplot2.tidyverse.org/).
I'm happy that you've found scaling after all!
In the IRAM eigensolver there is some quarantined code in the qrIteration function that uses OMP. When activated, it causes a ~100x slowdown on the loop. The routine was prototyped in laptop code https://github.com/cpviolator/VOATOL/blob/master/include/algoHelpers.h#L470 and gives good speed-up there. The QUDA code is here:
https://github.com/lattice/quda/blob/feature/arnoldi/lib/eig_iram.cpp#L272
Not sure what the problem is, as the #pragmas are pretty straightforward.