cpviolator closed this issue 3 years ago.
Interesting. Note that the QUDA code is also buggy, as temp is not thread-local. As for performance, this has a huge number of implicit barriers.
What's the usual size of n_kr here? If it is too small, then the static work assignment with a chunk size of 32 will lead to the loop running essentially single-threaded (or with few threads).
Also, as i increases, the amount of work in the inner loop over j will inevitably become small, compounding the issue in the above comment.
Thanks for pointing out the thread-local bug. I had it coded properly, then deleted it all, but wanted to add the quarantined code back in, so I copied it sloppily :(
I did have the following booleans in there at one point:
// Continue for the other columns
if((n_kr - (i+1)) > 64) {
#ifdef _OPENMP
#pragma omp parallel for schedule(static,32)
#endif
  for(int j=i+1; j < n_kr; j++) {
    Complex tempp = R[i][j];
    R[i][j] -= (T11 * tempp + T12 * R[i+1][j]);
    R[i+1][j] -= (T21 * tempp + T22 * R[i+1][j]);
  }
} else {
  for(int j=i+1; j < n_kr; j++) {
    temp = R[i][j];
    R[i][j] -= (T11 * temp + T12 * R[i+1][j]);
    R[i+1][j] -= (T21 * temp + T22 * R[i+1][j]);
  }
}
(and similar for the other loops) as a heuristic way of ensuring OpenMP offload only when it might be worth it. It seemed to work well on my laptop, but here in QUDA, any OMP pragmas destroy the loop speed.
BTW, for a decent calculation size, n_kr would be around 1024.
I did wonder if it was data ordering. On my laptop, I use Eigen objects, which are row-major, and the matrices Q and R are updated row by row. However, when I refactored the above to use row-major objects, the same slowdown was seen.
A possibility might be that your Eigen-based code is simply very slow, so it scales well with the number of threads.
Which architecture did you observe the slowdown in QUDA on, by the way? If this was on Power9, then I wouldn't be too surprised. Look at the behaviour of the stream triad on a dual-Power9 with 2x16 cores:
Also on EPYC, threading overheads for small working sets clearly dominate multi-threaded bandwidth:
@kostrzewa These plots are gorgeous!
I tested on two architectures: Haswell and Power9. I actually resolved the issue on Haswell: there was a proliferation of background jobs running during the OMP tests. Once they had all calmed down (and I used the all-important export MV2_ENABLE_AFFINITY=0), I saw a healthy 2x speed-up with 6 OMP cores vs. a LAPACKE-accelerated Eigen. A lot of that speed-up comes from the fact that we can control the tolerance on the QR iteration. We find that for a tolerance of 1e-n on the eigenpair residua, the QR iteration is stable at a tolerance of 1e-(n+1) and smaller. We can set the maximum number of iterations used for Eigen's Schur decomposition, but this does not have a reliable mapping to tolerance, so it introduces instability. The final version of the OMP code I settled on is:
// Continue for the other columns
#ifdef _OPENMP
#pragma omp parallel for schedule(static,32)
#endif
  for(int j=i+1; j < n_kr; j++) {
    Complex tempp = R[i][j];
    R[i][j] -= (T11 * tempp + T12 * R[i+1][j]);
    R[i+1][j] -= (T21 * tempp + T22 * R[i+1][j]);
  }
}
// Rotate R and V, i.e. H->RQ. V->VQ
// Loop over columns of upper Hessenberg
for(int j = 0; j < n_kr - 1; j++) {
  if(abs(R11[j]) > tol) {
    // Loop over the rows, up to the sub-diagonal element i=j+1
#ifdef _OPENMP
#pragma omp parallel
    {
#pragma omp for schedule(static,32) nowait
#endif
    for(int i = 0; i < j+2; i++) {
      Complex tempp = R[i][j];
      R[i][j] -= (R11[j] * tempp + R12[j] * R[i][j+1]);
      R[i][j+1] -= (R21[j] * tempp + R22[j] * R[i][j+1]);
    }
#ifdef _OPENMP
#pragma omp for schedule(static,32) nowait
#endif
    for(int i = 0; i < n_kr; i++) {
      Complex tempp = Q[i][j];
      Q[i][j] -= (R11[j] * tempp + R12[j] * Q[i][j+1]);
      Q[i][j+1] -= (R21[j] * tempp + R22[j] * Q[i][j+1]);
    }
#ifdef _OPENMP
    } // end parallel region; guarded so the braces balance without OpenMP
#endif
  }
}
I've yet to observe this speed-up on Power9. I'm wondering if there is some equivalent of MV2_ENABLE_AFFINITY on Summit I need to set.
So, on Summit's Power9, the command to use is:
jsrun --nrs 1 -a1 -g1 -c${CPU} -EOMP_NUM_THREADS=${CPU} -brs -l gpu-gpu
I swept the thread count with for CPU in 1 2 4 8 16 32 ; do, using the flag --eig-use-eigen-qr false to ensure the OMP code path, with schedule(static,32) for all loops. The results (relevant timers for the dense algebra):
N_CPU=1: ~90 secs
eigenEV = 12.020191 secs ( 5.15%), with 6 calls at 2.003365e+06 us per call
host compute = 78.627692 secs ( 33.7%), with 10 calls at 7.862769e+06 us per call
N_CPU=2: ~56 secs
eigenEV = 7.007859 secs ( 3.56%), with 6 calls at 1.167976e+06 us per call
host compute = 49.272701 secs ( 25.1%), with 10 calls at 4.927270e+06 us per call
N_CPU=4: ~44 secs
eigenEV = 4.429335 secs ( 2.45%), with 6 calls at 7.382225e+05 us per call
host compute = 36.023938 secs ( 19.9%), with 10 calls at 3.602394e+06 us per call
N_CPU=8: ~42 secs
eigenEV = 3.145898 secs ( 1.73%), with 6 calls at 5.243163e+05 us per call
host compute = 38.750711 secs ( 21.3%), with 10 calls at 3.875071e+06 us per call
N_CPU=16: ~35 secs
eigenEV = 2.503287 secs ( 1.43%), with 6 calls at 4.172145e+05 us per call
host compute = 32.357466 secs ( 18.5%), with 10 calls at 3.235747e+06 us per call
N_CPU=32: ~75 secs
eigenEV = 2.193851 secs ( 1.02%), with 6 calls at 3.656418e+05 us per call
host compute = 72.706578 secs ( 33.8%), with 10 calls at 7.270658e+06 us per call
So using N_CPU=16 gives the best speed-up. In contrast, the baseline Eigen implementation with LAPACKE acceleration, using 1 CPU and 16 CPUs, gives:
EIGEN, N_CPU=1: ~81 secs
eigenEV = 10.896648 secs ( 4.91%), with 6 calls at 1.816108e+06 us per call
eigenQR = 42.431922 secs ( 19.1%), with 6 calls at 7.071987e+06 us per call
host compute = 28.158312 secs ( 12.7%), with 4 calls at 7.039578e+06 us per call
EIGEN, N_CPU=16: ~66 secs
eigenEV = 2.430413 secs ( 1.22%), with 6 calls at 4.050688e+05 us per call
eigenQR = 42.681786 secs ( 21.4%), with 6 calls at 7.113631e+06 us per call
host compute = 11.136929 secs ( 5.59%), with 4 calls at 2.784232e+06 us per call
So it looks like the best speed-up comes from using the host OMP-accelerated code, 16 CPUs, and a QR tolerance set one order of magnitude smaller than the eigensolver tolerance.
> @kostrzewa These plots are gorgeous!
Cheers, these are essentially one-liners in ggplot2 (https://ggplot2.tidyverse.org/).
I'm happy that you've found scaling after all!
In the IRAM eigensolver there is some quarantined code in the qrIteration function that uses OMP. When activated, it causes a ~100x slowdown on the loop. The routine was prototyped in laptop code https://github.com/cpviolator/VOATOL/blob/master/include/algoHelpers.h#L470 and gives good speed-up there. The QUDA code is here:
https://github.com/lattice/quda/blob/feature/arnoldi/lib/eig_iram.cpp#L272
Not sure what the problem is, as the #pragmas are pretty straightforward.