learning-chip opened this issue 3 years ago
For reference, the SciPy sparse dot code is extremely simple:
```cpp
template <class I, class T>
void csr_matvec(const I n_row,
                const I n_col,
                const I Ap[],
                const I Aj[],
                const T Ax[],
                const T Xx[],
                      T Yx[])
{
    for(I i = 0; i < n_row; i++){
        T sum = Yx[i];
        for(I jj = Ap[i]; jj < Ap[i+1]; jj++){
            sum += Ax[jj] * Xx[Aj[jj]];
        }
        Yx[i] = sum;
    }
}
```
Taken from https://github.com/scipy/scipy/blob/v1.6.2/scipy/sparse/sparsetools/csr.h#L1119-L1135
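For reference, the same kernel transliterated to plain Python/NumPy (a sketch mirroring the C++ loop above, not SciPy's actual implementation; `Ap`/`Aj`/`Ax` follow the CSR naming from the header):

```python
import numpy as np

def csr_matvec(n_row, n_col, Ap, Aj, Ax, Xx, Yx):
    """Accumulate y += A @ x for a CSR matrix (Ap, Aj, Ax).

    Ap -- row pointers, length n_row + 1
    Aj -- column indices of the nonzeros
    Ax -- nonzero values
    Xx -- dense input vector (length n_col)
    Yx -- dense output vector, updated in place (length n_row)
    """
    for i in range(n_row):
        s = Yx[i]
        # Nonzeros of row i live in the half-open range [Ap[i], Ap[i+1])
        for jj in range(Ap[i], Ap[i + 1]):
            s += Ax[jj] * Xx[Aj[jj]]
        Yx[i] = s
```

The inner loop makes one indirect load of `Xx` per nonzero, which is why SpMV is memory-bound: there is almost no arithmetic to hide the gather latency behind.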
@learning-chip Thanks for providing the reproducible notebook. It made everything easier 👍
First of all, what you were timing includes compilation time as well as running time. This has been fixed since the latest release (the fix should be part of the upcoming v0.10.9, which also includes Python 3.9 support). If you run the code through Binder with the latest master branch, the time goes down to 16.4 ms, though the first run will be slower since it compiles the code once.
If you want to use the current 0.10.8 release, you can also precompile the SDFG yourself:
```python
cspmv = spmv.compile()
%time cspmv(A_row=A_row, A_col=A_col, A_val=A_val, x=x, b=b, H=H, W=W, nnz=nnz)
```
The second part of the question is the transformations. If I recall correctly, the set of transformations we applied was tiling the internal map and using vectorization to improve the runtime.
> If you run the code through Binder with the latest master branch the time goes down to 16.4ms
Thanks, it now runs as fast as SciPy on my machine.
> the set of transformations we applied was tiling the internal map and using vectorization to improve the runtime.

Anything beyond loop tiling and SIMD? Since SpMV is a typical memory-bound kernel, can DaCe insert software prefetch instructions such as `_mm_prefetch` for x86 CPUs? Or is it left to the C++ compiler to recognize prefetchable patterns?
Great question! We don't provide this transformation as part of our standard set, but it's possible to write a simple transformation that gets the memory access patterns from the memlets and inserts a prefetch tasklet. We would be happy to accept any contribution as a PR :)
> it's possible to write a simple transformation that gets the memory access patterns from the memlets and inserts a prefetch tasklet.
Thanks, will come back if I have any updates 👍
Oh I forgot to ask -- where can I find the exact code for loop tiling and SIMD transformations, which were used by the SpMV benchmark in the DaCe paper? Those seem to be non-trivial optimizations, as tiling and SIMD for sparse operations are not as obvious as for GEMM.
Hi @learning-chip, while I did not contribute to the original paper you mentioned, I suggest trying the standard tiling and vectorization transformations, which can be found [here] and [here]. They should be general enough for dynamic maps as well. You could also have a look at [SubgraphFusion] for loop/operator fusion. You can apply transformations either in the GUI or via code, as explained in this tutorial. I do not know what exactly was done in the original paper (@tbennun) to achieve said results, however.
Hello @learning-chip. We do have the optimized SpMV SDFG. Unfortunately, it is based on a quite old version of the SDFG specification, and it is not valid according to the current one. Therefore, I attempted a quick update so that we can at least showcase the optimizations that we did:
The optimizations are basically two:

1. Vectorizing the inner map (`for j in dace.map[A_row[i]:A_row[i+1]]`), as @sancierra suggested. On the left inner map, we perform `vector_length` (e.g., 4) indirections of `x` at a time and store them in a temporary vector, which is then multiplied by `A_val[j:j+vector_length]`, while the result is reduced and added to `b`. The right map handles the remainder in case `A_row[i+1]-A_row[i]` does not divide evenly by `vector_length`.
2. Accumulating `b` in the form of `b_tmp`, so that the threads don't add directly to `b` and potentially cause low performance due to cache invalidations.

I am attaching the (updated) SDFG and a script to run it below (in the zip file); however, there are a couple of caveats:
First, it will not validate as-is because we have changed how edges with a write-conflict resolution work. These edges used to have a property called `wcr_identity` that effectively initialized the data they were pointing to at the start of the current scope. This feature was used to initialize `b_tmp` to zero at the start of every iteration of the row map (`for i in dace.map[0:H]`). Since `wcr_identity` was removed (for reasons that might be out of this discussion's scope), `b_tmp` is not initialized and garbage values are returned. A temporary quick fix is to make a very simple change to the generated code:
```cpp
#pragma omp parallel for
for (auto i = 0; i < H; i += 1) {
    unsigned int rowptr;
    unsigned int rowend;
    float b_tmp; // <---- SET THIS TO ZERO (float b_tmp = 0;)
```
Second, there might be some performance regression (I am not sure, though) due to the use of atomics in the edges with write-conflict resolution. However, it is easy to disable if needed.
I hope this helps!
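To make the left/right inner-map scheme above concrete, here is a plain-NumPy sketch (an illustration modeled on the description, not the actual SDFG; the function name and the scalar `b_tmp` standing in for the thread-local temporary are my own):

```python
import numpy as np

def spmv_vectorized(H, A_row, A_col, A_val, x, vector_length=4):
    """CSR SpMV processing vector_length nonzeros of each row at a time."""
    b = np.zeros(H, dtype=A_val.dtype)
    for i in range(H):  # in the SDFG this is the parallel row map
        start, end = A_row[i], A_row[i + 1]
        b_tmp = 0.0  # per-row accumulator, avoids writing b repeatedly
        # "Left" map: whole chunks of vector_length nonzeros
        n_full = (end - start) // vector_length * vector_length
        for j in range(start, start + n_full, vector_length):
            xv = x[A_col[j:j + vector_length]]               # gather x
            b_tmp += np.dot(A_val[j:j + vector_length], xv)  # multiply, reduce
        # "Right" map: remainder when the row length does not
        # divide evenly by vector_length
        for j in range(start + n_full, end):
            b_tmp += A_val[j] * x[A_col[j]]
        b[i] = b_tmp
    return b
```

The per-row `b_tmp` plays the role described in optimization 2: each (conceptual) thread accumulates locally and writes `b[i]` once, instead of repeatedly updating shared memory.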
@sancierra @alexnick83 Wonderful, thanks for the detailed replies! Let me take a closer look.
Problem description
In the DaCe paper, it is stated that DaCe SpMV is as fast as MKL:
However, I found it 20x slower than `scipy.sparse.csr_matrix.dot`: 1.4 s vs. 60 ms for the problem size used in the DaCe paper. It is probably because I did not apply any transformations to optimize performance. But I could not find more words in the paper about SpMV optimization, except this short paragraph:
I also tried this sample code from the paper, slightly different from the GitHub version, but got `TypeError: dtype must be a DaCe type, got __map_8_b0` at runtime.

Environment

Conda `environment.yml` is