Closed: darshanp4 closed this issue 9 months ago.
Pinging people that might know the answer straight away: @jjerphan @ogrisel @fcharras @Micky774?
I was looking at the following, which could be of interest as well: https://www.osti.gov/servlets/purl/1557469
It seems that SYRK should skip some of the multiplications that GEMM performs.
`SYRK` must in theory be more appropriate than `GEMM` in this case (i.e. when $A = B$), but only comparing implementations will tell us.
@glemaitre thanks for adding more folks. I will also take a look at the paper: https://www.osti.gov/servlets/purl/1557469.
@jjerphan yes, in theory it is more appropriate; I don't have a comparison yet. I will try to see by adding it. Do you have any reference on how we can add BLAS functions?
thank you!
Looking forward to a PR with some quick benchmarks.
For information, numpy uses `syrk` under the hood if it detects `X.T @ X`.
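To make the SYRK-vs-GEMM distinction concrete, here is a minimal sketch comparing the two paths via SciPy's BLAS wrappers (this only illustrates the equivalence; it is not numpy's internal dispatch code):

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(0)
X = np.asfortranarray(rng.standard_normal((1000, 50)))

# GEMM-style path: a generic matrix-matrix product.
gram_gemm = X.T @ X

# SYRK path: computes only one triangle of X.T @ X (trans=1),
# exploiting the symmetry to skip roughly half the multiplications.
gram_syrk = blas.dsyrk(1.0, X, trans=1)  # upper triangle by default

# Reconstruct the full symmetric matrix from the upper triangle.
gram_full = np.triu(gram_syrk) + np.triu(gram_syrk, k=1).T

assert np.allclose(gram_gemm, gram_full)
```

Benchmarking these two calls on realistic sizes would give the comparison mentioned above.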
Also, those optimizations will work for full matrices; for patterns relying on chunks (à la `PairwiseDistancesReduction`) we can only benefit from `SYRK` on the diagonal chunks.
@jjerphan so for `PairwiseDistancesReduction`, if it is already using chunks of size 256, how can we identify the diagonal chunks?
`X_start == Y_start and X_end == Y_end`
Any thoughts on how we can make it work for full matrices?
Also, can you help me understand why scikit-learn relies on chunks, given that OpenBLAS is also doing chunking in the backend?
`X is Y and X_start == Y_start and X_end == Y_end` should be sufficient to identify when to use `SYRK`, but this might add too much complexity, and it might cause a regression due to branching and branch misprediction (the condition only holds once every `n_chunks + 1` iterations) here:
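For illustration, the diagonal-chunk dispatch described above could look like the following sketch. `chunked_gram_like` is a hypothetical helper written for this discussion, not scikit-learn's actual Cython implementation:

```python
import numpy as np
from scipy.linalg import blas

def chunked_gram_like(X, Y, chunk_size=256):
    """Hypothetical sketch of a chunked X @ Y.T computation that
    dispatches to SYRK on diagonal chunks when X is Y."""
    n_x, n_y = X.shape[0], Y.shape[0]
    out = np.empty((n_x, n_y))
    for x_start in range(0, n_x, chunk_size):
        x_end = min(x_start + chunk_size, n_x)
        for y_start in range(0, n_y, chunk_size):
            y_end = min(y_start + chunk_size, n_y)
            # The branch discussed above: only diagonal chunks of the
            # X == Y case can use SYRK, so it is a rare branch that
            # may hurt prediction on all the other iterations.
            if X is Y and x_start == y_start and x_end == y_end:
                A = X[x_start:x_end]
                tri = blas.dsyrk(1.0, A, trans=0)  # A @ A.T, upper triangle
                block = np.triu(tri) + np.triu(tri, k=1).T
            else:
                block = X[x_start:x_end] @ Y[y_start:y_end].T
            out[x_start:x_end, y_start:y_end] = block
    return out
```

This also shows why the gain is second-order: with `n` chunks per axis, only `n` of the `n**2` chunk products hit the SYRK branch.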
You can try to perform benchmarks with https://github.com/scikit-learn/pairwise-distances-reductions-asv-suite, but a priori using `SYRK` in `PairwiseDistancesReduction` would not help significantly.
Chunks are used for `PairwiseDistancesReduction` as explained in the private submodule documentation, here:
Let us know if something is unclear.
To keep intermediate data structures in the CPU cache, cache size matters for performance, and it is different for every ISA. But here it is hardcoded to 256, whereas OpenBLAS handles it per ISA. So shouldn't we leave it to OpenBLAS to handle it in a more optimized way?
The point is that the pairwise distance computations perform reductions (aggregation functions) over a chunk of distance values. This chunk size was tested empirically, and 256 was chosen as a good middle ground. This is very different from BLAS functions: matrix-matrix operations are just multiplications and additions, nothing else. For those, the cache is even more important and, I guess, has had the most engineering hours put into optimizing it, out of all algorithms!
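To illustrate the difference from a plain BLAS call, here is a minimal sketch of a chunked reduction (a nearest-neighbor argmin), where each distance block is consumed as soon as it is produced instead of materializing the full distance matrix. The helper name and the chunk size are assumptions for this example, not scikit-learn's actual code:

```python
import numpy as np

def chunked_argmin_dist(X, Y, chunk_size=256):
    """Hypothetical sketch: reduce (argmin over Y) chunks of the
    distance matrix as they are produced, so each block stays
    cache-sized instead of allocating the full n_x * n_y matrix."""
    n_x = X.shape[0]
    best_dist = np.full(n_x, np.inf)
    best_idx = np.zeros(n_x, dtype=np.intp)
    sq_norms_Y = (Y ** 2).sum(axis=1)
    for y_start in range(0, Y.shape[0], chunk_size):
        Y_chunk = Y[y_start:y_start + chunk_size]
        # Partial squared distances to this chunk only; the ||x||^2
        # term is constant per row, so it can be dropped for argmin.
        d = sq_norms_Y[y_start:y_start + chunk_size] - 2 * X @ Y_chunk.T
        idx = d.argmin(axis=1)
        vals = d[np.arange(n_x), idx]
        update = vals < best_dist
        best_dist[update] = vals[update]
        best_idx[update] = idx[update] + y_start
    return best_idx
```

The interleaved reduction is what makes the 256-row chunking worthwhile here, and it is something a pure BLAS routine cannot do on its own.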
Back to this issue: as SYRK would only apply to the diagonal chunks, it is a second-order effect. I'm -1 on it considering the trade-offs with code complexity.
Discussed in https://github.com/scikit-learn/scikit-learn/discussions/27877