Open Qiyu8 opened 2 years ago
cc @peterbell10 and @czgdp1807
AFAIK, as of now, the number of steps in `cdist` computations is fixed, i.e., `O(num_rowsA * num_rowsB * num_cols)`, or more precisely `some_constant * num_rowsA * num_rowsB * num_cols`. Now, the PRs https://github.com/scipy/scipy/pull/14611 and https://github.com/scipy/scipy/pull/14640 extend the map-reduce and loop-unrolling framework already implemented in the main branch. This exploits the instruction-level parallelism (ILP) available on the hardware. From what I have understood so far while working on gh-14611 and gh-14640, we divide the whole problem into multiple chunks and store the result of each chunk in a different memory location. This increases throughput because the CPU can execute parts of several instructions simultaneously, which reduces the `some_constant` factor in the number of steps. However, since the computation is still sequential, the improvement in `some_constant` won't help much asymptotically for very large inputs (like in your case, @Qiyu8). So, if we go for reducing the value of `some_constant`, we will see performance improvements on small inputs, but for asymptotically large inputs we won't see much of a difference, because at that scale the constants in time complexities don't matter much. So, I agree with @Qiyu8. While working on mahalanobis in gh-14611 I observed the same thing: the number of steps increases by a huge amount, and then no matter how much loop unrolling is done to increase ILP, the difference is small.
Please feel free to correct me if you find anything off in my thoughts above. I would be happy to update my two PRs accordingly.
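For illustration only, here is a rough Python sketch of the multiple-accumulator pattern I am describing (the actual PRs do this in the C++ templates, where the independent partial sums let the CPU overlap the additions; the function name is just for the sketch):

```python
import numpy as np

def sqeuclidean_unrolled(u, v):
    """Conceptual sketch: accumulate the squared distance in two
    independent partial sums so consecutive iterations do not depend
    on each other (the benefit shows up in compiled code, not in
    pure Python)."""
    n = len(u)
    acc0 = 0.0
    acc1 = 0.0
    i = 0
    while i + 1 < n:
        d0 = u[i] - v[i]
        d1 = u[i + 1] - v[i + 1]
        acc0 += d0 * d0
        acc1 += d1 * d1
        i += 2
    if i < n:                      # leftover element when n is odd
        d = u[i] - v[i]
        acc0 += d * d
    return acc0 + acc1             # final reduce of the partial sums

u = np.random.rand(7)
v = np.random.rand(7)
assert np.isclose(sqeuclidean_unrolled(u, v), np.sum((u - v) ** 2))
```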
I don't know the intricacies of the `cdist` edge cases, but for the Euclidean norm of an array, the given C code and `?nrm2` (which is a Level-1 BLAS function) will not make much difference, since both are point-wise calculations running over an array. The nice part of `?nrm2` is that, as an extra, it is careful about under- and overflow, which the C code doesn't check and which I think is very nice to have. Also, in terms of code effort it's much cleaner, but that's a subjective preference.
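To make the under-/overflow point concrete, here is a small sketch using `scipy.linalg.blas.dnrm2` (the double-precision `?nrm2` wrapper); the values are chosen only to force the naive version to overflow:

```python
import numpy as np
from scipy.linalg.blas import dnrm2

x = np.array([1e200, 1e200, 1e200])

naive = np.sqrt(np.sum(x * x))  # x * x overflows to inf before the sqrt
safe = dnrm2(x)                 # ?nrm2 rescales internally to avoid overflow

print(naive)  # inf
print(safe)   # ~1.732e200
```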
The typical approach (at least the one I am aware of) is to write the code such that the architecture allows for natural blocking, that is, utilizing Level-3 BLAS code. Level-1 tricks won't bring the performance that these blocked optimizations bring, and both will have a very large constant in the O(n*m*p) term, mainly because this operation is IO-bound, that is to say, the data-transfer cost is higher than the compute cost. Hence the bottleneck.
One way is to write a divide-and-conquer scheme, which, if I am not wrong, is what the CUDA/GPU folks prefer; the other is to express the problem as a matrix-matrix product and optimize over matmul.
Each pair of vectors from XA and XB requires O(num_cols) steps. So, if we compute a bunch of pairs simultaneously, that might bring performance improvements. However, if we do the computation for each pair of vectors from XA and XB, a minimum of O(num_rowsA * num_rowsB) pair computations will happen anyway, so the only scope for optimization is to reduce the O(num_cols) work per pair per unit of time. For example, divide the vector of size `num_cols` into a bunch of parts (say n), compute these n results simultaneously, and then reduce the n results to a single value. However, for very small values of `num_cols` the overhead will be huge, so there will be performance degradation. What are your thoughts, @ilayn ?
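A rough NumPy sketch of this chunk-and-reduce idea, just to illustrate (for small `num_cols` the split/reduce machinery itself becomes the overhead mentioned above):

```python
import numpy as np

def sqdist_chunked(u, v, n_chunks=4):
    """Split the length-num_cols difference into n_chunks parts,
    compute a partial squared sum per part, then reduce the partials
    to a single value."""
    d = u - v
    partials = [np.dot(p, p) for p in np.array_split(d, n_chunks)]
    return sum(partials)  # final reduction

u = np.random.rand(1000)
v = np.random.rand(1000)
assert np.isclose(sqdist_chunked(u, v), np.sum((u - v) ** 2))
```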
Complexity computations are always tricky because you are missing the constants. Unfortunately, even the order of operations matters, hence we can't generalize these estimates to actual implementations. Obviously, at the lower limit there will be a point where `linalg.norm(A-B)` starts beating any implementation in terms of performance, but that is something we should ignore, since then you don't need to use `cdist`. For large problems, https://hal.archives-ouvertes.fr/hal-02047514/document (Section 2) can give you a better idea of how the matrix-product idea is implemented. The blocking/chunking is done over those products for which Level-3 operations are available.
I didn't do a comprehensive search; this is the second result that came out of a Scholar search for the EDM (Euclidean distance matrix) problem. There are obviously gazillions of algorithms on the subject. What I am hoping to emphasize is that it is better to chunk into operations that give way to blocked (Level-3) operations, rather than chunking what are essentially Level-1 ops.
Here is another Python-level implementation that illustrates the basic idea (and apparently suffers from overflows in the accepted answer).
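For reference, a minimal sketch of the matrix-product formulation (my own illustration, not the implementation referred to above); the heavy lifting is a single Level-3 GEMM, and the clip guards against the small negative values that the cancellation can produce:

```python
import numpy as np
from scipy.spatial.distance import cdist

def euclidean_cdist_gemm(XA, XB):
    """Euclidean distance matrix via ||a||^2 + ||b||^2 - 2 a.b,
    so the dominant cost is one matrix-matrix product."""
    sqA = np.einsum('ij,ij->i', XA, XA)[:, None]  # squared row norms of XA
    sqB = np.einsum('ij,ij->i', XB, XB)[None, :]  # squared row norms of XB
    sq = sqA + sqB - 2.0 * (XA @ XB.T)            # the GEMM does the heavy lifting
    np.clip(sq, 0.0, None, out=sq)                # guard against tiny negatives
    return np.sqrt(sq)

XA = np.random.rand(500, 64)
XB = np.random.rand(400, 64)
print(np.allclose(euclidean_cdist_gemm(XA, XB), cdist(XA, XB)))  # True up to rounding
```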
`nrm2` isn't directly helpful because it calculates the Euclidean norm, not the distance. We would need to first write the intermediate difference vector into memory and pass that to `nrm2`, which could introduce a lot of overhead. Other changes like `(x-y)^2 => x^2 + y^2 - 2xy` have come up before, but since they trade precision for speed, I don't think they're suitable for a general-purpose library.
Adding vectorization directly to our distance calculation code would be reasonable; we just need to figure out runtime CPU feature detection so the code can fall back to a non-vectorized version on older CPUs.
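A tiny example of the precision trade-off being referred to, with values chosen to make the cancellation obvious on IEEE-754 doubles:

```python
x, y = 1e8, 1e8 + 1.0

direct = (x - y) ** 2                  # exact: 1.0
expanded = x * x + y * y - 2 * x * y   # catastrophic cancellation: 0.0 here

print(direct, expanded)
```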
The difference has to be calculated regardless, hence the working array is unavoidable.
The same precision issue is also present in the square of the difference anyhow. So that trade-off is about which case is sacrificed, since `||x-y||` is also quite bad when x and y are very close to each other.
If things are mostly done column-wise, then vectorization won't shave off much due to cache misses, and it should work in the row direction in case this is Cythonized or some other scheme is implemented.
> The difference has to be calculated regardless, hence the working array is unavoidable.
An intermediate calculation step is required, but its result remains entirely within CPU registers; whereas calling `dnrm2` would require O(k) additional memory for k-dimensional distances and O(N) additional memory accesses.
> The same precision issue is also present in the square of the difference anyhow. So that trade-off is about which case is sacrificed, since `||x-y||` is also quite bad when x and y are very close to each other.
When x and y are close to each other, the limitation is in the representation of x and y as floating-point values, which is unavoidable and completely outside of the difference calculation itself. On the other hand, representing `x^2` and `y^2` introduces additional and unnecessary imprecision, over and above that inherent in the floating-point representation of the input. These are not the same.
When the difference x - y is small, the elementwise abs calculation also loses precision in the squaring. Squaring loses precision at the other end of the spectrum, when x or y is large. It's a matter of trade-off; for moderate values the precision loss is not going to matter.
If you are doing a running total of the differences, then you are back at the original problem given in the OP, that is, you have to run over columns, which will cause cache misses if the array is C-contiguous, so performance will be affected. O(n) memory access on contiguous memory is not nearly as bad, and it gives a chance to harden against under-/overflow cases, which is what `?nrm2` does. But for a substantial performance increase, a not-so-straightforward blocking scheme is needed.
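To sketch what such a blocking scheme could look like on top of the matrix-product idea (an illustration under my own assumptions, not a proposal for the actual C++ code), one can process the output in column blocks so each tile is a cache-sized Level-3 product:

```python
import numpy as np
from scipy.spatial.distance import cdist

def euclidean_cdist_blocked(XA, XB, block=256):
    """Compute the distance matrix tile by tile; each tile reuses the
    ||a||^2 + ||b||^2 - 2 a.b trick, so the inner work is a GEMM on a
    block that fits in cache."""
    sqA = np.einsum('ij,ij->i', XA, XA)[:, None]
    out = np.empty((XA.shape[0], XB.shape[0]))
    for start in range(0, XB.shape[0], block):
        B = XB[start:start + block]
        sqB = np.einsum('ij,ij->i', B, B)[None, :]
        tile = sqA + sqB - 2.0 * (XA @ B.T)
        np.clip(tile, 0.0, None, out=tile)   # guard against tiny negatives
        out[:, start:start + block] = np.sqrt(tile)
    return out

XA = np.random.rand(300, 32)
XB = np.random.rand(1000, 32)
print(np.allclose(euclidean_cdist_blocked(XA, XB), cdist(XA, XB)))  # True up to rounding
```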
> When the difference x - y is small, the elementwise abs calculation also loses precision in the squaring.
If `(x-y)^2` is small, then `x^2 + y^2 - 2xy` is going to be equally small and suffer the exact same rounding error as a result of representing it as a float.
> If you are doing a running total of the differences, then you are back at the original problem given in the OP, that is, you have to run over columns, which will cause cache misses if the array is C-contiguous, so performance will be affected.
A C-contiguous array will have each point vector as a contiguous sub-array, which means perfectly linear data access.
> If `(x-y)^2` is small, then `x^2 + y^2 - 2xy` is going to be equally small and suffer the exact same rounding error as a result of representing it as a float.
Yes, that's why precision loss is not the issue here; hence blocking is not going to introduce any loss.
C arrays are row-major, so going through columns is the typical cache-miss situation. C-contiguity requires that you run over the rows; in Fortran it's the columns. If you have vectors in a NumPy array, chances are your point vectors are along the columns but transposed at the input (which is typically the case for users).
The point here is blocked computation, not memory-access speed. But anyway, I am repeating myself; I'll leave it to the residents.
Is your feature request related to a problem? Please describe.
`cdist` is widely used in AI-related applications, but its performance is not good enough. Here is one practical use case; the code takes 6 minutes on my server machine, which is unbearable in practical applications.
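(For illustration only, a call of the kind being described might look like the sketch below; the array shapes are assumptions, not the original snippet.)

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical sizes for illustration; the original workload is not shown here.
XA = np.random.rand(10_000, 128)
XB = np.random.rand(10_000, 128)

D = cdist(XA, XB, metric='euclidean')  # the pairwise loop here is the hot spot
```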
Describe the solution you'd like.
After an initial reading of the code, the performance bottleneck was found in `distance_metrics.h`: the transform operation is too slow when the matrix is large, because it does not take advantage of an acceleration library such as MKL/OpenBLAS, and no vectorized optimization is used during the calculation. What about replacing this implementation with a more effective method, such as computing the Euclidean distance with the highly optimized BLAS function `nrm2`?
Describe alternatives you've considered.
No response
Additional context (e.g. screenshots, GIFs)
No response