scipy / scipy


ENH: Optimizing scipy.spatial.distance.cdist by using Acceleration library #16023

Open Qiyu8 opened 2 years ago

Qiyu8 commented 2 years ago

Is your feature request related to a problem? Please describe.

The cdist function is widely used in AI-related applications, but its performance is not good enough. Here is one practical use case:

import numpy as np
from scipy.spatial.distance import cdist

dims_a1 = 3360   # rows of vector_a
dims_a0 = 3072   # columns (feature dimension)
dims_b1 = 15904  # rows of vector_b
dims_b0 = 3072   # columns (feature dimension)
vector_a = np.array([[i + 0.1*j for i in range(dims_a0)] for j in range(dims_a1)], np.float32)
vector_b = np.array([[i + j/2 for i in range(dims_b0)] for j in range(dims_b1)], np.float32)
# pairwise Euclidean distances, result shape (3360, 15904)
res = cdist(vector_a, vector_b)

The code above takes 6 minutes on my server machine, which is unbearable in practical applications.

Describe the solution you'd like.

After an initial reading of the code, the performance bottleneck appears to be in distance_metrics.h:

// Euclidean distance: map each component pair to (x - y)^2, sum, then take sqrt
void operator()(StridedView2D<T> out, StridedView2D<const T> x, StridedView2D<const T> y) const {
    transform_reduce_2d_(out, x, y, [](T x, T y) INLINE_LAMBDA {
        auto diff = std::abs(x - y);
        return diff * diff;
    },
    [](T x) { return std::sqrt(x); });
}

The transform operation is too slow when the matrices are large, because it does not take advantage of an acceleration library such as MKL/OpenBLAS, and no vectorized optimization is used during the calculation. What about replacing this implementation with a more effective approach, such as the Euclidean-norm fast path used elsewhere in SciPy, which calls the highly optimized BLAS function nrm2?

if ord in (None, 2) and (a.ndim == 1):
    # use blas for fast and stable euclidean norm
    nrm2 = get_blas_funcs('nrm2', dtype=a.dtype, ilp64='preferred')
    return nrm2(a)
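
For reference, a minimal sketch of calling the BLAS nrm2 routine directly through scipy.linalg.get_blas_funcs (the array size here is only illustrative):

import numpy as np
from scipy.linalg import get_blas_funcs

a = np.random.rand(3072).astype(np.float32)
# pick the nrm2 routine matching the array's dtype (snrm2 for float32)
nrm2 = get_blas_funcs('nrm2', (a,))
print(nrm2(a), np.linalg.norm(a))  # both compute the Euclidean norm of a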

Describe alternatives you've considered.

No response

Additional context (e.g. screenshots, GIFs)

No response

tupui commented 2 years ago

cc @peterbell10 and @czgdp1807

czgdp1807 commented 2 years ago

AFAIK, as of now, the number of steps in cdist computations is fixed, i.e., O(num_rowsA * num_rowsB * num_cols), or more precisely some_constant * num_rowsA * num_rowsB * num_cols. The PRs https://github.com/scipy/scipy/pull/14611 and https://github.com/scipy/scipy/pull/14640 extend the map-reduce and loop-unrolling framework already implemented in the main branch, which exploits the instruction-level parallelism (ILP) available on the hardware. From what I understood while working on gh-14611 and gh-14640, we divide the whole problem into multiple chunks and store the result of each chunk in a different memory location. This increases throughput because the CPU can execute parts of several instructions simultaneously, which reduces the some_constant factor in the number of steps.

However, since the computation is still sequential, for asymptotically large inputs (like in your case, @Qiyu8) the improvement in some_constant won't help much: reducing some_constant brings performance improvements on small inputs, but at that scale the constants in the time complexity don't matter much. So I agree with @Qiyu8. While working on mahalanobis in gh-14611 I observed the same thing: the number of steps increases by a huge amount, and then no matter how much loop unrolling is done to increase ILP, the difference is small.
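
A rough Python-level sketch of the chunked-accumulator idea described above (purely illustrative; the actual framework does this in C++ with fixed-size unrolled blocks):

import numpy as np

def squared_euclidean_chunked(x, y, n_chunks=4):
    # Illustration only: split the reduction into independent partial sums
    # so the hardware can overlap their execution (ILP).
    d = x - y
    partials = [np.dot(p, p) for p in np.array_split(d, n_chunks)]
    return sum(partials)

x = np.random.rand(3072)
y = np.random.rand(3072)
print(squared_euclidean_chunked(x, y), np.sum((x - y) ** 2))  # should agree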

Please feel free to correct me if you find anything wrong in my thoughts above. I would be happy to update my two PRs accordingly.

ilayn commented 2 years ago

I don't know the intricacies of cdist edge cases, but for the Euclidean norm of an array, the C code given and ?nrm2 (which is a Level-1 BLAS function) will not make much difference, since both are point-wise calculations running over an array. The nice part of ?nrm2 is that it is careful about under- and overflows as an extra, which the C code doesn't check, and that I think is very nice to have. Also, in terms of coding effort it's much cleaner, but that's a subjective preference.

The typical approach (at least the one I am aware of) is to write the code so that the architecture allows for natural blocking, that is, utilizing Level-3 BLAS code. Level-1 tricks won't bring the performance that these blocked optimizations bring, and both will have a very large constant in the O(nmp) term, mainly because this operation is I/O bound, that is to say, the data-transfer cost is higher than the compute cost. Hence the bottleneck.

One way is to write a divide-and-conquer scheme, which, if I am not wrong, is what the CUDA/GPU folks prefer; the other is to express the problem as a matrix-matrix product and optimize over matmul.

czgdp1807 commented 2 years ago

Each pair of vectors from XA and XB requires O(num_cols) steps. So if we compute a bunch of pairs simultaneously, that might bring performance improvements. However, if we do the computation for each pair of vectors from XA and XB, then a minimum of O(num_rowsA * num_rowsB) steps is unavoidable, so the only scope for optimization is to reduce the O(num_cols) work per unit of time. For example, divide the vector of size num_cols into a number of parts (say n), compute these n results simultaneously, and then reduce the n results to a single value. However, for very small values of num_cols the overhead will be huge, so there will be performance degradation. What are your thoughts, @ilayn?

ilayn commented 2 years ago

Complexity estimates are always tricky because you are missing the constants. Unfortunately, even the order of operations matters, hence we can't generalize these estimates to actual implementations. Obviously, at the lower limit there will be a point where linalg.norm(A-B) starts beating any implementation in terms of performance. But that is something we should ignore, since at that size you don't need to use cdist. For large problems, https://hal.archives-ouvertes.fr/hal-02047514/document (Section 2) can give you a better idea of how the matrix-product idea is implemented. The blocking/chunking is done over those products for which Level-3 operations are available.

I didn't do a comprehensive search; this is the second result that came out of a Scholar search for the EDM (Euclidean distance matrix) problem. There are obviously gazillions of algorithms on the subject. What I am hoping to emphasize is that it is better to chunk into operations that give way to blocked (Level-3) operations rather than chunking what are essentially Level-1 ops.

ilayn commented 2 years ago

Here is another Python-level implementation that illustrates the basic idea (and apparently suffers from overflow issues in the accepted answer):

https://stackoverflow.com/questions/64952027/compute-l2-distance-with-numpy-using-matrix-multiplication
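
For concreteness, a minimal NumPy sketch of the matrix-product formulation, ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b (see the precision caveats discussed below):

import numpy as np

def cdist_sq_via_matmul(XA, XB):
    # Pairwise squared Euclidean distances via one large matmul.
    # Trades some numerical accuracy for speed.
    sq_a = np.einsum('ij,ij->i', XA, XA)[:, None]  # ||a||^2 as a column
    sq_b = np.einsum('ij,ij->i', XB, XB)[None, :]  # ||b||^2 as a row
    # clip tiny negatives caused by cancellation before any sqrt
    return np.maximum(sq_a + sq_b - 2.0 * (XA @ XB.T), 0.0)

XA = np.random.rand(100, 32)
XB = np.random.rand(200, 32)
D = np.sqrt(cdist_sq_via_matmul(XA, XB))  # compare against cdist(XA, XB)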

peterbell10 commented 2 years ago

nrm2 isn't directly helpful because it calculates the Euclidean norm, not the distance. We would first need to write the intermediate difference vector into memory and pass that to nrm2, which could introduce a lot of overhead. Other changes like (x-y)^2 => x^2 + y^2 - 2xy have come up before, but since they trade precision for speed, I don't think they are suitable for a general-purpose library.
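
A small illustration of that precision trade-off (the values are chosen to exaggerate the cancellation; this is a sketch, not a benchmark):

import numpy as np

x = np.array([10000.0], dtype=np.float32)
y = np.array([10000.001], dtype=np.float32)

direct   = (x - y) ** 2               # square of the small difference
expanded = x * x + y * y - 2 * x * y  # catastrophic cancellation in float32

print(direct, expanded)  # the expanded form loses the small difference entirely here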

Adding vectorization directly to our distance calculation code would be reasonable; we just need to figure out runtime CPU feature detection so the code can fall back to a non-vectorized version on older CPUs.

ilayn commented 2 years ago

The difference has to be calculated regardless, hence the working array is unavoidable.

The same precision issue is also present in the square of the difference anyhow. So that trade-off is about which case is sacrificed, since ||x - y|| is also quite bad when x and y are very close to each other.

If things are mostly done column-wise, then vectorization won't shave off much time due to cache misses; the work should run along the row direction in case this is Cythonized or some other scheme is implemented.

peterbell10 commented 2 years ago

The difference has to be calculated regardless, hence the working array is unavoidable.

An intermediate calculation step is required, but this result remains entirely within CPU registers; whereas calling dnrm2 would require O(k) additional memory for k-dimensional distances and O(N) additional memory accesses.

The same precision issue is also present in the square of the difference anyhow. So that trade-off is about which case is sacrificed, since ||x - y|| is also quite bad when x and y are very close to each other.

When x and y are close to each other, the limitation is in the representation of x and y as floating-point values, which is unavoidable and completely outside of the difference calculation itself. On the other hand, representing x^2 and y^2 introduces additional and unnecessary imprecision, over and above that inherent to the floating-point representation of the input. These are not the same.

ilayn commented 2 years ago

When the difference x-y is small, the elementwise abs calculation also loses precision in the squaring. Squaring loses precision on the other end of the spectrum, when x or y is large. It's a matter of trade-off; for moderate values the precision loss is not going to matter.

If you are doing a running total of the difference, then you are back at the original problem given in the OP: you have to run over columns, which will cause cache misses if the array is C-contiguous, so performance will be affected. O(n) memory access on contiguous memory is not nearly as bad and gives a chance to harden against under-/overflow cases, which is what ?nrm2 does. But for a substantial performance increase, a not-so-straightforward blocking scheme is needed.

peterbell10 commented 2 years ago

When the difference x-y is small, the elementwise abs calculation also loses precision in the squaring.

If (x-y)^2 is small, then x^2 + y^2 - 2xy is going to be equally small and suffer the exact same rounding error as a result of being represented as a float.

If you are doing a running total of the difference, then you are back at the original problem given in the OP: you have to run over columns, which will cause cache misses if the array is C-contiguous, so performance will be affected.

A C-contiguous array will have each point vector as a contiguous sub-array, which means perfectly linear data access.

ilayn commented 2 years ago

If (x-y)^2 is small, then x^2 + y^2 - 2xy is going to be equally small and suffer the exact same rounding error as a result of being represented as a float.

Yes, that's why precision loss is not the issue here. Hence blocking is not going to introduce any loss.

C arrays are row-major, so going through the columns is the typical cache-miss situation. C-contiguity requires you to run over the rows; in Fortran it's the columns. If you have vectors in a NumPy array, chances are your point vectors are along the columns but transposed at the input (which is typically the case for users).

The point here is blocked computation, not memory-access speed. But anyway, I am repeating myself; I'll leave it to the residents.