scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License
12.95k stars 5.16k forks source link

RFC: spatial: distance computation kernels #21645

Open fancidev opened 6 days ago

fancidev commented 6 days ago

This issue reviews some of the performance gaps of the distance functions in scipy.spatial.distance, and proposes a “plug-in” mechanism aimed at making it easier to address these gaps by delegating the core computation to “distance computation kernels”.

Background

Performance (i.e. speed) has been a key consideration for implementing the distance functions (also referred to as “metrics”) in scipy.spatial.distance. The current approach is to implement the metrics in C++ and expose them to Python (using pybind11) via the cdist and pdist functions. The work in progress is tracked by the following issues and PRs:

Performance is evaluated by a set of benchmark tests. For each metric, four benchmarks are defined, each with three workload configurations:

Note that all workloads cover exclusively vectors with three elements.

The benchmark results are shown on this website. Each benchmark is presented as a chart that plots its run-time (lower is better) over the history of commits. The charts confirm that the distance implementations have performance improved over time.

Challenges

While the current C++ implementation achieves excellent performance for the types of workload it covers, there remain scenarios where it falls short of expectation.

The main performance gap appears to be with large input. For example:

There are also performance issues raised for the Python metric implementations:

While these issues are raised with the Python implementation, an implementation in C++ would likely be more performant, and would certainly reduce code duplication and potential inconsistencies.

Alternatives

Some libraries provide specialized distance functions that improve the performance for certain workloads.

sklearn.metrics provides specialized implementations for euclidean distance and cosine distance, among others. For euclidean_distance they employ a faster formula that saves a constant proportion of instructions and takes advantage of BLAS matrix multiplication for cross distance. The caveat is that the formula is prone to catastrophic cancellation for inputs that varies wildly in scale. For cosine_similarity they use BLAS matrix multiplication for cross distance after normalizing the inputs. The limitation is that the technique can’t readily be applied to pdist type of query.

MATLAB’s pdist function provides the “fasteuclidean” metric as an alternative to “euclidean”. It computes the Euclidean distance in the same way that sklearn does, and therefore shares the same caveat described above.

SimSIMD provides SIMD-accelerated vector-vector distance implementations of Euclidean, cosine, and a few other metrics, for various data types including f16 and bf16. Cross distance computation is implemented by double looping.

Compared to the above alternatives, SciPy offers a few advantages:

Given SciPy’s rich features, it seems impractical for an alternative to fully substitute SciPy’s implementation. Likewise, it seems impractical for SciPy’s implementation to suit all types of workloads, given the diverse requirements. It probably makes the most sense to combine the best of both — that is the idea of this proposal.

Proposal

The goal is to enable the user to access specialized distance implementations via SciPy’s API, so that they can take advantage of SciPy’s rich ecosystem while at the same time fulfilling specific performance requirements for certain use cases.

To achieve this, I propose to split the core computation logic of a metric into “distance computation kernels” with one of the following signatures:

typedef void InType;  // for exposition
typedef void OutType; // for exposition

// Compute the distance between two vectors
typedef int (*DistanceScalar)(
    const InType *x,  // input n-vector
    const InType *y,  // input n-vector
    size_t n,         // number of elements per vector
    OutType *out      // output scalar
);

// Compute the distance between corresponding row vectors in x and y
typedef int (*DistanceVector)(
    const InType *x,  // input m-by-n matrix (row major)
    const InType *y,  // input m-by-n matrix (row major)
    size_t n,         // number of elements per vector
    size_t m,         // number of vectors
    OutType *out      // output m-vector
);

// Compute the distance between every row vector of x versus every row vector of y
typedef int (*DistanceMatrix)(
    const InType *x,  // input mx-by-n matrix (row major)
    const InType *y,  // input my-by-n matrix (row major)
    size_t n,         // number of elements per vector
    size_t mx,        // number of vectors in x
    size_t my,        // number of vectors in y
    OutType *out      // output mx-by-my matrix (row major)
);

A kernel may also have one of the following “extended” signatures to support strided input/output and extra metric arguments (such as weight).

Extended signatures ```C /* All strides are expressed in number of elements (not bytes) */ typedef int (*DistanceScalarEx)( const InType *x, // input n-vector const ssize_t xs[1], // stride of x const InType *y, // input n-vector const ssize_t ys[1], // stride of y size_t n, // number of elements per vector OutType *out, // output scalar const OutType *const *args, // additional arguments const ssize_t *const *strs // stride of additional arguments ); typedef int (*DistanceVectorEx)( const InType *x, // input m-by-n matrix const ssize_t xs[2], // strides of x const InType *y, // input m-by-n matrix const ssize_t ys[2], // strides of y size_t n, // number of elements per vector size_t m, // number of vectors const OutType *out, // output m-vector const ssize_t outs[1], // stride of out const OutType *const *args, // additional arguments const ssize_t *const *strs // stride of additional arguments ); typedef int (*DistanceMatrixEx)( const InType *x, // input mx-by-n matrix const ssize_t xs[2], // stride of x const InType *y, // input my-by-n matrix const ssize_t ys[2], // stride of y size_t n, // number of elements per vector size_t mx, // number of vectors in x size_t my, // number of vectors in y OutType *out, // output mx-by-my matrix const ssize_t outs[2] , // stride of out const OutType *const *args, // additional arguments const ssize_t *const *strs // stride of additional arguments ); ```

Remark. The caller must ensure that (1) the input, output, and extra arguments are properly aligned, (2) the byte stride is a multiple of the element size, and (3) the output buffer does not overlap with any other arguments (i.e. allowing all pointers to be “restrict” qualified).

A backend would provide kernel implementations (with the above signature) for combinations of metrics and data types. SciPy would serve as the front-end to transform a user call to one or more calls to the kernel, after performing preprocessing such as type promotion. To make a kernel available for use by SciPy, the backend would call

scipy.spatial.distance.add_distance_kernel(
    metric_spec: str,         # e.g. 'cosine.s', 'euclidean.v:', 'mahanalobis.m:w.m', 'minkowski.m:w.v:p.s'
    dtype_spec: str,       # e.g. 'f64', 'f32->f64'
    kernel: LowLevelCallable,   # wraps a C function pointer
    backend: str,  # name if the backend
)

Here metric_spec includes the metric name, dimension, striding, and the name and dimension of additional arguments to be passed to the kernel, separated by colon. A standalone colon following the metric name indicates stride support. The dimension of out and the additional arguments is specified by the single letter s (scalar), v (n-vector), or m (n-by-n matrix). dtype_spec specifies the data type of arguments: x and y will have the input_dtype, and out and additional arguments will have the output_dtype.

SciPy would keep track of the kernels for each (metric_spec, dtype_spec) combination. The stock implementations are registered at start-up. When distance computation is requested, the best-match kernel is selected to perform the computation. If no match exists, SciPy would attempt type promotion and possibly other transforms to find a usable kernel.

For an end user to “activate” a specialized metric implementation, the minimum they have to do is to add one line anywhere before calling the distance function, e.g.

import fast_euclidean

...

d = cdist(x, y, 'euclidean') # as usual

This assumes that some fast_euclidean package registers a fast euclidean kernel for the “euclidean” metric and activates itself upon module import. (Though not shown here, it is also possible to install metrics not recognized by SciPy.)

Discussion

How would the proposal help address the challenges listed above?

If the user has a (performance or functional) requirement not satisfied by SciPy, they could install a custom backend with the desired performance / functionality and use it seamlessly with SciPy. No need to wait for SciPy’s implementation (which can take time) or resort to other packages (which can be tricky to integrate).

Broadcasted distance could be easily implemented by calling the PairedDistance kernel. (In fact, all the stock C++ metric implementations are essentially wrapping PairedDistance kernels.)

How does the proposal interact with the existing distance computation infrastructure in SciPy?

The existing implementation of SciPy remains largely intact, except that some refactoring would be needed to isolate the core distance computation code from input preprocessing.

I expect SciPy's stock implementations to continue to improve for general use.

Is the proposed mechanism backward compatible?

Yes, the proposal is fully backward compatible for users of SciPy.

Request for comments

I’d love to hear your feedback on this proposal. Do you think it addresses a valid problem? Do you think the solution makes sense? Do you think it might attract interest to vendor a backend? Any thoughts about the technical details (such as the kernel signature)? All sorts of comments are welcome!

@peterbell10 @czgdp1807 @tylerjereddy @ashvardanian @thomasjpfan

czgdp1807 commented 5 days ago

This is interesting actually. I will go through your comment and infact would like to take this up. Let's meet on call to discuss this (if its okay to meet).

cc: @rgommers

lucascolley commented 5 days ago

There is overlap between the kernel dispatching discussed here and past work on Uarray: https://github.com/scipy/scipy/issues/14353. Uarray work seems to have paused so it may be wise to hold fire before we start reinventing the wheel. See also https://github.com/scientific-python/spatch/issues/1.

fancidev commented 5 days ago

There is overlap between the kernel dispatching discussed here and past work on Uarray: #14353. Uarray work seems to have paused so it may be wise to hold fire before we start reinventing the wheel. See also scientific-python/spatch#1.

Thanks for the links! They are interesting to read.

I think this proposal is quite different from the referenced ones in that it is (1) less ambitious, and (2) more concrete.

For (1), the proposal deals exclusively with dense arrays stored in main memory and eagerly computed on the CPU by a single thread. This is exactly what the current distance computation code in SciPy handles. The limited scope makes the proposal an incremental enhancement rather than a complete rewrite, making it easier to materialize. For example, GPU computation or alternative array backends are completely out of the scope of this proposal.

For (2), the proposal addresses a specific issue — performance of distance computation (for dense arrays on the CPU). To optimize performance, specialized routines are needed for each combination of (metric, CPU, data type); the number of combinations can easily reach hundreds and become unmanageable by the limited developer resource of SciPy. In fact, SciPy’s distance functions currently only support f64 input, which might have rendered themselves irrelevant for applications that operate on large amount of low precision data. This proposal makes an attempt to make at least the API relevant.

It might be helpful to compare the proposed kernel interface to BLAS — both define a concrete set of operations, for which efficient implementations are provided by some backend and consumed by some frontend.

lorentzenchr commented 4 days ago

@jjerphan @Micky774 your inout might be valuable.

I would recommend to invest some effort into an analysis of similarities and difference between scipy.spatial.distance and sklearn.metrics.pairwise. No need to have 2 largely overlapping implementations of the same functionality so close in the ecosystem.

Then about performance: The 3 main drivers are cache efficiency, use of simd, use of multithreading.

jjerphan commented 4 days ago

Thank you for the heads-up, @lorentzenchr.

This RFC is indeed valuable since there's some overlap in the ecosystem and unaddressed issues for distance metrics.

I am genuinely interested in participating actively to the discussions, but I lack time for now (I have other commitments which I want to honor first). I can provide a few pointers for issues or PRs on scikit-learn and participate passively when I get some time.

jjerphan commented 4 days ago

A list of a few elements which I might update from time to time:

fancidev commented 3 days ago

Thanks for the comments!

No need to have 2 largely overlapping implementations of the same functionality so close in the ecosystem.

From my limited knowledge about both libraries, it appears that scipy’s distance functions focus more on 3D vectors, while sklearn’s functions focus more on large vectors with hundreds to thousands of elements. Another major feature of sklearn is the support for sparse matrix.

Then about performance: The 3 main drivers are cache efficiency, use of simd, use of multithreading.

While this RFC is not about how to improve speed itself (but about how to make it possible to plug in performant implementations), I could share some color about the current performance characteristics of scipy’s implementation.

Cache efficiency seems ok as data are mostly read and written in serial (no jumping around). For long vectors (of hundreds to thousands elements) the manual unrolling that SciPy implements may read from up to 5 (6 if weighted) pages at the same time. Depending on the hardware and system load this may improve or degrade performance. Light experiments on my own computer appears to suggest it still improves performance.

SIMD is not explicitly coded but relies on compiler auto-vectorization. I’m not sure SciPy has the resource to hand-tune SIMD intrinsics. Light experiments suggest that there’s still room to adjust the compiler flags to improve performance significantly without sacrificing accuracy. Separately, dynamic dispatching might be more distribution friendly, and e.g. SimSIMD already does that.

Multi-threading can be implemented at different levels, and it may be tricky to coordinate. An obvious candidate is OpenMP, but a quick search on GitHub Issues reveals a few practical obstacles. SciPy’s distance functions are single threaded (but thread safe) and I’m not aware of any plan to make them threaded.

rgommers commented 3 days ago

Thanks @fancidev! Overall the motivation and ideas for a plugin infrastructure look good to me. It would be great to be able to have a good interface for integration of a package like SimSIMD.

one or more of the following C interfaces

Regarding the signatures: the most important choice that jumps out to me is that you're choosing to include strides. It may be easier and better for performance for another library to only support contiguous arrays. JAX and Pythran are two examples of packages that made this choice. So it's worth thinking about whether you'd have a contiguous-only interface either in addition to or instead of the strided one.

To achieve this, I propose to split the core computation logic of a metric into “distance computation kernels” that implement one or more of the following C interfaces: [...] To make a kernel available for use by SciPy, the backend would “register” the (metric, in_dtype, out_dtype) combinations it supports by calling the following:

Did you prototype this mechanism? Some important details of how this could work are missing, for example you can't just pass C function pointers to a Python-level registration mechanism. I suspect it should use scipy.LowLevelCallable or something similar to it. The closest analog to it we have is http://scipy.github.io/devdocs/tutorial/ndimage.html#extending-scipy-ndimage-in-c.

fancidev commented 3 days ago

Thanks for the comments @rgommers. They're very good points!

So it's worth thinking about whether you'd have a contiguous-only interface either in addition to or instead of the strided one.

My initial objective was to fully support the current SciPy behavior. The current SciPy distance implementation supports arbitrary strides between rows and between columns. This ensures zero copy as long as the elements are aligned, though the performance can be suboptimal if the strides are not one.

I made a compromise between generality and performance by including only "row strides" in the signature -- the elements within each vector are still contiguous. I find that the BLAS routines ?GEMM and ?GEMV both accept a stride, so I assume including such a stride has no impact on performance. In addition, SimSIMD uses a double loop to implement cdist, and supporting row strides is "free".

But this indeed leaves the interface in an awkward position where it has redundant arguments for the best case (fully contiguous input), and yet it does not support zero-copy in the worst case (where input data has non-zero column stride).

So I'm going to rearrange the signatures into a "simple” interface that assumes fully contiguous input/output and a “extended” interface" that supports arbitrary strided inputs as well as extra arguments. Performance oriented implementations only need to support the "simple” interface. I guess that should be good enough for most real-world performance sensitive tasks.

Did you prototype this mechanism? Some important details of how this could work are missing, for example you can't just pass C function pointers to a Python-level registration mechanism. I suspect it should use scipy.LowLevelCallable or something similar to it. The closest analog to it we have is http://scipy.github.io/devdocs/tutorial/ndimage.html#extending-scipy-ndimage-in-c.

Indeed that "registration" code segment is a sketch. I wasn't aware of scipy.LowLevelCallable, let me look into it.

What I did in a prototype was: (1) build SimSIMD into a pure-C DLL (without Python binding), (2) use ctypes in Python to load the DLL, (3) load the exported C function symbol as a ctypes function, and (4) use numpy.ctypeslib.as_ctypes to call the function with NumPy array input. This works, but it is clearly quite far from a production grade solution.

ashvardanian commented 3 days ago

@fancidev, for context, SimSIMD exposes the function pointers into the Python API as well, and it may be compatible with scipy.LowLevelCallable, mentioned by @rgommers:

from simsimd import pointer_to_sqeuclidean, pointer_to_cosine, pointer_to_inner

I chose vanilla integers (to cast pointers) as opposed to Python capsules for usability, but I can add alternative interfaces as well.

tylerjereddy commented 2 days ago

Seems like an interesting proposal. I agree with above comments about trying to coordinate with sklearn where reasonable, and of course there's the usual concern about extra complexity/maintenance burden.

I make pretty heavy use of pdist/cdist and I suspect many others do too, so keeping those the same (or better) is important I think.

I do wonder if there's any concern about the sample call signature above d = cdist(x, y, 'euclidean') being indistinguishable from the "native SciPy" call signature with respect to reasoning/clarity. For example, is there merit to having the "explicit is better than implicit" philosophy on the name of the metric at least telling us that it has an external provider like euclidean_tpl for the third party lib or something like that.

fancidev commented 2 days ago

I chose vanilla integers (to cast pointers) as opposed to Python capsules for usability, but I can add alternative interfaces as well.

Thanks for the heads up! I’m working on a prototype to try this out.

For example, is there merit to having the "explicit is better than implicit" philosophy on the name of the metric at least telling us that it has an external provider like euclidean_tpl for the third party lib or something like that.

Good point! I’m going to prototype a “new_cdist” that takes an optional backend argument that allows the user to explicitly specify a backend. The backend may also use a metric name like euclidean_tpl if it choose to.

Micky774 commented 1 day ago

Thank you for taking the time to create this discussion! I'm definitely +1 for a unified plug-in mechanism. It definitely is the better "upstream" method to implement a lot of the work I've tried nudging towards in scikit-learn while compartmentalizing the complexity.

I'm wary of how we handle the extra args, especially for cases where you have multiple extra args for a common function implemented via multiple backends, yet the order of the args are permuted in the different backends. If an implementation doesn't thoroughly validate its extra args (which IMO should be a python-level task not a C-level one) then this could cause issues that are potentially opaque to end users, especially since one can simply say it was user error for assuming an incorrect API. Most notable, it would prevent a real "drop-in" solution.

This would mainly be an issue, I imagine, in areas where SciPy doesn't have an established API (e.g. new metrics / kernels) and multiple libraries implement differing APIs for the same semantic function. This isn't a huge area right now, but w/ the adoption of this framework, it may be more likely to be an issue later. It may just have to be a "buyer beware" situation where folks should be aware to individually validate the APIs of the backends they use, in case they differ from what they're expecting.

With that being said, the presence of the void* args would allow for "functional" or "fused" kernels which complete separate computation in a way that is local to the specific distance calculations, optimizing for cache efficiency, memory movement, and vtable lookups -- scikit-learn leverages this quite heavily in some of its most performance-critical sections thanks to @jjerphan's system. This wouldn't be something end users would be expected to care about -- or see in the first place -- but it would be a nice tool for library developers.

TL;DR

lucascolley commented 1 day ago

It may just have to be a "buyer beware" situation where folks should be aware to individually validate the APIs of the backends they use, in case they differ from what they're expecting.

Yeah, this is one of those aspects that is on the level of generality of dispatching ideas like spatch I linked to above. I guess it would be nice to try to be consistent with any efforts like that.

ashvardanian commented 1 day ago

I think it's common for power users to encounter "buyer beware" situations, as seen with SciPy and NumPy, where performance varies based on the detected BLAS implementation.


This proposal opens an opportunity to refactor SciPy’s distances module. There are several issues I've noticed in the last months, and I'm curious if the community agrees.

  1. Weighted Functions Usage:
    Weighted variants are conceptually interesting, but rarely used. This also complicates the API since some distances, like Jensen-Shannon and Mahalanobis, don’t support weighting.

  2. Irrelevant Kernels:
    Many documented distance functions, such as Kulczynski1 and Sokal-Michener, are unpopular. There are overlaps, like cosine calling correlation. Also, euclidean is a trivial extension of sqeuclidean distance. Supporting all of those variants across all SIMD generations is troublesome. Smaller palette of more optimized kernels may be a good idea.

  3. Mixed Precision:
    SciPy kernels operating on small data types (i8, i16, etc.) may overflow. Upcasting inputs can avoid errors without breaking the API. Implementing mixed-precision optimizations could improve robustness.

  4. Signature and Input Representation:
    Python lacks descriptors for types like bf16 or compressed bit-matrix formats (np.packbits), which are useful for boolean distances. SciPy's reliance on integers for booleans limits its performance. A dtype= option for specialized data would help address this.

tylerjereddy commented 1 day ago

Many documented distance functions, such as Kulczynski1 and Sokal-Michener, are unpopular.

Some potential good news there, both were deprecated in gh-21572, and we coordinated with sklearn such that they may also follow suit.

fancidev commented 1 day ago

Thanks for the comments @Micky774 !

I'm wary of how we handle the extra args, especially for cases where you have multiple extra args for a common function implemented via multiple backends, yet the order of the args are permuted in the different backends.

What I have in mind is to (1) let the backend declare the order, name, and dimension of additional parameters it supports, (2) let the user call cdist passing additional arguments as keyword-only, and (3) let the front end validate the user supplied arguments.

To prevent “combinatorial explosion”, I do require additional parameters to (1) have the same dtype as out (converting user input if necessary), and (2) have shape (), (n,), or (n,n) where n is the vector dimension.

I would be interested in seeing whether existing scikit-learn fused kernels can be re-contextualized in terms of the proposed API.

I’m not too familiar with scikit-learn. Could you elaborate a bit what is “fused kernels”?

fancidev commented 20 hours ago

Those are interesting points! @ashvardanian

Weighted variants are conceptually interesting, but rarely used.

I too guess they’re not widely used. At least I’ve not see a weight parameter in the distance function of Scikit-Learn, MATLAB, Mathematica, STATA.

This also complicates the API since some distances, like Jensen-Shannon and Mahalanobis, don’t support weighting.

The backend doesn’t have to support weighting if it doesn’t bother to. Though technically, one could define weights for Jensen-Shannon reasonably, and the inverse covariance matrix of Mahalanobis could be considered a weight matrix itself (in other words, Mahalanobis could be interpreted as matrix-weighted Euclidean).

Also, euclidean is a trivial extension of sqeuclidean distance.

I think euclidean is more subtle than sqeuclidean, because the obvious calculation (taking square root of sum of squares) may give an overflow result when the correct result does not overflow, and a more careful calculation that avoids this problem is likely slower. This is the kind of trade-off that a backend has got to offer for the “power user”.

Supporting all of those variants across all SIMD generations is troublesome. Smaller palette of more optimized kernels may be a good idea.

I agree!

SciPy kernels operating on small data types (i8, i16, etc.) may overflow.

I believe SciPy upcasts all integral input to float64 before calculation. There is room for performance improvement if the input is integral (e.g. SimSIMD can compute cosine distance from i8 input). Some details need to be worked out for the (to-be) front-end with respect to input type conversion.

Python lacks descriptors for types like bf16 or compressed bit-matrix formats (np.packbits), which are useful for boolean distances

Some work can be done by the front-end to “recognize” bf16 or bitmask input. For example, one might call pdist(BFloat16(x)) to let the front-end “interpret” a 16-bit array as containing bf16 numbers. Similar enhancements could come as follow ups.

fancidev commented 5 hours ago

I added a prototype (gh-21669) to illustrate how the proposed interfaces might look like in practice.

The minimal working code (from the Python side) to register and use a kernel is the following:

import scipy
import scipy.spatial.distance as sd
capsule = sd._distance_wrap.cosine_DistanceMatrix_capsule()
kernel = scipy.LowLevelCallable(capsule)
sd.add_distance_kernel('cosine.m', 'f64', kernel, 'alt')
sd.cdist([[1, 2, 3]], [[4, 5, 6]], 'new_cosine', backend='alt')

The (backend) code on the C side involves a simple wrapper to make the kernel conform to the proposed interface, plus a simple wrapper to export the function to Python. The second wrapper is unnecessary if the backend is built as a dynamic library that exports the function; a few more lines of Python code would be needed in this case.

It could be helpful to check the rendered docstring of add_distance_kernel to get more details on how to register a distance computation kernel.