Open fancidev opened 6 days ago
This is interesting actually. I will go through your comment and infact would like to take this up. Let's meet on call to discuss this (if its okay to meet).
cc: @rgommers
There is overlap between the kernel dispatching discussed here and past work on Uarray: https://github.com/scipy/scipy/issues/14353. Uarray work seems to have paused so it may be wise to hold fire before we start reinventing the wheel. See also https://github.com/scientific-python/spatch/issues/1.
There is overlap between the kernel dispatching discussed here and past work on Uarray: #14353. Uarray work seems to have paused so it may be wise to hold fire before we start reinventing the wheel. See also scientific-python/spatch#1.
Thanks for the links! They are interesting to read.
I think this proposal is quite different from the referenced ones in that it is (1) less ambitious, and (2) more concrete.
For (1), the proposal deals exclusively with dense arrays stored in main memory and eagerly computed on the CPU by a single thread. This is exactly what the current distance computation code in SciPy handles. The limited scope makes the proposal an incremental enhancement rather than a complete rewrite, making it easier to materialize. For example, GPU computation or alternative array backends are completely out of the scope of this proposal.
For (2), the proposal addresses a specific issue — performance of distance computation (for dense arrays on the CPU). To optimize performance, specialized routines are needed for each combination of (metric, CPU, data type); the number of combinations can easily reach hundreds and become unmanageable by the limited developer resource of SciPy. In fact, SciPy’s distance functions currently only support f64
input, which might have rendered themselves irrelevant for applications that operate on large amount of low precision data. This proposal makes an attempt to make at least the API relevant.
It might be helpful to compare the proposed kernel interface to BLAS — both define a concrete set of operations, for which efficient implementations are provided by some backend and consumed by some frontend.
@jjerphan @Micky774 your inout might be valuable.
I would recommend to invest some effort into an analysis of similarities and difference between scipy.spatial.distance
and sklearn.metrics.pairwise
. No need to have 2 largely overlapping implementations of the same functionality so close in the ecosystem.
Then about performance: The 3 main drivers are cache efficiency, use of simd, use of multithreading.
Thank you for the heads-up, @lorentzenchr.
This RFC is indeed valuable since there's some overlap in the ecosystem and unaddressed issues for distance metrics.
I am genuinely interested in participating actively to the discussions, but I lack time for now (I have other commitments which I want to honor first). I can provide a few pointers for issues or PRs on scikit-learn and participate passively when I get some time.
A list of a few elements which I might update from time to time:
Thanks for the comments!
No need to have 2 largely overlapping implementations of the same functionality so close in the ecosystem.
From my limited knowledge about both libraries, it appears that scipy’s distance functions focus more on 3D vectors, while sklearn’s functions focus more on large vectors with hundreds to thousands of elements. Another major feature of sklearn is the support for sparse matrix.
Then about performance: The 3 main drivers are cache efficiency, use of simd, use of multithreading.
While this RFC is not about how to improve speed itself (but about how to make it possible to plug in performant implementations), I could share some color about the current performance characteristics of scipy’s implementation.
Cache efficiency seems ok as data are mostly read and written in serial (no jumping around). For long vectors (of hundreds to thousands elements) the manual unrolling that SciPy implements may read from up to 5 (6 if weighted) pages at the same time. Depending on the hardware and system load this may improve or degrade performance. Light experiments on my own computer appears to suggest it still improves performance.
SIMD is not explicitly coded but relies on compiler auto-vectorization. I’m not sure SciPy has the resource to hand-tune SIMD intrinsics. Light experiments suggest that there’s still room to adjust the compiler flags to improve performance significantly without sacrificing accuracy. Separately, dynamic dispatching might be more distribution friendly, and e.g. SimSIMD already does that.
Multi-threading can be implemented at different levels, and it may be tricky to coordinate. An obvious candidate is OpenMP, but a quick search on GitHub Issues reveals a few practical obstacles. SciPy’s distance functions are single threaded (but thread safe) and I’m not aware of any plan to make them threaded.
Thanks @fancidev! Overall the motivation and ideas for a plugin infrastructure look good to me. It would be great to be able to have a good interface for integration of a package like SimSIMD.
one or more of the following C interfaces
Regarding the signatures: the most important choice that jumps out to me is that you're choosing to include strides. It may be easier and better for performance for another library to only support contiguous arrays. JAX and Pythran are two examples of packages that made this choice. So it's worth thinking about whether you'd have a contiguous-only interface either in addition to or instead of the strided one.
To achieve this, I propose to split the core computation logic of a metric into “distance computation kernels” that implement one or more of the following C interfaces: [...] To make a kernel available for use by SciPy, the backend would “register” the (metric, in_dtype, out_dtype) combinations it supports by calling the following:
Did you prototype this mechanism? Some important details of how this could work are missing, for example you can't just pass C function pointers to a Python-level registration mechanism. I suspect it should use scipy.LowLevelCallable
or something similar to it. The closest analog to it we have is http://scipy.github.io/devdocs/tutorial/ndimage.html#extending-scipy-ndimage-in-c.
Thanks for the comments @rgommers. They're very good points!
So it's worth thinking about whether you'd have a contiguous-only interface either in addition to or instead of the strided one.
My initial objective was to fully support the current SciPy behavior. The current SciPy distance implementation supports arbitrary strides between rows and between columns. This ensures zero copy as long as the elements are aligned, though the performance can be suboptimal if the strides are not one.
I made a compromise between generality and performance by including only "row strides" in the signature -- the elements within each vector are still contiguous. I find that the BLAS routines ?GEMM and ?GEMV both accept a stride, so I assume including such a stride has no impact on performance. In addition, SimSIMD uses a double loop to implement cdist
, and supporting row strides is "free".
But this indeed leaves the interface in an awkward position where it has redundant arguments for the best case (fully contiguous input), and yet it does not support zero-copy in the worst case (where input data has non-zero column stride).
So I'm going to rearrange the signatures into a "simple” interface that assumes fully contiguous input/output and a “extended” interface" that supports arbitrary strided inputs as well as extra arguments. Performance oriented implementations only need to support the "simple” interface. I guess that should be good enough for most real-world performance sensitive tasks.
Did you prototype this mechanism? Some important details of how this could work are missing, for example you can't just pass C function pointers to a Python-level registration mechanism. I suspect it should use
scipy.LowLevelCallable
or something similar to it. The closest analog to it we have is http://scipy.github.io/devdocs/tutorial/ndimage.html#extending-scipy-ndimage-in-c.
Indeed that "registration" code segment is a sketch. I wasn't aware of scipy.LowLevelCallable
, let me look into it.
What I did in a prototype was: (1) build SimSIMD into a pure-C DLL (without Python binding), (2) use ctypes
in Python to load the DLL, (3) load the exported C function symbol as a ctypes
function, and (4) use numpy.ctypeslib.as_ctypes
to call the function with NumPy array input. This works, but it is clearly quite far from a production grade solution.
@fancidev, for context, SimSIMD exposes the function pointers into the Python API as well, and it may be compatible with scipy.LowLevelCallable
, mentioned by @rgommers:
from simsimd import pointer_to_sqeuclidean, pointer_to_cosine, pointer_to_inner
I chose vanilla integers (to cast pointers) as opposed to Python capsules for usability, but I can add alternative interfaces as well.
Seems like an interesting proposal. I agree with above comments about trying to coordinate with sklearn where reasonable, and of course there's the usual concern about extra complexity/maintenance burden.
I make pretty heavy use of pdist
/cdist
and I suspect many others do too, so keeping those the same (or better) is important I think.
I do wonder if there's any concern about the sample call signature above d = cdist(x, y, 'euclidean')
being indistinguishable from the "native SciPy" call signature with respect to reasoning/clarity. For example, is there merit to having the "explicit is better than implicit" philosophy on the name of the metric at least telling us that it has an external provider like euclidean_tpl
for the third party lib or something like that.
I chose vanilla integers (to cast pointers) as opposed to Python capsules for usability, but I can add alternative interfaces as well.
Thanks for the heads up! I’m working on a prototype to try this out.
For example, is there merit to having the "explicit is better than implicit" philosophy on the name of the metric at least telling us that it has an external provider like
euclidean_tpl
for the third party lib or something like that.
Good point! I’m going to prototype a “new_cdist” that takes an optional backend
argument that allows the user to explicitly specify a backend. The backend may also use a metric name like euclidean_tpl
if it choose to.
Thank you for taking the time to create this discussion! I'm definitely +1 for a unified plug-in mechanism. It definitely is the better "upstream" method to implement a lot of the work I've tried nudging towards in scikit-learn while compartmentalizing the complexity.
I'm wary of how we handle the extra args, especially for cases where you have multiple extra args for a common function implemented via multiple backends, yet the order of the args are permuted in the different backends. If an implementation doesn't thoroughly validate its extra args (which IMO should be a python-level task not a C-level one) then this could cause issues that are potentially opaque to end users, especially since one can simply say it was user error for assuming an incorrect API. Most notable, it would prevent a real "drop-in" solution.
This would mainly be an issue, I imagine, in areas where SciPy doesn't have an established API (e.g. new metrics / kernels) and multiple libraries implement differing APIs for the same semantic function. This isn't a huge area right now, but w/ the adoption of this framework, it may be more likely to be an issue later. It may just have to be a "buyer beware" situation where folks should be aware to individually validate the APIs of the backends they use, in case they differ from what they're expecting.
With that being said, the presence of the void*
args would allow for "functional" or "fused" kernels which complete separate computation in a way that is local to the specific distance calculations, optimizing for cache efficiency, memory movement, and vtable lookups -- scikit-learn leverages this quite heavily in some of its most performance-critical sections thanks to @jjerphan's system. This wouldn't be something end users would be expected to care about -- or see in the first place -- but it would be a nice tool for library developers.
TL;DR
void*
argsIt may just have to be a "buyer beware" situation where folks should be aware to individually validate the APIs of the backends they use, in case they differ from what they're expecting.
Yeah, this is one of those aspects that is on the level of generality of dispatching ideas like spatch
I linked to above. I guess it would be nice to try to be consistent with any efforts like that.
I think it's common for power users to encounter "buyer beware" situations, as seen with SciPy and NumPy, where performance varies based on the detected BLAS implementation.
This proposal opens an opportunity to refactor SciPy’s distances module. There are several issues I've noticed in the last months, and I'm curious if the community agrees.
Weighted Functions Usage:
Weighted variants are conceptually interesting, but rarely used. This also complicates the API since some distances, like Jensen-Shannon and Mahalanobis, don’t support weighting.
Irrelevant Kernels:
Many documented distance functions, such as Kulczynski1 and Sokal-Michener, are unpopular. There are overlaps, like cosine
calling correlation
. Also, euclidean
is a trivial extension of sqeuclidean
distance. Supporting all of those variants across all SIMD generations is troublesome. Smaller palette of more optimized kernels may be a good idea.
Mixed Precision:
SciPy kernels operating on small data types (i8
, i16
, etc.) may overflow. Upcasting inputs can avoid errors without breaking the API. Implementing mixed-precision optimizations could improve robustness.
Signature and Input Representation:
Python lacks descriptors for types like bf16
or compressed bit-matrix formats (np.packbits
), which are useful for boolean distances. SciPy's reliance on integers for booleans limits its performance. A dtype=
option for specialized data would help address this.
Many documented distance functions, such as Kulczynski1 and Sokal-Michener, are unpopular.
Some potential good news there, both were deprecated in gh-21572, and we coordinated with sklearn such that they may also follow suit.
Thanks for the comments @Micky774 !
I'm wary of how we handle the extra args, especially for cases where you have multiple extra args for a common function implemented via multiple backends, yet the order of the args are permuted in the different backends.
What I have in mind is to (1) let the backend declare the order, name, and dimension of additional parameters it supports, (2) let the user call cdist passing additional arguments as keyword-only, and (3) let the front end validate the user supplied arguments.
To prevent “combinatorial explosion”, I do require additional parameters to (1) have the same dtype as out
(converting user input if necessary), and (2) have shape ()
, (n,)
, or (n,n)
where n is the vector dimension.
I would be interested in seeing whether existing scikit-learn fused kernels can be re-contextualized in terms of the proposed API.
I’m not too familiar with scikit-learn. Could you elaborate a bit what is “fused kernels”?
Those are interesting points! @ashvardanian
Weighted variants are conceptually interesting, but rarely used.
I too guess they’re not widely used. At least I’ve not see a weight parameter in the distance function of Scikit-Learn, MATLAB, Mathematica, STATA.
This also complicates the API since some distances, like Jensen-Shannon and Mahalanobis, don’t support weighting.
The backend doesn’t have to support weighting if it doesn’t bother to. Though technically, one could define weights for Jensen-Shannon reasonably, and the inverse covariance matrix of Mahalanobis could be considered a weight matrix itself (in other words, Mahalanobis could be interpreted as matrix-weighted Euclidean).
Also,
euclidean
is a trivial extension ofsqeuclidean
distance.
I think euclidean
is more subtle than sqeuclidean
, because the obvious calculation (taking square root of sum of squares) may give an overflow result when the correct result does not overflow, and a more careful calculation that avoids this problem is likely slower. This is the kind of trade-off that a backend has got to offer for the “power user”.
Supporting all of those variants across all SIMD generations is troublesome. Smaller palette of more optimized kernels may be a good idea.
I agree!
SciPy kernels operating on small data types (
i8
,i16
, etc.) may overflow.
I believe SciPy upcasts all integral input to float64 before calculation. There is room for performance improvement if the input is integral (e.g. SimSIMD can compute cosine distance from i8 input). Some details need to be worked out for the (to-be) front-end with respect to input type conversion.
Python lacks descriptors for types like
bf16
or compressed bit-matrix formats (np.packbits
), which are useful for boolean distances
Some work can be done by the front-end to “recognize” bf16 or bitmask input. For example, one might call pdist(BFloat16(x))
to let the front-end “interpret” a 16-bit array as containing bf16 numbers. Similar enhancements could come as follow ups.
I added a prototype (gh-21669) to illustrate how the proposed interfaces might look like in practice.
The minimal working code (from the Python side) to register and use a kernel is the following:
import scipy
import scipy.spatial.distance as sd
capsule = sd._distance_wrap.cosine_DistanceMatrix_capsule()
kernel = scipy.LowLevelCallable(capsule)
sd.add_distance_kernel('cosine.m', 'f64', kernel, 'alt')
sd.cdist([[1, 2, 3]], [[4, 5, 6]], 'new_cosine', backend='alt')
The (backend) code on the C side involves a simple wrapper to make the kernel conform to the proposed interface, plus a simple wrapper to export the function to Python. The second wrapper is unnecessary if the backend is built as a dynamic library that exports the function; a few more lines of Python code would be needed in this case.
It could be helpful to check the rendered docstring of add_distance_kernel
to get more details on how to register a distance computation kernel.
This issue reviews some of the performance gaps of the distance functions in
scipy.spatial.distance
, and proposes a “plug-in” mechanism aimed at making it easier to address these gaps by delegating the core computation to “distance computation kernels”.Background
Performance (i.e. speed) has been a key consideration for implementing the distance functions (also referred to as “metrics”) in
scipy.spatial.distance
. The current approach is to implement the metrics in C++ and expose them to Python (using pybind11) via the cdist and pdist functions. The work in progress is tracked by the following issues and PRs:bool
input for Boolean metrics. The current C++ implementation convertsbool
todouble
.Performance is evaluated by a set of benchmark tests. For each metric, four benchmarks are defined, each with three workload configurations:
Note that all workloads cover exclusively vectors with three elements.
The benchmark results are shown on this website. Each benchmark is presented as a chart that plots its run-time (lower is better) over the history of commits. The charts confirm that the distance implementations have performance improved over time.
Challenges
While the current C++ implementation achieves excellent performance for the types of workload it covers, there remain scenarios where it falls short of expectation.
The main performance gap appears to be with large input. For example:
gh-16023 reports suboptimal performance of
cdist('euclidean')
forf32
input of size 3360x3072 vs 15904x3072. An alternative formula is suggested in the discussion, but it is prone to catastrophic cancellation and therefore unlikely to be accepted for SciPy.gh-12570 suggests an alternative formula to compute
cdist('mahanalobis')
faster for input of size 500x500 vs 500x500. However, this formula is prone to the same catastrophic cancellation problem and is therefore unlikely to be accepted for SciPy.gh-11280 raises a requirement to compute distance for
f32
andf16
input of length 60K. The original intent was to save memory, but it certainly has a speed impact too.f32
is possible with the current C++ infrastructure (though not yet implemented).f16
seems much more challenging to implement within the current infrastructure.gh-5657 discusses the general use of BLAS operations (dot product and matrix multiplication) for distance computations involving vectors of length 2K to 10K. The discussions seem to be inclusive with no clear action plan.
gh-19454 introduces an SIMD-based library that boosts performance especially for lower-precision input. The benchmarks involve vectors of length 1536.
There are also performance issues raised for the Python metric implementations:
gh-20147 reports performance degrade of
distance.mahanalobis
for input dimension above 100. It seems to be related to an upstream bug.gh-20561 and gh-20570 propose a micro-optimization for
distance.chebyshev
that improves speed for input vector over 40 dimensions.gh-12537, gh-6089 and gh-6088 ask for support of broadcasted, instead of cross, distance. They suggest to add an
axis
parameter to the Python implementations to take advantage of NumPy's vectorization.While these issues are raised with the Python implementation, an implementation in C++ would likely be more performant, and would certainly reduce code duplication and potential inconsistencies.
Alternatives
Some libraries provide specialized distance functions that improve the performance for certain workloads.
sklearn.metrics provides specialized implementations for euclidean distance and cosine distance, among others. For euclidean_distance they employ a faster formula that saves a constant proportion of instructions and takes advantage of BLAS matrix multiplication for cross distance. The caveat is that the formula is prone to catastrophic cancellation for inputs that varies wildly in scale. For cosine_similarity they use BLAS matrix multiplication for cross distance after normalizing the inputs. The limitation is that the technique can’t readily be applied to pdist type of query.
MATLAB’s pdist function provides the “fasteuclidean” metric as an alternative to “euclidean”. It computes the Euclidean distance in the same way that sklearn does, and therefore shares the same caveat described above.
SimSIMD provides SIMD-accelerated vector-vector distance implementations of Euclidean, cosine, and a few other metrics, for various data types including f16 and bf16. Cross distance computation is implemented by double looping.
Compared to the above alternatives, SciPy offers a few advantages:
Given SciPy’s rich features, it seems impractical for an alternative to fully substitute SciPy’s implementation. Likewise, it seems impractical for SciPy’s implementation to suit all types of workloads, given the diverse requirements. It probably makes the most sense to combine the best of both — that is the idea of this proposal.
Proposal
The goal is to enable the user to access specialized distance implementations via SciPy’s API, so that they can take advantage of SciPy’s rich ecosystem while at the same time fulfilling specific performance requirements for certain use cases.
To achieve this, I propose to split the core computation logic of a metric into “distance computation kernels” with one of the following signatures:
A kernel may also have one of the following “extended” signatures to support strided input/output and extra metric arguments (such as weight).
Extended signatures
```C /* All strides are expressed in number of elements (not bytes) */ typedef int (*DistanceScalarEx)( const InType *x, // input n-vector const ssize_t xs[1], // stride of x const InType *y, // input n-vector const ssize_t ys[1], // stride of y size_t n, // number of elements per vector OutType *out, // output scalar const OutType *const *args, // additional arguments const ssize_t *const *strs // stride of additional arguments ); typedef int (*DistanceVectorEx)( const InType *x, // input m-by-n matrix const ssize_t xs[2], // strides of x const InType *y, // input m-by-n matrix const ssize_t ys[2], // strides of y size_t n, // number of elements per vector size_t m, // number of vectors const OutType *out, // output m-vector const ssize_t outs[1], // stride of out const OutType *const *args, // additional arguments const ssize_t *const *strs // stride of additional arguments ); typedef int (*DistanceMatrixEx)( const InType *x, // input mx-by-n matrix const ssize_t xs[2], // stride of x const InType *y, // input my-by-n matrix const ssize_t ys[2], // stride of y size_t n, // number of elements per vector size_t mx, // number of vectors in x size_t my, // number of vectors in y OutType *out, // output mx-by-my matrix const ssize_t outs[2] , // stride of out const OutType *const *args, // additional arguments const ssize_t *const *strs // stride of additional arguments ); ```Remark. The caller must ensure that (1) the input, output, and extra arguments are properly aligned, (2) the byte stride is a multiple of the element size, and (3) the output buffer does not overlap with any other arguments (i.e. allowing all pointers to be “restrict” qualified).
A backend would provide kernel implementations (with the above signature) for combinations of metrics and data types. SciPy would serve as the front-end to transform a user call to one or more calls to the kernel, after performing preprocessing such as type promotion. To make a kernel available for use by SciPy, the backend would call
Here
metric_spec
includes the metric name, dimension, striding, and the name and dimension of additional arguments to be passed to the kernel, separated by colon. A standalone colon following the metric name indicates stride support. The dimension ofout
and the additional arguments is specified by the single letters
(scalar),v
(n-vector), orm
(n-by-n matrix).dtype_spec
specifies the data type of arguments:x
andy
will have theinput_dtype
, andout
and additional arguments will have theoutput_dtype
.SciPy would keep track of the kernels for each (metric_spec, dtype_spec) combination. The stock implementations are registered at start-up. When distance computation is requested, the best-match kernel is selected to perform the computation. If no match exists, SciPy would attempt type promotion and possibly other transforms to find a usable kernel.
For an end user to “activate” a specialized metric implementation, the minimum they have to do is to add one line anywhere before calling the distance function, e.g.
This assumes that some
fast_euclidean
package registers a fast euclidean kernel for the “euclidean” metric and activates itself upon module import. (Though not shown here, it is also possible to install metrics not recognized by SciPy.)Discussion
How would the proposal help address the challenges listed above?
If the user has a (performance or functional) requirement not satisfied by SciPy, they could install a custom backend with the desired performance / functionality and use it seamlessly with SciPy. No need to wait for SciPy’s implementation (which can take time) or resort to other packages (which can be tricky to integrate).
Broadcasted distance could be easily implemented by calling the PairedDistance kernel. (In fact, all the stock C++ metric implementations are essentially wrapping PairedDistance kernels.)
How does the proposal interact with the existing distance computation infrastructure in SciPy?
The existing implementation of SciPy remains largely intact, except that some refactoring would be needed to isolate the core distance computation code from input preprocessing.
I expect SciPy's stock implementations to continue to improve for general use.
Is the proposed mechanism backward compatible?
Yes, the proposal is fully backward compatible for users of SciPy.
Request for comments
I’d love to hear your feedback on this proposal. Do you think it addresses a valid problem? Do you think the solution makes sense? Do you think it might attract interest to vendor a backend? Any thoughts about the technical details (such as the kernel signature)? All sorts of comments are welcome!
@peterbell10 @czgdp1807 @tylerjereddy @ashvardanian @thomasjpfan