YuhanLiin opened 3 years ago
I used the rstar crate in an initial implementation of the approximate DBSCAN algorithm and it was nice to use. The only thing is that it wasn't possible to use it with an arbitrary number of features, so I had to use const generics. Maybe now that const generics are in stable Rust it could be easier to use it inside linfa; it would certainly be interesting to try 👍🏻
For it to be compatible with the rest of Linfa we'd have to make rstar work with ndarray, which uses runtime values as dimensions, so I don't think const generics help there.
This is a good analysis of the state of space-partitioning algorithms at the moment :+1: Do you think that approximate nearest neighbor would also fit into that? I have used the hnsw crate for sparse kernel approximation here https://github.com/rust-ml/linfa/blob/master/algorithms/linfa-kernel/src/sparse.rs#L24 and it worked well.
IMO approximate nearest neighbour is a different problem from nearest neighbour, so it'd have its own dedicated section/module. We could also aggregate/integrate approximate nearest neighbour algorithms into Linfa, but I'm not familiar with the state of approximate neighbours in the ecosystem so I decided to stick to normal nearest neighbours.
Yeah, approximate nearest neighbour should be added behind a feature flag for linfa-nn. This could pull in an additional dependency and add a hyper-parameter to set the recall/speed tradeoff. First we should stick to the space-partitioning algorithms you've listed.
One concern is that rstar (and likely many of the other listed crates as well) only accepts inputs with dimensionality known at compile time. This is incompatible with ndarray, so integrating the algorithm into Linfa would involve putting a reworked version of the code into our repo instead of relying on the crate directly.
Are we interested in parametrizing these algorithms by different distance functions or do we only care about Euclidean/square Euclidean distance?
You could add a simple enum with only a single variant for Euclidean distances. If people are interested in others, they could just extend it in the future.
Should the distance function be parametrized at compile-time (via a type parameter) or run-time (via an enum)? Run-time parametrization might add a function-call overhead so it may not be ideal. Also, do we want to abstract over all nearest neighbour functions with a trait?
I added some comments in #120. One way to solve the type vs. enum parametrization is to have both. You could add a distance trait:
```rust
trait CustomDistance<F: Float> {
    // Takes `&self` so implementors can carry runtime data (e.g. the `p` of
    // a p-norm); both arguments are single points, hence `ArrayView1`.
    fn distance<'a>(&self, a: ArrayView1<'a, F>, b: ArrayView1<'a, F>) -> F;
}
```
and accept that in your type definitions. An enum would define common distance functions and also implement the CustomDistance trait:
```rust
enum CommonDistances<F> {
    OneNorm,
    TwoNorm,
    PNorm(F),
    InfNorm,
    //...
}
```
```rust
impl<F: Float> CustomDistance<F> for CommonDistances<F> {
    //...
}
```
The default parameter would then return CommonDistances::TwoNorm. I doubt that you would run into performance issues when querying for points with a branching distance function, but if in doubt you could measure the performance gain compared to a custom distance object.
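For illustration, a minimal sketch (not actual linfa code) of how the enum might implement the trait, assuming `Float` comes from the num-traits crate and `ArrayView1` from ndarray:

```rust
use ndarray::ArrayView1;
use num_traits::Float;

impl<F: Float> CustomDistance<F> for CommonDistances<F> {
    fn distance<'a>(&self, a: ArrayView1<'a, F>, b: ArrayView1<'a, F>) -> F {
        let diffs = a.iter().zip(b.iter()).map(|(&x, &y)| x - y);
        match self {
            // L1: sum of absolute differences
            CommonDistances::OneNorm => diffs.map(|d| d.abs()).fold(F::zero(), |s, d| s + d),
            // L2: square root of the sum of squared differences
            CommonDistances::TwoNorm => {
                diffs.map(|d| d * d).fold(F::zero(), |s, d| s + d).sqrt()
            }
            // Minkowski p-norm
            CommonDistances::PNorm(p) => diffs
                .map(|d| d.abs().powf(*p))
                .fold(F::zero(), |s, d| s + d)
                .powf(F::one() / *p),
            // L-infinity: maximum absolute difference
            CommonDistances::InfNorm => diffs.map(|d| d.abs()).fold(F::zero(), F::max),
        }
    }
}
```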
Thanks to #120, linfa-nn has been created and merged. The following are possible future enhancements.

Algorithms implemented: linear search, KD-tree, and ball tree.

Possible additions:
- R*-tree: the rstar crate isn't compatible with ndarray, so we'll have to modify and integrate it into our code.
- Approximate nearest neighbour: the linfa-nn interface should work fine regardless of how precise the implementation is.

Miscellaneous:
- from_batch() should accept ArrayView2 rather than Array2 as input, since Array2 can be converted into ArrayView2 but not the other way around. Update: generic parameters on NearestNeighbour are no longer supported and have been replaced with enums. As a result from_batch() now accepts ArrayBase, which is more flexible than before.

Other crates:
- Integrate linfa-nn into DBSCAN (linfa-clustering).
- Integrate linfa-nn into Approximate DBSCAN (linfa-clustering).
- Integrate linfa-nn into kernel and SVM functions (linfa-svm, linfa-kernel, to_gaussian_similarity in linfa-reduction).
At the moment linfa-nn doesn't implement any trait from the main Linfa crate. We considered using Fit to abstract over the from_batch() trait method, but from_batch() requires that the input dataset shares the same lifetime as the output index, which is not a restriction expressed by Fit. We also considered using Transformer to return memoized versions of the k_nearest and within_range queries. Transformer would return the embedding, which is a square matrix with either k non-zero entries or entries with less-than-epsilon distance for each point. The problem is that the parameters "k" and "epsilon" would have to be defined before calling transform(), while in the current code they are simply passed into the queries. See this comment for more details.
Approximate DBSCAN would benefit from using an NN algorithm instead of the recursive search it uses now (https://github.com/rust-ml/linfa/blob/master/algorithms/linfa-clustering/src/appx_dbscan/cells_grid/cell.rs at line 180). I had a (very quick) look at the current NN implementation and it should be doable, similarly to how I did it here at line 103 (sorry for the comments in Italian). I had planned to do it in the near future but I'm quite busy right now, so if somebody else wants to do it, go ahead 👍🏻
I admit I did not look at the NN code in much depth, but it seems to me like the kd-tree implementation could be used.
I wanted to comment on this to discuss interoperability with rust-cv, but also to show what we have done so far on this task.
For abstractions over NN algorithms, we are currently using this crate: https://docs.rs/space/0.17.0/space/. The abstractions here are designed assuming you have a pre-parameterized nearest neighbor search data structure. There are traits for metrics, knn search, knn point sets, knn maps (where points are keys that correspond to values), and inserting things into knn maps.
While I haven't promoted it very publicly yet, I have created an alternative to HNSW here: https://github.com/rust-cv/hgg. hgg is on crates.io and implements the traits from the space crate. hgg has performed better for me than HNSW on the datasets I have worked with. I have primarily benchmarked it with binary features, which are probably not something linfa works with much, but I would love it if some people in the Rust ML community would perform some benchmarks. I will attach some benchmark results as a CSV file to this message:
recall_insert_knn_64.csv recall_refactor_hgg_lite.csv
These benchmarks were run on AKAZE feature descriptors (486 bits each) padded with zeros to 512 bits. The features were extracted from the KITTI dataset as explained in the README. All benchmarks were performed with one thread on an Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz. Based on what I have seen, I recommend this crate over the HNSW crate.
I would also like to see if we can standardize our method of creating a metric space. The space crate explicitly uses unsigned integer values, chosen by the Metric implementation: https://docs.rs/space/0.17.0/space/trait.Metric.html. The reason for this is that floats can be converted to unsigned integers via .to_bits(), and their order is perfectly preserved, so long as they are all positive floating point numbers (even positive infinity is permitted, but NaN is not). Also, all ordered positive distances can be mapped to the range of unsigned integers, provided you choose a sufficiently large unsigned integer. I could see the need for an additional trait for spaces that are more specific than metric spaces, but for metric spaces I think that representing distances with the unsigned integers is reasonable considering the above.
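To make the ordering claim concrete, a tiny self-contained check (plain std, no external crates):

```rust
// For non-negative, non-NaN IEEE-754 floats (positive infinity included),
// `to_bits` preserves ordering.
fn main() {
    let distances = [0.0f64, 0.25, 1.5, 1e300, f64::INFINITY];
    for pair in distances.windows(2) {
        assert!(pair[0] < pair[1]);
        assert!(pair[0].to_bits() < pair[1].to_bits());
    }
}
```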
Would linfa be interested in using the space crate for the abstraction over metric spaces and knn? Any thoughts? I just kind of threw the things I was thinking about out here to see what everyone had to say.
linfa uses floats over ints in its algorithms because integer division error can cause correctness issues, in addition to floats being the industry standard in other ML libraries. linfa-nn uses floats to interoperate with the rest of linfa. Regarding NaN issues, we use noisy_float in linfa-nn to catch all instances of NaN in debug builds (this is too expensive to do for release builds). Also, we need the algorithms to work for negative floats, which can't just be converted to unsigned ints. Thus I would still like to keep the current version of the NN algorithms on floats, but we could generalize the algorithms across both ints and floats.
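For illustration, the noisy_float pattern referred to here looks roughly like this:

```rust
// `n64` rejects NaN with debug assertions (compiled out in release builds),
// and `N64` implements `Ord`, so distances sort directly, unlike raw `f64`.
use noisy_float::prelude::*;

fn main() {
    let mut distances = vec![n64(2.0), n64(0.5), n64(1.25)];
    distances.sort_unstable(); // fine: N64 is totally ordered
    assert_eq!(distances[0], n64(0.5));
    // n64(f64::NAN) would panic in a debug build.
}
```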
Ideally I'd like to reexport traits from space instead of rolling our own for NN, but there are a few things missing from space that we need in linfa. The biggest blocker is the lack of range queries, which we use to speed up DBSCAN. There's also no trait for constructing a kNN index from a batch of points, though that's something we can roll ourselves. Additionally, linfa-nn algorithms work entirely with ArrayViews, which are reference types, so using them with space traits may cause some weird lifetime errors (not too sure about this one).
> linfa uses floats over ints in its algorithms because integer division error can cause correctness issues, in addition to floats being the industry standard in other ML libraries. linfa-nn uses floats to interoperate with the rest of linfa. Regarding NaN issues, we use noisy_float in linfa-nn to catch all instances of NaN in debug builds (this is too expensive to do for release builds). Also, we need the algorithms to work for negative floats, which can't just be converted to unsigned ints. Thus I would still like to keep the current version of the NN algorithms on floats, but we could generalize the algorithms across both ints and floats.
Yes, on this front, it sounds like you have some algorithms that don't generalize to discrete spaces, but you might have some that do. It might make sense then for me to see if we can create a more specific abstraction for these floating point metrics. For discrete spaces, I am doing NN search with algorithms that perform no division. The algorithms only perform comparisons, and the only requirement is that the triangle inequality is fulfilled. For algorithms where you are subdividing space in ways that assume it can be partitioned infinitely, we might want to just create a separate and more-specific trait for those spaces. I will probably need to dig into what is in linfa today to see your requirements more exactly.
> The biggest blocker is the lack of range queries, which we use to speed up DBSCAN.
We can make this happen with additional traits. Again, we might also need abstractions for general "metric spaces", but we can do some digging to figure that out.
> There's also no trait for constructing a kNN index from a batch of points, though that's something we can roll ourselves.
In this case I would suggest allowing the user to provide a closure that returns a new instance of the Knn data structure so you can insert the points into it. If you just mean use of FromIterator for using .collect() like a HashMap, that could just be implemented on the data structure if it implements Default and supports that. Either way, I think this can be implemented downstream, and I don't think a shared trait is needed here, but let me know if you think this is specifically needed.
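A rough illustration of the closure-based construction suggested here (all names hypothetical):

```rust
// The caller supplies a factory for an empty index, and the points are
// added through the standard `Extend` trait.
fn build_index<K: Extend<P>, P>(make_empty: impl FnOnce() -> K, points: Vec<P>) -> K {
    let mut index = make_empty();
    index.extend(points);
    index
}

fn main() {
    // `Vec` stands in for a kNN data structure implementing `Extend`.
    let index = build_index(Vec::new, vec![1.0f64, 2.0, 3.0]);
    assert_eq!(index.len(), 3);
}
```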
> Additionally, linfa-nn algorithms work entirely with ArrayViews, which are reference types, so using them with space traits may cause some weird lifetime errors (not too sure about this one).
I think this one is a bit complicated. It isn't possible to store Array2 as points and get ArrayView2 out when you query the nearest points; you would get &Array2 in this case, where the reference depends on the data structure's lifetime. You can also use ArrayView2 as points, but then the lifetime of the data structure will be tied to your ndarray arrays (as it would be in any case), and you would get back &ArrayView2 when performing searches. In either case, to convert to the type you want, you would need to consume the iterator and convert the outputs into ArrayView, putting them into another location (like a Vec). Part of this has to do with the fact that the lack of the generic associated types (GAT) feature puts hard limitations on what I can do with lifetimes in the trait. If this feature is merged soon, then we can resolve that issue, but it might take longer than is currently anticipated by the authors; it seems they currently believe they can get it done before the end of this year. Once that is merged, it will be possible for us to specify iterators that depend on the lifetime of &self when they are borrowed, and we would also be able to specify what type we get back when we perform searches on a particular type using a trait that describes the borrowing (or copying) of that type. We will be further empowered once the existential types feature is complete. For now though, there are several limitations to the implementation due to a lack of type-system features. For instance, it has to return a Vec of references to points because the lifetime of &self has to show up somewhere in the return type, and we don't have access to GATs yet. Basically, I get around these issues by performing more allocation than we should, but I only do it because it has to be done to create an abstraction. Once GATs are merged, these issues go away.
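For illustration, a sketch of the kind of trait GATs would enable (hypothetical names, not the actual space API):

```rust
// With a generic associated type, the returned point type can borrow from
// `&self`, so searches no longer have to hand back a Vec of references with
// the index's lifetime baked into a fixed return type.
trait KnnSearch<Q> {
    type Point<'a>
    where
        Self: 'a;

    // The returned views borrow from the index rather than being copied out.
    fn knn<'a>(&'a self, query: &Q, k: usize) -> Vec<Self::Point<'a>>;
}
```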
I can perform some preliminary work to see if we can create a cohesive set of traits that will work for all of the spatial queries that linfa needs to do in its clustering searches. You mentioned clustering algorithms, so I will look there (particularly at DBSCAN) and at the existing nearest neighbor search algorithms, but is there anywhere else in linfa that nearest neighbor is being used that I should also take a look at? I would like to make sure that I am able to cover all the bases we can. If there is something we can't write a trait/abstraction for today, then we could just choose a concrete nearest neighbor algorithm for that specific instance, but hopefully we can work out abstractions for everything.
For discrete spaces you can simply convert your float data into equivalent integer data and feed it into the existing traits/algorithms (this requires the traits to be generalized across floats and ints). The algorithms and interface shouldn't change between discrete and continuous spaces, so I don't think we need new traits. Precision issues with ints actually go beyond just division and partitioning, since errors also happen with square rooting. This means many distance metrics, such as Euclidean distance, don't work with ints/discrete spaces. From my understanding the difference between discrete and non-discrete spaces lies in whether the underlying data are integers, so we could get clever with our bounds and disallow the use of non-float types for continuous-space algorithms and distance metrics.
The reason why I wanted batch construction is because some kNN data structures, such as the tree-based ones in linfa, perform much better when constructed with all the points at once instead of inserting the points one-by-one. This is useful in DBSCAN, where all the points are known beforehand. We can abstract this by having a trait called something like KnnBatch with a constructor method that takes a batch of points. By default it inserts the points one-by-one, but the relevant data structures can implement their own batch constructors.
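A minimal sketch of such a trait (names hypothetical; `insert` mirrors a space-style insertion method):

```rust
trait KnnBatch: Default {
    type Point;

    fn insert(&mut self, point: Self::Point);

    // Default: insert one-by-one. Tree-based structures (KD-tree, ball
    // tree) can override this with a more efficient bulk build.
    fn from_batch<I: IntoIterator<Item = Self::Point>>(points: I) -> Self {
        let mut index = Self::default();
        for p in points {
            index.insert(p);
        }
        index
    }
}
```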
Regarding lifetimes, our current algorithms don't store Array2s at all. They treat each row in the data as a single point, and store references to those points (ArrayView1s) in a tree structure. The lifetimes of the stored points are tied to the input data, but the lifetime of query points isn't tied to that of the stored points. In fact, the lifetimes of the two points don't need to be related when calculating distance. I'm not sure if this type of decoupling is possible in space without GATs.
Currently the only place in linfa using kNN algorithms is DBSCAN. We do use our abstract distance metric trait in both DBSCAN and KMeans, and are planning to integrate it into some other algorithms.
> For discrete spaces you can simply convert your float data into equivalent integer data and feed it into the existing traits/algorithms (this requires the traits to be generalized across floats and ints). The algorithms and interface shouldn't change between discrete and continuous spaces, so I don't think we need new traits. Precision issues with ints actually go beyond just division and partitioning, since errors also happen with square rooting. This means many distance metrics, such as Euclidean distance, don't work with ints/discrete spaces. From my understanding the difference between discrete and non-discrete spaces lies in whether the underlying data are integers, so we could get clever with our bounds and disallow the use of non-float types for continuous-space algorithms and distance metrics.
What I would like to point out here is that there are very specific metric spaces you are working with in this case. For instance, you might have some kNN data structures that are specialized in working with Euclidean space specifically. In that case, we should probably have a EuclideanMetric trait that does use a float as the unit of distance. However, for kNN data structures that abstract over any metric space (like HNSW), the only thing they rely upon is the triangle inequality, so they can't make assumptions about the underlying space that would allow them to perform a division or square root operation, which is why the integer is a better unit of distance in that case. If there are more specific requirements that kNN data structures might have, we would want to create more specific traits for each one of those scenarios and then provide blanket implementations of those traits for more general kNN data structures (like HNSW) that use the general trait Metric.
> The reason why I wanted batch construction is because some kNN data structures, such as the tree-based ones in linfa, perform much better when constructed with all the points at once instead of inserting the points one-by-one. This is useful in DBSCAN, where all the points are known beforehand. We can abstract this by having a trait called something like KnnBatch with a constructor method that takes a batch of points. By default it inserts the points one-by-one, but the relevant data structures can implement their own batch constructors.
I think it would be valuable to use an abstraction here, but I think that the existing Rust traits Extend and FromIterator would allow you to abstract over adding these points. If it makes more sense to add a specific trait to space, we can do that, but I think that these should suffice.
> Regarding lifetimes, our current algorithms don't store Array2s at all. They treat each row in the data as a single point, and store references to those points (ArrayView1s) in a tree structure. The lifetimes of the stored points are tied to the input data, but the lifetime of query points isn't tied to that of the stored points. In fact, the lifetimes of the two points don't need to be related when calculating distance. I'm not sure if this type of decoupling is possible in space without GATs.
The traits in space do not currently support this paradigm, but it makes sense and is important. We might need to have separate types for the returned borrowed points and the inserted points, but, as you mention, I don't think we can do that without GATs, which is unfortunate. I don't think I have a solution off the top of my head to this problem, so I will spend some more time later thinking about how we might deal with this scenario. We might just need to wait for GATs to stabilize.
> Currently the only place in linfa using kNN algorithms is DBSCAN. We do use our abstract distance metric trait in both DBSCAN and KMeans, and are planning to integrate it into some other algorithms.
Perfect. I will take a look at those then. Thanks!
> I think it would be valuable to use an abstraction here, but I think that the existing Rust traits Extend and FromIterator would allow you to abstract over adding these points. If it makes more sense to add a specific trait to space, we can do that, but I think that these should suffice.
Actually, I am going to go back on this. I can add an insert_multiple method to the existing KnnInsert trait that has a default implementation that simply calls insert on every item. I don't see why we shouldn't have that.
> What I would like to point out here is that there are very specific metric spaces you are working with in this case. For instance, you might have some kNN data structures that are specialized in working with Euclidean space specifically. In that case, we should probably have a EuclideanMetric trait that does use a float as the unit of distance. However, for kNN data structures that abstract over any metric space (like HNSW), the only thing they rely upon is the triangle inequality, so they can't make assumptions about the underlying space that would allow them to perform a division or square root operation, which is why the integer is a better unit of distance in that case. If there are more specific requirements that kNN data structures might have, we would want to create more specific traits for each one of those scenarios and then provide blanket implementations of those traits for more general kNN data structures (like HNSW) that use the general trait Metric.
I think if a specific algorithm/metric only works with specific number types, we can express that restriction on the type level rather than the trait level. For example, if HNSW only works with unsigned ints, then the type bounds should be Hnsw<I: Unsigned>, which would prevent HNSW from being used with floats. For something like Euclidean distance, the type bounds would be Euclidean<F: Float>. If an algorithm does not have either restriction, then we can relax the bounds and allow both ints and floats. Once we restrain the types we can use the same set of traits, since we have already eliminated "invalid" parametrizations such as HNSW with floats.

> I think it would be valuable to use an abstraction here, but I think that the existing Rust traits Extend and FromIterator would allow you to abstract over adding these points. If it makes more sense to add a specific trait to space, we can do that, but I think that these should suffice.

The reason I want to use a separate abstraction is because for some data structures, the algorithm for constructing from a batch of data is different from inserting data one-by-one. In those cases we can't just compose FromIterator with insert.

> The traits in space do not currently support this paradigm, but it makes sense and is important. We might need to have separate types for the returned borrowed points and the inserted points, but, as you mention, I don't think we can do that without GATs, which is unfortunate. I don't think I have a solution off the top of my head to this problem, so I will spend some more time later thinking about how we might deal with this scenario. We might just need to wait for GATs to stabilize.

Actually, if we look at the Add trait, it treats LHS and RHS as separate generic types, but RHS defaults to the type of LHS. Maybe we can do something like that for the Metric traits?
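A sketch of what that could look like (hypothetical trait, not the current space API; Q defaults to P just as Add's RHS defaults to Self):

```rust
// Stored point type `P` and query point type `Q` are independent generics,
// mirroring `Add<Rhs = Self>`.
trait Distance<P, Q = P> {
    type Unit: Ord + Copy;

    // The two point types (and hence their lifetimes) need not be related.
    fn distance(&self, stored: &P, query: &Q) -> Self::Unit;
}
```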
I just took another look at Metric and it looks like only the output of the distance function is converted to int; the input can still be floats. The only operations we do with distance values in Linfa are comparisons and basic arithmetic (e.g. subtractions), and I'm not sure how well ints work in that case.
Also, if the distance function actually produces a NaN, Metric would still convert that NaN into an int, which causes an error in the algorithm. I don't see how converting distances to ints shifts the burden of eliminating NaNs onto the user, since NaN-related bugs will still pop up when we run the algorithm. I'd argue that NaN bugs are the burden of users regardless of how we represent distances, since our algorithms assume that the data has no NaN. Integer distances don't help with maintaining the triangle inequality because they, like float distances, break down when a NaN is produced.
> I don't see how converting distances to ints shifts the burden of eliminating NaNs onto the user, since NaN-related bugs will still pop up when we run the algorithm. I'd argue that NaN bugs are the burden of users regardless of how we represent distances, since our algorithms assume that the data has no NaN.
The user should panic if they see a distance which is NaN, since that would make the output of the data structure invalid. If the user knows their data has no NaN in it, then they can skip this step, for instance if they performed validation earlier. As you said, using float or int in this scenario doesn't provide us with any benefit.
> The only operations we do with distance values in Linfa are comparisons and basic arithmetic (e.g. subtractions), and I'm not sure how well ints work in that case.
So long as the index supports arbitrary metric spaces (not just L2, for instance, but even Hamming distance or Jaccard distance (1.0 - Jaccard similarity)), then all distances of all precisions can be mapped to the integers given a sufficiently sized integer. For algorithms which don't operate over all metric spaces, more specific units of distance can be used, like f64. Unfortunately, we cannot sort f64 directly either, as f64 does not implement the Ord trait in Rust, since floats are not totally ordered. However, since the integers are ordered, they can be used without any extra considerations that the data structure has to make when sorting. hgg takes advantage of this fact, and sorts distances in many places.
Nearest neighbor algorithms which can use BOTH f64 and u64 should simply use u64, as the unsigned integer case provides equivalent ordering and is thus an abstraction over the concept. So, for instance, hgg uses the unsigned integer type.
However, for nearest neighbor algorithms which only work on a subset of those metric spaces, or which rely on a property of f64, we can make a separate Metric trait, for instance L2Metric, which requires the distance type to be f64. If your algorithm requires f64, then it is likely not abstracting over all metric spaces (which satisfy the triangle inequality), and so we should create a more specific trait so that users know to implement that specific trait if they have a more specific metric.
For instance: if you have a data structure that depends on a distance metric that relies on L2 distances for the underlying space, then it takes a metric that satisfies L2Metric. However, we will also add blanket impls like impl<T> Metric for T where T: L2Metric. The reason we do this is so that the more specific metric ALSO supports data structures which support more general metrics. The implementation could look like:
```rust
trait FloatToPositiveIntegerOrdering {
    type Int: Unsigned + Ord + Copy;
    fn to_positive_integer_ordering(self) -> Self::Int;
}

impl<P, T, F> Metric<P> for T
where
    T: L2Metric<P, Unit = F>,
    F: FloatToPositiveIntegerOrdering,
{
    type Unit = F::Int;

    fn distance(&self, a: &P, b: &P) -> Self::Unit {
        L2Metric::distance(self, a, b).to_positive_integer_ordering()
    }
}
```
Now, so long as everybody correctly implements the correct L2Metric, L1Metric, etc. traits, everyone should be happy. Nobody has to do any extra work, and all of the floats get automatically converted. Every data structure which performs faster with integers can utilize them. Every data structure that needs floats gets them as well.
Ordering floats isn't actually an issue because there are zero-cost abstractions that implement Ord for floats, which we currently use (e.g. noisy_float). Like with integers, those abstractions only break down when encountering a NaN, so ints and floats are equally valid when it comes to ordering/sorting. Are there any other reasons to use float-to-int conversions? You mentioned that some algorithms perform faster; can you elaborate?
> Ordering floats isn't actually an issue because there are zero-cost abstractions that implement Ord for floats, which we currently use (e.g. noisy_float).
If one data structure uses noisy_float, another uses float-ord, and another uses https://doc.rust-lang.org/std/primitive.f64.html#method.total_cmp, then we have inconsistent concepts of ordering with floats, and every data structure will interpret their order differently. We would expect every spatial data structure to return the nearest neighbors in the same order, and for them to use the same ordering. Floats have no defined ordering, so we have to determine that ordering at a specific point. I will discuss further down how we can allow floating metrics while also allowing the metric to determine its own ordering, which should deal with this problem.
> Like with integers, those abstractions only break down when encountering a NaN, so ints and floats are equally valid when it comes to ordering/sorting.
I agree with this. We need to eliminate NaNs from floats, and I suggest that we do this as soon as possible. Rather than checking this repeatedly and wrapping floats in the downstream implementation to specially treat the ordering of NaNs, this responsibility is moved to the implementer of the trait, who must ensure that there are no NaNs.
I would also note that this does not mean that floats are inherently better than integers, it only means that when it comes to NaN that int and float are equally valid in terms of enforcing the triangle inequality assuming that the user is responsible for filtering out NaNs either way.
> Are there any other reasons to use float-to-int conversions?
This allows the downstream crate to assume a total ordering (Ord), which I would expect for a metric, and it makes their implementation simpler. Any data structure which abstracts over all metric spaces will have simpler code with unsigned integers than with floats. I can guarantee this since, in an arbitrary metric space, the only thing you can rely on is the triangle inequality, therefore the only operations you can use on floats can also be done on integers. Some operations cannot be done on raw floats because they lack the Ord trait (like matching on a.cmp(&b) or sort_unstable()). By making this conversion, it will make all downstream implementations simpler, since they can rely on the Ord trait to work properly.
An alternative to this would be that we could remove the Unsigned bound on the Metric trait, and instead bound it with Ord + Copy + Zero. In the HGG crate, I regularly rely on the fact that points that occupy the same space have a metric distance of zero: https://github.com/rust-cv/hgg/blob/main/src/lib.rs#L1384. If we bound the distance type with Ord + Copy + Zero, then the person implementing the distance can choose the ordering of the float themselves, and we can still abstract over the integers without needing to convert them to floats. So perhaps this is the solution that will work best for us. This way we can still utilize NoisyFloat. Note that NoisyFloat does implement the Zero trait, which means it will satisfy the trait bound Ord + Copy + Zero.
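A sketch of the resulting trait shape under that proposal (modelled on the space crate's Metric, with the Unsigned bound swapped for Zero from num-traits):

```rust
// A NoisyFloat such as N64 satisfies all three bounds, so float-based
// metrics fit without an integer conversion.
use num_traits::Zero;

trait Metric<P> {
    type Unit: Ord + Copy + Zero;

    fn distance(&self, a: &P, b: &P) -> Self::Unit;
}
```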
> You mentioned that some algorithms perform faster; can you elaborate?
If the above idea of allowing any distance type that implements Ord + Copy + Zero is sufficient, I think we don't need to get into this specific topic. However, if you would like to see benchmarks showing the difference in performance between float and integer distance units, I can set my recall benchmark for HGG here to test both float and int cases (with integers being converted to floats), then we can compare the two on equal footing on a real-world dataset that exercises fast distance calculation and data locality (and thus will stress the comparison operation).
Let me know if the type Unit: Ord + Copy + Zero (with Zero coming from here) solution will work for you. I know that it doesn't support f64 directly, but it will allow you to set the unit to a wrapper that ensures total ordering and produces the unit of distance for two things which are equal to each other. If this is good, I think I can make this change for you pretty quickly.
Ord + Copy + Zero looks perfect. In the implementations we should make sure that NaN checks only happen in debug mode, not release mode, for perf reasons.
> Ord + Copy + Zero looks perfect. In the implementations we should make sure that NaN checks only happen in debug mode, not release mode, for perf reasons.
Yep, that seems reasonable to me. Here is my update to the docs and the trait bound: https://github.com/rust-cv/space/blob/d47ba8f56ffb85f16f8b24e9312d26113d9449ce/src/lib.rs#L17
Let me know if this looks good, and if so I can update everything. We should also add traits for specific metric spaces as well that you are relying on (such as L2).
Wait, why do we still need separate traits for L2 metric spaces? Are there algorithms that only work for L2 spaces? The algorithms in Linfa only care whether metrics satisfy the triangle inequality.
> Wait, why do we still need separate traits for L2 metric spaces? Are there algorithms that only work for L2 spaces? The algorithms in Linfa only care whether metrics satisfy the triangle inequality.
You had mentioned dividing the metric before, so I assumed that you were working with a specific norm. I am not sure that division of the distance metric works in all spaces. Was that only hypothetical, or is there a situation where you are dividing in Linfa? If so, I would like to see an example, because I am not sure it would work for Hamming space, for instance. We might need to figure out exactly what it is that is required by the data structure.
The division was for finding the median in a set of points, which doesn't actually have anything to do with distance metrics. I mentioned division before because I thought you wanted to convert all the floats into ints. We also have a subtraction of distance values; specifically, we compute distance-from-center-of-circle - radius = distance-to-edge. Are operations like these supported by only a subset of spaces/metrics?
> The division was for finding the median in a set of points, which doesn't actually have anything to do with distance metrics. I mentioned division before because I thought you wanted to convert all the floats into ints. We also have a subtraction of distance values; specifically, we compute distance-from-center-of-circle - radius = distance-to-edge. Are operations like these supported by only a subset of spaces/metrics?
Subtraction by a radius is perfectly valid, so long as you consider the less than vs less than or equal to cases.
Computing the median by dividing seems like it might cause some problems. Typically, the way I compute the median roughly is that I collect all of the items into a Vec, sort it, and take the item at index length / 2. I understand that this is not the true median if you have two distances in the middle of the vector, but in arbitrary metric spaces the average of two distances is somewhat arbitrary. Is this the reason why the division is required? We might be able to create a trait bound that could help find the average of two distances.
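For illustration, a sketch of that sort-based rough median (hypothetical helper; it only needs Ord, so it works for any distance unit):

```rust
fn rough_median<T: Ord + Copy>(mut distances: Vec<T>) -> Option<T> {
    if distances.is_empty() {
        return None;
    }
    distances.sort_unstable();
    // Not the true median for even-length inputs, as noted above.
    Some(distances[distances.len() / 2])
}

fn main() {
    assert_eq!(rough_median(vec![5u64, 1, 9, 3, 7]), Some(5));
}
```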
I am going to expand the trait bounds on the distance type to require the types to support addition and subtraction for that use case you mentioned with radius (I assume ball tree?). Just let me know whether you need division as well and what for, and then we can figure out what additional trait bounds might be needed.
I just checked the ball tree code (which I wrote a while ago) and you're right that the median operation doesn't use division. I mixed it up with another implementation, which uses midpoints instead of medians. However, there's an operation that averages several points to find the center of a sphere. This requires division on the points, and it may be a good idea to restrict the algorithm to floats to prevent rounding errors. The distances only need to support add/subtract. I'm not sure if any arithmetic type bounds need to be expressed in space, since they can just be placed in the ball tree type signature. On another note, the library we use for KD trees only accepts floats, probably because it also uses division.
> However, there's an operation that averages several points to find the center of a sphere.
Based on what I am reading on the Wikipedia page, I think that this would be a specific implementation of the ball tree that relies on having f32 vectors as points in L2 space, and the central point could even be chosen by other processes (such as the point closest to the geometric median). You might be able to relax this floating-point requirement so that integers can also be used, by setting the radius of the ball based on the median and choosing the center of the ball from an existing point in the data set using a process other than the mean.
I would also add that if the ball center is chosen by the centroid, then that assumes that the underlying metric space uses an L2 distance metric. See here for more details: https://en.wikipedia.org/wiki/Centroid#Of_a_finite_set_of_points. The centroid minimizes the sum of squared euclidean distances specifically, and so this would not be relevant for other metric spaces.
If you would like to generalize the ball tree to arbitrary metric spaces, one option could be to find the point which minimizes the sum of the distances to all other points, which would be closest to the geometric median. The only problem with this algorithm is that summing a large number of distances could overflow an unsigned integer, so if you would be interested in that, we will need to come up with an additional trait bound that allows two iterators over distances to be compared by their sums, to work around the overflow problem. For floats we would probably want to do this too anyway, since we should probably use a Kahan sum. Additionally, you could choose the center of the ball randomly. Either way, I just want to point out that taking the average to compute the centroid is specific to L2. It would not work on, for instance, strings using Levenshtein distance. However, you could create a ball tree over strings using Levenshtein distance if the center point is chosen using a different method, such as minimizing the sum of distances.
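As a sketch of that center-selection idea (hypothetical helper, not part of space or linfa; u64 sums are used for simplicity, though as noted above summing many distances could overflow a smaller unit type):

```rust
// Choose the point minimizing the sum of distances to all other points,
// i.e. a medoid approximating the geometric median. O(n^2), illustrative.
fn medoid_index<P>(points: &[P], dist: impl Fn(&P, &P) -> u64) -> Option<usize> {
    (0..points.len())
        .min_by_key(|&i| points.iter().map(|q| dist(&points[i], q)).sum::<u64>())
}

fn main() {
    let points = [0i64, 1, 2, 3, 100];
    let center = medoid_index(&points, |a, b| a.abs_diff(*b));
    assert_eq!(center, Some(2)); // the value 2 minimizes the summed distances
}
```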
Basically, if we need an additional trait to handle one of these extra situations, we can do that.
> On another note, the library we use for KD trees only accepts floats, probably because it also uses division.
Yes, this seems reasonable. KD trees over L2 floating vector spaces should probably be specific to floats, whereas KD trees over finite L2 spaces should probably be implemented separately (such as point clouds using integer Morton-encoded coordinates).
KD trees also cannot abstract over all metric spaces, of course, and for them we could define an L2Metric trait (see the previous comment for a rough definition) to be used by the Knn trait implementation.
How would we calculate the geometric median for non-L2 spaces? Do we use an approximation? In that case, how slow would it be? We should probably have different ways to calculate geometric medians/centroids: one for L2 spaces and one for other spaces. It seems possible to generalize ball trees across all types of spaces, so I'm willing to do that within Linfa once the necessary machinery is complete. Also, is it possible to have a non-L2 space with floats? In that case we won't be able to restrict KD trees to L2 spaces by bounding their numerical type to Float, so we'd need different metric traits for different spaces.
BTW, generally speaking, do data structures that support KNN operations also support range queries?
Another neat HNSW implementation https://github.com/InstantDomain/instant-distance
instant-distance only supports f32, so I don't think it fits into our trait.
Nearest neighbour algorithms are an item in the roadmap. They're helpful in clustering algorithms, particularly DBSCAN. There are a number of existing implementations of advanced nearest neighbour algorithms in the Rust ecosystem, but they are not compatible with ndarray, so they'd need to be modified to be included in Linfa.

Existing algorithms I found:
We should pick at least one of the O(logN) algorithms to integrate into Linfa, along with linear search due to its simplicity. I'm not sure how many of these algorithms we really need since I don't know the detailed tradeoffs between them.