YuhanLiin opened 3 years ago
I used the rstar crate in an initial implementation of the approximate DBSCAN algorithm and it was nice to use. The only thing is that it wasn't possible to use it with an arbitrary number of features, so I had to use const generics. Maybe now that const generics are in stable Rust it could be easier to use it inside linfa; it would certainly be interesting to try 👍🏻
For it to be compatible with the rest of Linfa we'd have to make rstar work with ndarray, which uses runtime values as dimensions, so I don't think const generics help there.
This is a good analysis of the state of space-partitioning algorithms at the moment :+1: Do you think that approximate nearest neighbor would also fit into that? I have used the hnsw crate for sparse kernel approximation here https://github.com/rust-ml/linfa/blob/master/algorithms/linfa-kernel/src/sparse.rs#L24 and it worked well.
IMO approximate nearest neighbour is a different problem from nearest neighbour, so it'd have its own dedicated section/module. We could also aggregate/integrate approximate nearest neighbour algorithms into Linfa, but I'm not familiar with the state of approximate neighbours in the ecosystem so I decided to stick to normal nearest neighbours.
Yeah, approximate nearest neighbour should be added behind a feature flag for linfa-nn. This could pull in an additional dependency and add a hyper-parameter to set the recall/speed tradeoff. First we should stick to the space-partitioning algorithms you've listed.
One concern is that rstar (and likely many of the other listed crates as well) only accepts inputs with dimensionality known at compile time. This is incompatible with ndarray, so integrating the algorithm into Linfa would involve putting a reworked version of the code into our repo instead of relying on the crate directly.
Are we interested in parametrizing these algorithms by different distance functions or do we only care about Euclidean/square Euclidean distance?
You could add a simple enum with only a single variant for Euclidean distances. If people are interested in others, they could just extend it in the future.
Should the distance function be parametrized at compile-time (via a type parameter) or run-time (via an enum)? Run-time parametrization might add a function-call overhead so it may not be ideal. Also, do we want to abstract over all nearest neighbour functions with a trait?
I added some comments in #120. One way to solve the type vs. enum parametrization is to have both. You could add a distance trait:
```rust
trait CustomDistance<F: Float> {
    // Takes `&self` so implementors can carry runtime data (e.g. the `p` of
    // a p-norm); both arguments are single points, hence `ArrayView1`.
    fn distance<'a>(&self, a: ArrayView1<'a, F>, b: ArrayView1<'a, F>) -> F;
}
```
and accept that in your type definitions. An enum would define common distance functions and also implement the CustomDistance trait:
```rust
enum CommonDistances<F> {
    OneNorm,
    TwoNorm,
    PNorm(F),
    InfNorm,
    //...
}
```
```rust
impl<F: Float> CustomDistance<F> for CommonDistances<F> {
    //...
}
```
The default parameter would then return CommonDistances::TwoNorm. I doubt that you would run into performance issues when querying for points with a branching distance function, but if in doubt you could measure the performance gain compared to a custom distance object.
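For illustration, a minimal sketch (not actual linfa code) of how the enum might implement the trait, assuming `Float` comes from the num-traits crate and `ArrayView1` from ndarray:

```rust
use ndarray::ArrayView1;
use num_traits::Float;

impl<F: Float> CustomDistance<F> for CommonDistances<F> {
    fn distance<'a>(&self, a: ArrayView1<'a, F>, b: ArrayView1<'a, F>) -> F {
        let diffs = a.iter().zip(b.iter()).map(|(&x, &y)| x - y);
        match self {
            // L1: sum of absolute differences
            CommonDistances::OneNorm => diffs.map(|d| d.abs()).fold(F::zero(), |s, d| s + d),
            // L2: square root of the sum of squared differences
            CommonDistances::TwoNorm => {
                diffs.map(|d| d * d).fold(F::zero(), |s, d| s + d).sqrt()
            }
            // Minkowski p-norm
            CommonDistances::PNorm(p) => diffs
                .map(|d| d.abs().powf(*p))
                .fold(F::zero(), |s, d| s + d)
                .powf(F::one() / *p),
            // L-infinity: maximum absolute difference
            CommonDistances::InfNorm => diffs.map(|d| d.abs()).fold(F::zero(), F::max),
        }
    }
}
```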
Thanks to #120, linfa-nn has been created and merged. The following are possible future enhancements.

Algorithms implemented: linear search, KD-tree, and ball tree.

Possible additions:
- R*-tree: the rstar crate isn't compatible with ndarray, so we'll have to modify and integrate it into our code.
- Approximate nearest neighbour: the linfa-nn interface should work fine regardless of how precise the implementation is.

Miscellaneous:
- from_batch() should accept ArrayView2 rather than Array2 as input, since Array2 can be converted into ArrayView2 but not the other way around. Update: generic parameters on NearestNeighbour are no longer supported and have been replaced with enums. As a result from_batch() now accepts ArrayBase, which is more flexible than before.

Other crates:
- Integrate linfa-nn into DBSCAN (linfa-clustering).
- Integrate linfa-nn into Approximate DBSCAN (linfa-clustering).
- Integrate linfa-nn into kernel and SVM functions (linfa-svm, linfa-kernel, to_gaussian_similarity in linfa-reduction).
At the moment linfa-nn doesn't implement any trait from the main Linfa crate. We considered using Fit to abstract over the from_batch() trait method, but from_batch() requires that the input dataset shares the same lifetime as the output index, which is not a restriction expressed by Fit. We also considered using Transformer to return memoized versions of the k_nearest and within_range queries. Transformer would return the embedding, which is a square matrix with either k non-zero entries or entries with less-than-epsilon distance for each point. The problem is that the parameters "k" and "epsilon" would have to be defined before calling transform(), while in the current code they are simply passed into the queries. See this comment for more details.
Approximate DBSCAN would benefit from using an NN algorithm instead of the recursive search it uses now (https://github.com/rust-ml/linfa/blob/master/algorithms/linfa-clustering/src/appx_dbscan/cells_grid/cell.rs at line 180). I had a (very quick) look at the current NN implementation and it should be doable, similarly to how I did it here at line 103 (sorry for the comments in Italian). I had planned to do it in the near future but I'm quite busy right now, so if somebody else wants to do it, go ahead 👍🏻
I admit I did not look at the NN code in much depth, but it seems to me like the kd-tree implementation could be used.
I wanted to comment on this to discuss interoperability with rust-cv, but also to show what we have done so far on this task.
For abstractions over NN algorithms, we are currently using this crate: https://docs.rs/space/0.17.0/space/. The abstractions here are designed assuming you have a pre-parameterized nearest neighbor search data structure. There are traits for metrics, knn search, knn point sets, knn maps (where points are keys that correspond to values), and inserting things into knn maps.
While I haven't promoted it very publicly yet, I have created an alternative to HNSW here: https://github.com/rust-cv/hgg. hgg is on crates.io and implements the traits from the space crate. hgg has performed better for me than HNSW on the datasets I have worked with. I have primarily benchmarked it with binary features, which are probably not something linfa works with much, but I would love it if some people in the Rust ML community would perform some benchmarks. I will attach some benchmark results as a CSV file to this message:
recall_insert_knn_64.csv recall_refactor_hgg_lite.csv
These benchmarks were run on AKAZE feature descriptors (486 bits each) padded with zeros to 512 bits. The features were extracted from the KITTI dataset as explained in the README. All benchmarks were performed with one thread on an Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz. Based on what I have seen, I recommend this crate over the HNSW crate.
I would also like to see if we can standardize our method of creating a metric space. The space crate explicitly uses unsigned integer values, chosen by the Metric implementation: https://docs.rs/space/0.17.0/space/trait.Metric.html. The reason for this is that floats can be converted to unsigned integers via .to_bits(), and their order is perfectly preserved, so long as they are all positive floating point numbers (even positive infinity is permitted, but NaN is not). Also, all ordered positive distances can be mapped to the range of unsigned integers, provided you choose a sufficiently large unsigned integer. I could see the need for an additional trait for spaces that are more specific than metric spaces, but for metric spaces I think that representing distances with the unsigned integers is reasonable considering the above.
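To make the ordering claim concrete, a tiny self-contained check (plain std, no external crates):

```rust
// For non-negative, non-NaN IEEE-754 floats (positive infinity included),
// `to_bits` preserves ordering.
fn main() {
    let distances = [0.0f64, 0.25, 1.5, 1e300, f64::INFINITY];
    for pair in distances.windows(2) {
        assert!(pair[0] < pair[1]);
        assert!(pair[0].to_bits() < pair[1].to_bits());
    }
}
```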
Would linfa be interested in using the space crate for the abstraction over metric spaces and knn? Any thoughts? I just kind of threw the things I was thinking about out here to see what everyone had to say.
linfa uses floats over ints in its algorithms because integer division error can cause correctness issues, in addition to floats being the industry standard in other ML libraries. linfa-nn uses floats to interoperate with the rest of linfa. Regarding NaN issues, we use noisy_float in linfa-nn to catch all instances of NaN in debug builds (this is too expensive to do for release builds). Also, we need the algorithms to work for negative floats, which can't just be converted to unsigned ints. Thus I would still like to keep the current version of the NN algorithms on floats, but we could generalize the algorithms across both ints and floats.
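For illustration, the noisy_float pattern referred to here looks roughly like this:

```rust
// `n64` rejects NaN with debug assertions (compiled out in release builds),
// and `N64` implements `Ord`, so distances sort directly, unlike raw `f64`.
use noisy_float::prelude::*;

fn main() {
    let mut distances = vec![n64(2.0), n64(0.5), n64(1.25)];
    distances.sort_unstable(); // fine: N64 is totally ordered
    assert_eq!(distances[0], n64(0.5));
    // n64(f64::NAN) would panic in a debug build.
}
```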
Ideally I'd like to reexport traits from space instead of rolling our own for NN, but there are a few things missing from space that we need in linfa. The biggest blocker is the lack of range queries, which we use to speed up DBSCAN. There's also no trait for constructing a kNN index from a batch of points, though that's something we can roll ourselves. Additionally, linfa-nn algorithms work entirely with ArrayViews, which are reference types, so using them with space traits may cause some weird lifetime errors (not too sure about this one).
> linfa uses floats over ints in its algorithms because integer division error can cause correctness issues, in addition to floats being the industry standard in other ML libraries. linfa-nn uses floats to interoperate with the rest of linfa. Regarding NaN issues, we use noisy_float in linfa-nn to catch all instances of NaN in debug builds (this is too expensive to do for release builds). Also, we need the algorithms to work for negative floats, which can't just be converted to unsigned ints. Thus I would still like to keep the current version of the NN algorithms on floats, but we could generalize the algorithms across both ints and floats.
Yes, on this front, it sounds like you have some algorithms that don't generalize to discrete spaces, but you might have some that do. It might make sense then for me to see if we can create a more specific abstraction for these floating point metrics. For discrete spaces, I am doing NN search with algorithms that perform no division. The algorithms only perform comparisons, and the only requirement is that the triangle inequality is fulfilled. For algorithms where you are subdividing space in ways that assume it can be partitioned infinitely, we might want to just create a separate and more-specific trait for those spaces. I will probably need to dig into what is in linfa today to see your requirements more exactly.
> The biggest blocker is the lack of range queries, which we use to speed up DBSCAN.
We can make this happen with additional traits. Again, we might also need abstractions for general "metric spaces", but we can do some digging to figure that out.
> There's also no trait for constructing a kNN index from a batch of points, though that's something we can roll ourselves.
In this case I would suggest allowing the user to provide a closure that returns a new instance of the Knn data structure so you can insert the points into it. If you just mean use of FromIterator for using .collect() like a HashMap, that could just be implemented on the data structure if it implements Default and supports that. Either way, I think this can be implemented downstream, and I don't think a shared trait is needed here, but let me know if you think this is specifically needed.
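A rough illustration of the closure-based construction suggested here (all names hypothetical):

```rust
// The caller supplies a factory for an empty index, and the points are
// added through the standard `Extend` trait.
fn build_index<K: Extend<P>, P>(make_empty: impl FnOnce() -> K, points: Vec<P>) -> K {
    let mut index = make_empty();
    index.extend(points);
    index
}

fn main() {
    // `Vec` stands in for a kNN data structure implementing `Extend`.
    let index = build_index(Vec::new, vec![1.0f64, 2.0, 3.0]);
    assert_eq!(index.len(), 3);
}
```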
> Additionally, linfa-nn algorithms work entirely with ArrayViews, which are reference types, so using them with space traits may cause some weird lifetime errors (not too sure about this one).
I think this one is a bit complicated. It isn't possible to store Array2 as points and get ArrayView2 out when you query the nearest points; you would get &Array2 in this case, where the reference depends on the data structure's lifetime. You can also use ArrayView2 as points, but then the lifetime of the data structure will be tied to your ndarray arrays (as it would be in any case), and you would get back &ArrayView2 when performing searches. In either case, to convert to the type you want, you would need to consume the iterator and convert the outputs into ArrayView, putting them into another location (like a Vec). Part of this has to do with the fact that the lack of the generic associated types (GAT) feature puts hard limitations on what I can do with lifetimes in the trait. If this feature is merged soon, then we can resolve that issue, but it might take longer than is currently anticipated by the authors; it seems they currently believe they can get it done before the end of this year. Once that is merged, it will be possible for us to specify iterators that depend on the lifetime of &self when they are borrowed, and we would also be able to specify what type we get back when we perform searches on a particular type using a trait that describes the borrowing (or copying) of that type. We will be further empowered once the existential types feature is complete. For now though, there are several limitations to the implementation due to a lack of type-system features. For instance, it has to return a Vec of references to points because the lifetime of &self has to show up somewhere in the return type, and we don't have access to GATs yet. Basically, I get around these issues by performing more allocation than we should, but I only do it because it has to be done to create an abstraction. Once GATs are merged, these issues go away.
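For illustration, a sketch of the kind of trait GATs would enable (hypothetical names, not the actual space API):

```rust
// With a generic associated type, the returned point type can borrow from
// `&self`, so searches no longer have to hand back a Vec of references with
// the index's lifetime baked into a fixed return type.
trait KnnSearch<Q> {
    type Point<'a>
    where
        Self: 'a;

    // The returned views borrow from the index rather than being copied out.
    fn knn<'a>(&'a self, query: &Q, k: usize) -> Vec<Self::Point<'a>>;
}
```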
I can perform some preliminary work to see if we can create a cohesive set of traits that will work for all of the spatial queries that linfa needs to do in its clustering searches. You mentioned clustering algorithms, so I will look there (particularly at DBSCAN) and at the existing nearest neighbor search algorithms, but is there anywhere else in linfa that nearest neighbor is being used that I should also take a look at? I would like to make sure that I am able to cover all the bases we can. If there is something we can't write a trait/abstraction for today, then we could just choose a concrete nearest neighbor algorithm for that specific instance, but hopefully we can work out abstractions for everything.
For discrete spaces you can simply convert your float data into equivalent integer data and feed it into the existing traits/algorithms (this requires the traits to be generalized across floats and ints). The algorithms and interface shouldn't change between discrete and continuous spaces, so I don't think we need new traits. Precision issues with ints actually go beyond just division and partitioning, since errors also happen with square rooting. This means many distance metrics, such as Euclidean distance, don't work with ints/discrete spaces. From my understanding the difference between discrete and non-discrete spaces lies in whether the underlying data are integers, so we could get clever with our bounds and disallow the use of non-float types for continuous-space algorithms and distance metrics.
The reason why I wanted batch construction is because some kNN data structures, such as the tree-based ones in linfa, perform much better when constructed with all the points at once instead of inserting the points one-by-one. This is useful in DBSCAN, where all the points are known beforehand. We can abstract this by having a trait called something like KnnBatch with a constructor method that takes a batch of points. By default it inserts the points one-by-one, but the relevant data structures can implement their own batch constructors.
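A minimal sketch of such a trait (names hypothetical; `insert` mirrors a space-style insertion method):

```rust
trait KnnBatch: Default {
    type Point;

    fn insert(&mut self, point: Self::Point);

    // Default: insert one-by-one. Tree-based structures (KD-tree, ball
    // tree) can override this with a more efficient bulk build.
    fn from_batch<I: IntoIterator<Item = Self::Point>>(points: I) -> Self {
        let mut index = Self::default();
        for p in points {
            index.insert(p);
        }
        index
    }
}
```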
Regarding lifetimes, our current algorithms don't store Array2s at all. They treat each row in the data as a single point, and store references to those points (ArrayView1s) in a tree structure. The lifetimes of the stored points are tied to the input data, but the lifetime of query points isn't tied to that of the stored points. In fact, the lifetimes of the two points don't need to be related when calculating distance. I'm not sure if this type of decoupling is possible in space without GATs.
Currently the only place in linfa using kNN algorithms is DBSCAN. We do use our abstract distance metric trait in both DBSCAN and KMeans, and are planning to integrate it into some other algorithms.
> For discrete spaces you can simply convert your float data into equivalent integer data and feed it into the existing traits/algorithms (this requires the traits to be generalized across floats and ints). The algorithms and interface shouldn't change between discrete and continuous spaces, so I don't think we need new traits. Precision issues with ints actually go beyond just division and partitioning, since errors also happen with square rooting. This means many distance metrics, such as Euclidean distance, don't work with ints/discrete spaces. From my understanding the difference between discrete and non-discrete spaces lies in whether the underlying data are integers, so we could get clever with our bounds and disallow the use of non-float types for continuous-space algorithms and distance metrics.
What I would like to point out here is that there are very specific metric spaces you are working with in this case. For instance, you might have some kNN data structures that are specialized in working with Euclidean space specifically. In that case, we should probably have a EuclideanMetric trait that does use a float as the unit of distance. However, for kNN data structures that abstract over any metric space (like HNSW), the only thing they rely upon is the triangle inequality, so they can't make assumptions about the underlying space that would allow them to perform a division or square root operation, which is why the integer is a better unit of distance in that case. If there are more specific requirements that kNN data structures might have, we would want to create more specific traits for each one of those scenarios and then provide blanket implementations of those traits for more general kNN data structures (like HNSW) that use the general trait Metric.
> The reason why I wanted batch construction is because some kNN data structures, such as the tree-based ones in linfa, perform much better when constructed with all the points at once instead of inserting the points one-by-one. This is useful in DBSCAN, where all the points are known beforehand. We can abstract this by having a trait called something like KnnBatch with a constructor method that takes a batch of points. By default it inserts the points one-by-one, but the relevant data structures can implement their own batch constructors.
I think it would be valuable to use an abstraction here, but I think that the existing Rust traits Extend and FromIterator would allow you to abstract over adding these points. If it makes more sense to add a specific trait to space, we can do that, but I think that these should suffice.
> Regarding lifetimes, our current algorithms don't store Array2s at all. They treat each row in the data as a single point, and store references to those points (ArrayView1s) in a tree structure. The lifetimes of the stored points are tied to the input data, but the lifetime of query points isn't tied to that of the stored points. In fact, the lifetimes of the two points don't need to be related when calculating distance. I'm not sure if this type of decoupling is possible in space without GATs.
The traits in space do not currently support this paradigm, but it makes sense and is important. We might need to have separate types for the returned borrowed points and the inserted points, but, as you mention, I don't think we can do that without GATs, which is unfortunate. I don't think I have a solution off the top of my head to this problem, so I will spend some more time later thinking about how we might deal with this scenario. We might just need to wait for GATs to stabilize.
> Currently the only place in linfa using kNN algorithms is DBSCAN. We do use our abstract distance metric trait in both DBSCAN and KMeans, and are planning to integrate it into some other algorithms.
Perfect. I will take a look at those then. Thanks!
> I think it would be valuable to use an abstraction here, but I think that the existing Rust traits Extend and FromIterator would allow you to abstract over adding these points. If it makes more sense to add a specific trait to space, we can do that, but I think that these should suffice.
Actually, I am going to go back on this. I can add an insert_multiple method to the existing KnnInsert trait that has a default implementation that simply calls insert on every item. I don't see why we shouldn't have that.
> What I would like to point out here is that there are very specific metric spaces you are working with in this case. For instance, you might have some kNN data structures that are specialized in working with Euclidean space specifically. In that case, we should probably have a EuclideanMetric trait that does use a float as the unit of distance. However, for kNN data structures that abstract over any metric space (like HNSW), the only thing they rely upon is the triangle inequality, so they can't make assumptions about the underlying space that would allow them to perform a division or square root operation, which is why the integer is a better unit of distance in that case. If there are more specific requirements that kNN data structures might have, we would want to create more specific traits for each one of those scenarios and then provide blanket implementations of those traits for more general kNN data structures (like HNSW) that use the general trait Metric.
I think if a specific algorithm/metric only works with specific number types, we can express that restriction on the type level rather than the trait level. For example, if HNSW only works with unsigned ints, then the type bounds should be Hnsw<I: Unsigned>, which would prevent HNSW from being used with floats. For something like Euclidean distance, the type bounds would be Euclidean<F: Float>. If an algorithm does not have either restriction, then we can relax the bounds and allow both ints and floats. Once we restrain the types we can use the same set of traits, since we have already eliminated "invalid" parametrizations such as HNSW with floats.

> I think it would be valuable to use an abstraction here, but I think that the existing Rust traits Extend and FromIterator would allow you to abstract over adding these points. If it makes more sense to add a specific trait to space, we can do that, but I think that these should suffice.

The reason I want to use a separate abstraction is because for some data structures, the algorithm for constructing from a batch of data is different from inserting data one-by-one. In those cases we can't just compose FromIterator with insert.

> The traits in space do not currently support this paradigm, but it makes sense and is important. We might need to have separate types for the returned borrowed points and the inserted points, but, as you mention, I don't think we can do that without GATs, which is unfortunate. I don't think I have a solution off the top of my head to this problem, so I will spend some more time later thinking about how we might deal with this scenario. We might just need to wait for GATs to stabilize.

Actually, if we look at the Add trait, it treats LHS and RHS as separate generic types, but RHS defaults to the type of LHS. Maybe we can do something like that for the Metric traits?
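A sketch of what that could look like (hypothetical trait, not the current space API; Q defaults to P just as Add's RHS defaults to Self):

```rust
// Stored point type `P` and query point type `Q` are independent generics,
// mirroring `Add<Rhs = Self>`.
trait Distance<P, Q = P> {
    type Unit: Ord + Copy;

    // The two point types (and hence their lifetimes) need not be related.
    fn distance(&self, stored: &P, query: &Q) -> Self::Unit;
}
```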
I just took another look at Metric and it looks like only the output of the distance function is converted to int; the input can still be floats. The only operations we do with distance values in Linfa are comparisons and basic arithmetic (e.g. subtractions), and I'm not sure how well ints work in that case.
Also, if the distance function actually produces a NaN, Metric would still convert that NaN into an int, which causes an error in the algorithm. I don't see how converting distances to ints shifts the burden of eliminating NaNs onto the user, since NaN-related bugs will still pop up when we run the algorithm. I'd argue that NaN bugs are the burden of users regardless of how we represent distances, since our algorithms assume that the data has no NaN. Integer distances don't help with maintaining the triangle inequality because they, like float distances, break down when a NaN is produced.
> I don't see how converting distances to ints shifts the burden of eliminating NaNs onto the user, since NaN-related bugs will still pop up when we run the algorithm. I'd argue that NaN bugs are the burden of users regardless of how we represent distances, since our algorithms assume that the data has no NaN.
The user should panic if they see a distance which is NaN, since that would make the output of the data structure invalid. If the user knows their data has no NaN in it, then they can skip this step, for instance if they performed validation earlier. As you said, using float or int in this scenario doesn't provide us with any benefit.
> The only operations we do with distance values in Linfa are comparisons and basic arithmetic (e.g. subtractions), and I'm not sure how well ints work in that case.
So long as the index supports arbitrary metric spaces (not just L2, for instance, but even Hamming distance or Jaccard distance (1.0 - Jaccard similarity)), then all distances of all precisions can be mapped to the integers given a sufficiently sized integer. For algorithms which don't operate over all metric spaces, more specific units of distance can be used, like f64. Unfortunately, we cannot sort f64 directly either, as f64 does not implement the Ord trait in Rust, since floats are not totally ordered. However, since the integers are ordered, they can be used without any extra considerations that the data structure has to make when sorting. hgg takes advantage of this fact, and sorts distances in many places.
Nearest neighbor algorithms which can use BOTH f64 and u64 should simply use u64, as the unsigned integer case provides equivalent ordering and is thus an abstraction over the concept. So, for instance, hgg uses the unsigned integer type.
However, for nearest neighbor algorithms which only work on a subset of those metric spaces, or which rely on a property of f64, we can make a separate Metric trait, for instance L2Metric, which requires the distance type to be f64. If your algorithm requires f64, then it is likely not abstracting over all metric spaces (which satisfy the triangle inequality), and so we should create a more specific trait so that users know to implement that specific trait if they have a more specific metric.
For instance: if you have a data structure that depends on a distance metric that relies on L2 distances for the underlying space, then it takes a metric that satisfies L2Metric. However, we will also add blanket impls like impl<T> Metric for T where T: L2Metric. The reason we do this is so that the more specific metric ALSO supports data structures which support more general metrics. The implementation could look like:
```rust
trait FloatToPositiveIntegerOrdering {
    type Int: Unsigned + Ord + Copy;
    fn to_positive_integer_ordering(self) -> Self::Int;
}

impl<P, T, F> Metric<P> for T
where
    T: L2Metric<P, Unit = F>,
    F: FloatToPositiveIntegerOrdering,
{
    type Unit = F::Int;

    fn distance(&self, a: &P, b: &P) -> Self::Unit {
        L2Metric::distance(self, a, b).to_positive_integer_ordering()
    }
}
```
Now, so long as everybody correctly implements the correct L2Metric, L1Metric, etc. traits, everyone should be happy. Nobody has to do any extra work, and all of the floats get automatically converted. Every data structure which performs faster with integers can utilize them. Every data structure that needs floats gets them as well.
Ordering floats isn't actually an issue because there are zero-cost abstractions that implement Ord for floats, which we currently use (e.g. noisy_float). Like with integers, those abstractions only break down when encountering a NaN, so ints and floats are equally valid when it comes to ordering/sorting. Are there any other reasons to use float-to-int conversions? You mentioned that some algorithms perform faster; can you elaborate?
> Ordering floats isn't actually an issue because there are zero-cost abstractions that implement Ord for floats, which we currently use (e.g. noisy_float).
If one data structure uses noisy_float, another uses float-ord, and another uses https://doc.rust-lang.org/std/primitive.f64.html#method.total_cmp, then we have inconsistent concepts of ordering with floats, and every data structure will interpret their order differently. We would expect every spatial data structure to return the nearest neighbors in the same order, and for them to use the same ordering. Floats have no defined ordering, so we have to determine that ordering at a specific point. I will discuss further down how we can allow floating metrics while also allowing the metric to determine its own ordering, which should deal with this problem.
> Like with integers, those abstractions only break down when encountering a NaN, so ints and floats are equally valid when it comes to ordering/sorting.
I agree with this. We need to eliminate NaNs from floats, and I suggest that we do this as soon as possible. Rather than checking this repeatedly and wrapping floats in the downstream implementation to specially treat the ordering of NaNs, this responsibility is moved to the implementer of the trait, who must ensure that there are no NaNs.
I would also note that this does not mean that floats are inherently better than integers, it only means that when it comes to NaN that int and float are equally valid in terms of enforcing the triangle inequality assuming that the user is responsible for filtering out NaNs either way.
> Are there any other reasons to use float-to-int conversions?
This allows the downstream crate to assume a total ordering (Ord), which I would expect for a metric, and it makes their implementation simpler. Any data structure which abstracts over all metric spaces will have simpler code with unsigned integers than with floats. I can guarantee this since, in an arbitrary metric space, the only thing you can rely on is the triangle inequality, therefore the only operations you can use on floats can also be done on integers. Some operations cannot be done on raw floats because they lack the Ord trait (like matching on a.cmp(&b) or sort_unstable()). By making this conversion, it will make all downstream implementations simpler, since they can rely on the Ord trait to work properly.
An alternative to this would be that we could remove the Unsigned bound on the Metric trait, and instead bound it with Ord + Copy + Zero. In the HGG crate, I regularly rely on the fact that points that occupy the same space have a metric distance of zero: https://github.com/rust-cv/hgg/blob/main/src/lib.rs#L1384. If we bound the distance type with Ord + Copy + Zero, then the person implementing the distance can choose the ordering of the float themselves, and we can still abstract over the integers without needing to convert them to floats. So perhaps this is the solution that will work best for us. This way we can still utilize NoisyFloat. Note that NoisyFloat does implement the Zero trait, which means it will satisfy the trait bound Ord + Copy + Zero.
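A sketch of the resulting trait shape under that proposal (modelled on the space crate's Metric, with the Unsigned bound swapped for Zero from num-traits):

```rust
// A NoisyFloat such as N64 satisfies all three bounds, so float-based
// metrics fit without an integer conversion.
use num_traits::Zero;

trait Metric<P> {
    type Unit: Ord + Copy + Zero;

    fn distance(&self, a: &P, b: &P) -> Self::Unit;
}
```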
> You mentioned that some algorithms perform faster; can you elaborate?
If the above idea of allowing any distance type that implements Ord + Copy + Zero is sufficient, I think we don't need to get into this specific topic. However, if you would like to see benchmarks showing the difference in performance between float and integer distance units, I can set my recall benchmark for HGG here to test both float and int cases (with integers being converted to floats), then we can compare the two on equal footing on a real-world dataset that exercises fast distance calculation and data locality (and thus will stress the comparison operation).
Let me know if the type Unit: Ord + Copy + Zero (with Zero coming from here) solution will work for you. I know that it doesn't support f64 directly, but it will allow you to set the unit to a wrapper that ensures total ordering and produces the unit of distance for two things which are equal to each other. If this is good, I think I can make this change for you pretty quickly.
Ord + Copy + Zero looks perfect. In the implementations we should make sure that NaN checks only happen in debug mode, not release mode, for perf reasons.
> Ord + Copy + Zero looks perfect. In the implementations we should make sure that NaN checks only happen in debug mode, not release mode, for perf reasons.
Yep, that seems reasonable to me. Here is my update to the docs and the trait bound: https://github.com/rust-cv/space/blob/d47ba8f56ffb85f16f8b24e9312d26113d9449ce/src/lib.rs#L17
Let me know if this looks good, and if so I can update everything. We should also add traits for specific metric spaces as well that you are relying on (such as L2).
Wait, why do we still need separate traits for L2 metric spaces? Are there algorithms that only work for L2 spaces? The algorithms in Linfa only care whether metrics satisfy the triangle inequality.
> Wait, why do we still need separate traits for L2 metric spaces? Are there algorithms that only work for L2 spaces? The algorithms in Linfa only care whether metrics satisfy the triangle inequality.
You had mentioned dividing the metric before, so I assumed that you were working with a specific norm. I am not sure that division of the distance metric works in all spaces. Was that only hypothetical, or is there a situation where you are dividing in Linfa? If so, I would like to see an example, because I am not sure it would work for Hamming space, for instance. We might need to figure out exactly what it is that is required by the data structure.
The division was for finding the median in a set of points, which doesn't actually have anything to do with distance metrics. I mentioned division before because I thought you wanted to convert all the floats into ints. We also have a subtraction of distance values; specifically, we compute distance-from-center-of-circle - radius = distance-to-edge. Are operations like these supported by only a subset of spaces/metrics?
> The division was for finding the median in a set of points, which doesn't actually have anything to do with distance metrics. I mentioned division before because I thought you wanted to convert all the floats into ints. We also have a subtraction of distance values; specifically, we compute distance-from-center-of-circle - radius = distance-to-edge. Are operations like these supported by only a subset of spaces/metrics?
Subtraction by a radius is perfectly valid, so long as you consider the less than vs less than or equal to cases.
Computing the median by dividing seems like it might cause some problems. Typically, the way I compute the median roughly is that I collect all of the items into a Vec, sort it, and take the item at index length / 2. I understand that this is not the true median if you have two distances in the middle of the vector, but in arbitrary metric spaces the average of two distances is somewhat arbitrary. Is this the reason why the division is required? We might be able to create a trait bound that could help find the average of two distances.
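For illustration, a sketch of that sort-based rough median (hypothetical helper; it only needs Ord, so it works for any distance unit):

```rust
fn rough_median<T: Ord + Copy>(mut distances: Vec<T>) -> Option<T> {
    if distances.is_empty() {
        return None;
    }
    distances.sort_unstable();
    // Not the true median for even-length inputs, as noted above.
    Some(distances[distances.len() / 2])
}

fn main() {
    assert_eq!(rough_median(vec![5u64, 1, 9, 3, 7]), Some(5));
}
```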
I am going to expand the trait bounds on the distance type to require the types to support addition and subtraction for that use case you mentioned with radius (I assume ball tree?). Just let me know whether you need division as well and what for, and then we can figure out what additional trait bounds might be needed.
I just checked the ball tree code (which I wrote a while ago) and you're right that the median operation doesn't use division. I mixed it up with another implementation, which uses midpoints instead of medians. However, there's an operation that averages several points to find the center of a sphere. This requires division on the points, and it may be a good idea to restrict the algorithm to floats to prevent rounding errors. The distances only need to support add/subtract. I'm not sure if any arithmetic type bounds need to be expressed in space, since they can just be placed in the ball tree type signature. On another note, the library we use for KD trees only accepts floats, probably because it also uses division.
> However, there's an operation that averages several points to find the center of a sphere.
Based on what I am reading on the Wikipedia page, I think that this would be a specific implementation of the ball tree that relies on having f32 vectors as points in L2 space, and the central point could even be chosen by other processes (such as the point closest to the geometric median). You might be able to relax this floating-point requirement so that integers can also be used, by setting the radius of the ball based on the median and choosing the center of the ball from an existing point in the data set using a process other than the mean.
I would also add that if the ball center is chosen by the centroid, then that assumes that the underlying metric space uses an L2 distance metric. See here for more details: https://en.wikipedia.org/wiki/Centroid#Of_a_finite_set_of_points. The centroid minimizes the sum of squared euclidean distances specifically, and so this would not be relevant for other metric spaces.
If you would like to generalize the ball tree to arbitrary metric spaces, one option could be to find the point which minimizes the sum of the distances to all other points, which would be closest to the geometric median. The only problem with this algorithm is that summing a large number of distances could overflow an unsigned integer, so if you would be interested in that, we will need to come up with an additional trait bound that allows two iterators over distances to be compared by their sums, to work around the overflow problem. For floats we would probably want to do this too anyway, since we should probably use a Kahan sum. Additionally, you could choose the center of the ball randomly. Either way, I just want to point out that taking the average to compute the centroid is specific to L2. It would not work on, for instance, strings using Levenshtein distance. However, you could create a ball tree over strings using Levenshtein distance if the center point is chosen using a different method, such as minimizing the sum of distances.
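As a sketch of that center-selection idea (hypothetical helper, not part of space or linfa; u64 sums are used for simplicity, though as noted above summing many distances could overflow a smaller unit type):

```rust
// Choose the point minimizing the sum of distances to all other points,
// i.e. a medoid approximating the geometric median. O(n^2), illustrative.
fn medoid_index<P>(points: &[P], dist: impl Fn(&P, &P) -> u64) -> Option<usize> {
    (0..points.len())
        .min_by_key(|&i| points.iter().map(|q| dist(&points[i], q)).sum::<u64>())
}

fn main() {
    let points = [0i64, 1, 2, 3, 100];
    let center = medoid_index(&points, |a, b| a.abs_diff(*b));
    assert_eq!(center, Some(2)); // the value 2 minimizes the summed distances
}
```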
Basically, if we need an additional trait to handle one of these extra situations, we can do that.
> On another note, the library we use for KD trees only accepts floats, probably because it also uses division.
Yes, this seems reasonable. KD trees over L2 floating vector spaces should probably be specific to floats, whereas KD trees over finite L2 spaces should probably be implemented separately (such as point clouds using integer Morton-encoded coordinates).
KD trees also cannot abstract over all metric spaces, of course, and for them we could define an L2Metric trait (see the previous comment for a rough definition) to be used by the Knn trait implementation.
How would we calculate the geometric median for non-L2 spaces? Do we use an approximation? In that case, how slow would it be? We should probably have different ways to calculate geometric medians/centroids: one for L2 spaces and one for other spaces. It seems possible to generalize ball trees across all types of spaces, so I'm willing to do that within Linfa once the necessary machinery is complete. Also, is it possible to have a non-L2 space with floats? In that case we won't be able to restrict KD trees to L2 spaces by bounding their numerical type to Float, so we'd need different metric traits for different spaces.
BTW, generally speaking, do data structures that support KNN operations also support range queries?
Another neat HNSW implementation https://github.com/InstantDomain/instant-distance
instant-distance only supports f32, so I don't think it fits into our trait.
Nearest neighbour algorithms are an item in the roadmap. They're helpful in clustering algorithms, particularly DBSCAN. There are a number of existing implementations of advanced nearest neighbour algorithms in the Rust ecosystem, but they are not compatible with ndarray, so they'd need to be modified to be included in Linfa.

Existing algorithms I found:
We should pick at least one of the O(logN) algorithms to integrate into Linfa, along with linear search due to its simplicity. I'm not sure how many of these algorithms we really need since I don't know the detailed tradeoffs between them.