The issue is that Minkowski with p < 1 is not a metric (no triangle inequality). The question, then, is whether we want to support similarity measures that are not metrics.
Yes, I agree it's not a metric. My advisor, who spent his entire life studying distances (Euclidean and non-Euclidean), would say that Minkowski with p < 1 has much value. I have a problem where I need to iterate over different values of p for Minkowski, but there is a way around this issue (for p < 1): I can use a callable for my own Minkowski method and pass a parameter like this.
Here `minkowski_distance` is my method:
```python
mink_p = 0.5
step = 0.1
while mink_p <= 2.5:
    neigh = KNeighborsClassifier(
        n_neighbors=1,
        metric=minkowski_distance,
        metric_params={"minkowski_param": mink_p},
    )
    neigh.fit(X_ref_normalized, y_ref)
    ...
    mink_p += step
    mink_p = round(mink_p, 1)
```
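The callable itself isn't shown in the thread; a minimal sketch of what `minkowski_distance` could look like (assuming, as scikit-learn does for callable metrics, that it receives two 1-D arrays and that `metric_params` is forwarded as keyword arguments):

```python
import numpy as np

def minkowski_distance(x, y, minkowski_param=2.0):
    # Per-pair Minkowski "distance": computable pointwise for any p > 0,
    # even though it is not a true metric for p < 1.
    return np.sum(np.abs(x - y) ** minkowski_param) ** (1.0 / minkowski_param)
```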
This will solve my original problem, and it works for anyone else who wants p < 1 for Minkowski.
I have another very closely related issue with KNeighborsClassifier; if it needs a new issue I can open one.
What about adding a "weights_params" option? In the code above there are "metric" and "metric_params", but for weights there is only weights (for a callable) and no weights_params.
I have a case where I need to iterate over the value of p for the weights method, and I have to do something like this:
```python
for w in [weights_p_5, weights_p1, weights_p1_5, weights_p2, weights_p2_5]:
    neigh = KNeighborsClassifier(n_neighbors=k, weights=w)
    neigh.fit(X_ref_normalized, y_ref)
    ...
```
In this example, `weights_p_5` is a method that computes the weight as `1 / dist**0.5`, and `weights_p1_5` computes it as `1 / dist**1.5`. How about adding a `weights_params` option just like there is for metric (both a callable and a dictionary of parameters to pass)?
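To illustrate, a minimal sketch of what such fixed-exponent weight callables look like today (scikit-learn's `weights` callable receives an array of distances and must return an array of the same shape):

```python
import numpy as np

def weights_p_5(dist):
    # weight = 1 / dist**0.5 (note: exact duplicates give dist == 0 -> inf weight)
    return 1.0 / (dist ** 0.5)

def weights_p1_5(dist):
    # weight = 1 / dist**1.5
    return 1.0 / (dist ** 1.5)
```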
How about including something like this, with a new "weight_param" parameter:
```python
weight_p = 0.5
step = 0.5
while weight_p <= 2.5:
    neigh = KNeighborsClassifier(
        n_neighbors=1,
        weights=my_weight_method,
        weights_params={"weight_param": weight_p},
    )
    neigh.fit(X_ref_normalized, y_ref)
    ...
    weight_p += step
    weight_p = round(weight_p, 1)
```
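As an aside, the proposed `weights_params` does not exist in scikit-learn today, but a closure or `functools.partial` can emulate it; a sketch, with `my_weight_method` defined as assumed here:

```python
from functools import partial

def my_weight_method(dist, weight_param=1.0):
    # parameterized inverse-distance weighting: 1 / dist**weight_param
    return 1.0 / (dist ** weight_param)

# Bind the extra parameter up front, then pass the result as `weights`:
neigh = KNeighborsClassifier(
    n_neighbors=1, weights=partial(my_weight_method, weight_param=0.5)
)
```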
Thanks -Ray
> The issue is that Minkowski with p < 1 is not a metric (no triangle inequality). The question, then, is whether we want to support similarity measures that are not metrics.
That seems like a typical example of where it's best to allow that behaviour but raise a warning.
@raymondj-pace would you mind sharing some references for us to see how this would be useful and how it would be sensible to apply KNN on a non-metric function like this?
One is here: "Combining Minkowski and Chebyshev" (arXiv): https://arxiv.org/pdf/2112.12549
Scroll down to page 69: p = 0.5, 0.75.
The same diagrams for p < 1 are also on Wikipedia: https://en.wikipedia.org/wiki/Minkowski_distance
@scikit-learn/core-devs are we happy with the literature on p<1 here?
I think that I would rather not support it as a `DistanceMetric`, since it is not a distance metric for the reason given by @jeremiedbb above. Note that you might still define your own distance using the `pyfunc` argument to potentially support those cases; see the documentation.
This functionality is available via `scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1)`, and `KNeighborsClassifier` accepts a callable as a parameter.
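For concreteness, a minimal sketch of the `cdist` route mentioned above (array shapes chosen arbitrarily):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.random.rand(5, 3)
Y = np.random.rand(4, 3)
# Pairwise Minkowski values with p = 0.1; computable even though not a metric.
D = cdist(X, Y, "minkowski", p=0.1)  # shape (5, 4)
```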
I'm fine with either the current behaviour (not allowing p<1) or at least throwing a warning.
So should we document this one instead of implementing it?
I am fine with allowing p<1.0 in KNNClassifier without a warning when using the brute-force method, as it's perfectly fine to compute the data points with the lowest pseudo-metric values.
But we should still raise an exception when using ball-tree, because there the triangle inequality is required to return correct results.
The `DistanceMetric` subclasses could expose an additional attribute, `true_metric` or something (with a boolean value), to make this explicit.
> I am fine with allowing p<1.0 in KNNClassifier without a warning when using the brute-force method, as it's perfectly fine to compute the data points with the lowest pseudo-metric values.
Actually, a pseudometric would still satisfy the triangle inequality. It's just that d(x, y) = 0 does not imply the identity x = y.
But the `KNNClassifier` results would still be correct if the neighbors are computed with the brute-force method. I would just document in the docstring that it's not a metric for p<1.0.
> Actually, a pseudometric would still satisfy the triangle inequality.
Actually, Minkowski with p<1 does not satisfy the triangle inequality (take the points (0,0), (1,1), and (0,1), an example from Wikipedia: for p = 0.5, d((0,0),(1,1)) = (1 + 1)^2 = 4, while d((0,0),(0,1)) + d((0,1),(1,1)) = 1 + 1 = 2).
What's our conclusion?
I'll vote for 1., as I haven't seen a convincing use case to allow it, and on top of that it is available via passing `scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1)` to `KNeighborsClassifier`.
> Actually, Minkowski with p<1 does not satisfy the triangle inequality (take the points (0,0), (1,1), and (0,1), an example from Wikipedia).
Yes, I agree. I was correcting my previous comment, where I misused the word "pseudometric".
Still, brute-force kNN is well defined for p<1, so I don't see why we should block it. But I agree that we should prevent running the ball-tree (and even more so the kd-tree) algorithms, which rely on the `metric`/`metric_kwargs` parameters specifying a true metric in order to return correct results.
> I'll vote for 1., as I haven't seen a convincing use case to allow it, and on top of that it is available via passing `scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1)` to `KNeighborsClassifier`.
`scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1)` is a potential workaround, but it would not benefit from the optimized / chunked / parallel Cython implementation of the pairwise distance + reduction computation.
I would allow it for brute force and raise for KD/ball-tree, but no strong opinion.
@ogrisel What is your favorite option for which you would vote?
> I would allow it for brute force and raise for KD/ball-tree, but no strong opinion.
+1.
I could live with that. What do the others think? @raymondj-pace, @jjerphan, @adrinjalali, @eschibli?
Works for me. :+1:
Sounds like a good resolution to me.
Hi, if this still needs to be implemented, can I work on it? @adrinjalali @jjerphan @lorentzenchr @ogrisel
Hi @RudreshVeerkhare. Yes, this needs to be implemented and you can work on it!
/take
> Hi @RudreshVeerkhare. Yes, this needs to be implemented and you can work on it!
Thanks @jjerphan, I've started with the setup. I will communicate regarding further doubts or suggestions.
@jjerphan I'm done with the setup. Before I start writing code, I just want to make sure that I'm on the right track.
Basically, I need to add functionality to allow 0 < p < 1 only when the algorithm is explicitly set to "brute", and also to raise a warning that Minkowski with p < 1 is not a valid metric.
Is that correct?
Yes, based on what was concluded in the discussion starting from https://github.com/scikit-learn/scikit-learn/issues/22811#issuecomment-1239351666, you are correct, up to a clarification of what to raise.
I think @lorentzenchr meant to raise an error when users set `algo="ball_tree"` or `algo="kd_tree"` in this case (and I do think this is a better approach than raising a warning).
The logic which sets `_fit_method` also needs to be adapted for this case when users set `algo="auto"`; it starts here:
https://github.com/scikit-learn/scikit-learn/blob/60cc5b596f38d0d236dab34e02c05d98b5a72bad/sklearn/neighbors/_base.py#L585
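Roughly, the agreed behaviour could look something like the sketch below (a hypothetical illustration, not the actual patch; `effective_p` and `_fit_method` are the names used earlier in this thread):

```python
# Hypothetical sketch, not actual scikit-learn code: allow 0 < p < 1 only
# with the brute-force method, and error out for the tree-based methods.
if self.metric == "minkowski" and 0 < effective_p < 1:
    if self._fit_method in ("ball_tree", "kd_tree"):
        raise ValueError(
            "Minkowski with p < 1 is not a valid metric, and "
            f"algorithm='{self._fit_method}' requires a true metric. "
            "Use algorithm='brute' instead."
        )
```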
Thanks @jjerphan for the clarification, I'll start working on it, and will create a WIP PR. Will discuss further on it...
Describe the workflow you want to enable
I would like to be able to use the KNeighborsClassifier with something like:
```python
neigh = KNeighborsClassifier(n_neighbors=2, p=0.1)
```
The error you get is:
Change the above check so that it only rejects Minkowski p values < 0:
```python
if self.metric in ["wminkowski", "minkowski"] and effective_p < 0:
```
There is nothing wrong with using a p value > 0 and < 1.
Describe your proposed solution
Don't throw an exception if p is < 1.0.
Calculating the Minkowski distance is most definitely valid for 0 < p < 1, i.e., p = 0.1, 0.2, 0.3, 0.4, etc.
The only requirement is to raise the absolute difference in each dimension to the power p, sum over all dimensions, and then take the p-th root of the sum: `distance = distance_sum**(1/p)`.
There are many cases where it is desirable to compute Minkowski distances with 0 < p < 1.
Describe alternatives you've considered, if relevant
Write my own kNN classifier with my own minkowski distance calculator:
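The original snippet isn't included above; a minimal hypothetical sketch of such a hand-rolled alternative (a brute-force 1-NN with a custom Minkowski computation) might look like:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, p=0.5):
    # Hypothetical hand-rolled 1-NN: for each test point, compute Minkowski
    # values against all training points and take the label of the closest.
    preds = []
    for x in X_test:
        d = np.sum(np.abs(X_train - x) ** p, axis=1) ** (1.0 / p)
        preds.append(y_train[np.argmin(d)])
    return np.array(preds)
```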
Additional context
No response