scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

KNeighborsClassifier: allow p value for Minkowski calculation to be less than 1.0 (and greater than 0) #22811

Closed: raymondj-pace closed this issue 2 years ago

raymondj-pace commented 2 years ago

Describe the workflow you want to enable

I would like to be able to use the KNeighborsClassifier with something like:

neigh = KNeighborsClassifier(n_neighbors=2, p=0.1)

The error you get is:

File ~/opt/anaconda3/envs/tensorflow-3/lib/python3.10/site-packages/sklearn/neighbors/_base.py:395, in NeighborsBase._check_algorithm_metric(self)
    392     effective_p = self.p
    394 if self.metric in ["wminkowski", "minkowski"] and effective_p < 1:
--> 395     raise ValueError("p must be greater or equal to one for minkowski metric")

ValueError: p must be greater or equal to one for minkowski metric

Change the above check so that it only rejects non-positive p values:

if self.metric in ["wminkowski", "minkowski"] and effective_p <= 0:

There is nothing wrong with using a p value > 0 and < 1.

Describe your proposed solution

Don't throw an exception when 0 < p < 1.0.

Calculating the Minkowski distance is perfectly valid for 0 < p < 1, e.g. p = 0.1, 0.2, 0.3, 0.4, etc.

The only requirement is to raise the absolute difference in each dimension to the p-th power, sum over all dimensions, and then take the p-th root of the sum: d(x, y) = (sum_i |x_i - y_i|^p)^(1/p), i.e. `distance = distance_sum**(1/p)`.

There are many cases where it is desirable to compute the Minkowski distance with 0 < p < 1.

Describe alternatives you've considered, if relevant

Write my own kNN classifier with my own minkowski distance calculator:

def minkowski_distance(row1, row2, p=3):
    # accumulate |x_i - y_i|**p over the feature dimensions
    # (the last element of each row is skipped, assumed to be the class label)
    distance = 0.0
    for i in range(len(row1) - 1):
        distance += abs(row1[i] - row2[i]) ** p
    # finish with the p-th root of the accumulated sum
    return distance ** (1 / p)

Additional context

No response

jeremiedbb commented 2 years ago

The issue is that Minkowski with p < 1 is not a metric (no triangle inequality). The question then is: do we want to support similarity measures that are not metrics?

raymondj-pace commented 2 years ago

Yes, I agree it's not a metric. My advisor, who spent his entire life studying distance (Euclidean and non-Euclidean), would say that Minkowski with p < 1 has much value. I have a problem where I need to iterate over different values of p for Minkowski, but there is a way around this issue (for p < 1): I can use a callable for my own Minkowski method and pass a parameter like this.

Here "minkowski_distance" is my method...

mink_p = 0.5
step = 0.1

while mink_p <= 2.5:

    neigh = KNeighborsClassifier(n_neighbors=1, metric=minkowski_distance, metric_params={"minkowski_param" : mink_p})
    neigh.fit(X_ref_normalized, y_ref)

    ...

    mink_p += step
    mink_p = round(mink_p, 1)
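
For this to work, the callable needs to accept each `metric_params` entry as a keyword argument. A minimal sketch of a compatible signature (the body mirrors the calculator above; `minkowski_param` is just the key I chose in `metric_params`):

```python
def minkowski_distance(row1, row2, minkowski_param=1.0):
    # scikit-learn forwards the metric_params entries as keyword arguments
    distance = 0.0
    for i in range(len(row1)):
        distance += abs(row1[i] - row2[i]) ** minkowski_param
    return distance ** (1.0 / minkowski_param)
```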

This will solve my original problem, and anyone else's who wants p < 1 for Minkowski.

I have another very closely related issue with KNeighborsClassifier - if it needs a new issue I can open one.

What about adding a "weights_params" option? In the code above there are "metric" and "metric_params", but for weights there is only "weights" (which can be a callable), with no "weights_params".

I have a case where I need to iterate over the value of p for the weights method, and I have to do something like this:

for w in [weights_p_5, weights_p1, weights_p1_5, weights_p2, weights_p2_5]:

    neigh = KNeighborsClassifier(n_neighbors=k, weights=w)
    neigh.fit(X_ref_normalized, y_ref)
    ...

In this example, weights_p_5 is a method that computes the weight as 1/(dist**0.5), and weights_p1_5 computes weights as 1/(dist**1.5).

How about adding a weights_params option, just like there is for metric (both a callable and a dictionary of parameters to pass)?

How about including something like this, with a new "weights_params" option:

weight_p = 0.5
step = 0.5

while weight_p <= 2.5:

    neigh = KNeighborsClassifier(n_neighbors=1, weights=my_weight_method, weights_params={"weight_param" : weight_p})
    neigh.fit(X_ref_normalized, y_ref)

    ...

    weight_p += step
    weight_p = round(weight_p, 1)
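
Until something like that exists, a workaround is to bind the exponent with `functools.partial`. A minimal sketch, assuming a weight function that maps a distance array to 1/dist**p weights (`inverse_power_weights` is an illustrative name, not an existing helper):

```python
from functools import partial

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def inverse_power_weights(distances, p=1.0):
    # scikit-learn passes an array of neighbor distances and expects an
    # array of the same shape back; clip to avoid division by zero
    return 1.0 / np.maximum(distances, 1e-12) ** p

for weight_p in (0.5, 1.0, 1.5, 2.0, 2.5):
    neigh = KNeighborsClassifier(n_neighbors=1,
                                 weights=partial(inverse_power_weights, p=weight_p))
    # neigh.fit(X_ref_normalized, y_ref), etc., as in the loops above
```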

Thanks -Ray

eschibli commented 2 years ago

> The issue is that Minkowski with p < 1 is not a metric (no triangle inequality). The question then is: do we want to support similarity measures that are not metrics?

That seems like a typical example of where it's best to allow that behaviour but raise a warning.

adrinjalali commented 2 years ago

@raymondj-pace would you mind sharing some references for us to see how this would be useful and how it would be sensible to apply KNN on a non-metric function like this?

raymondj-pace commented 2 years ago

One is here: "Combining Minkowski and Chebyshev" (arXiv): https://arxiv.org/pdf/2112.12549

Scroll down to page 69 for the diagrams with p = 0.5 and p = 0.75.

The same diagrams for p < 1 are also on Wikipedia: https://en.wikipedia.org/wiki/Minkowski_distance

adrinjalali commented 2 years ago

@scikit-learn/core-devs are we happy with the literature on p<1 here?

jjerphan commented 2 years ago

I think that I would rather not support it as a DistanceMetric since it is not a distance metric for the reason given by @jeremiedbb above.

Note that you might still be able to define your own distance using the "pyfunc" metric to support those cases; see the documentation.
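
Something along these lines should work (a rough sketch; `my_minkowski` is an illustrative helper, not an existing function):

```python
import numpy as np
from sklearn.metrics import DistanceMetric

def my_minkowski(x, y, p=0.5):
    # plain Minkowski formula, well defined for any p > 0
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

dist = DistanceMetric.get_metric("pyfunc", func=my_minkowski)
# dist.pairwise(X) now evaluates my_minkowski on each pair of rows
```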

lorentzenchr commented 2 years ago

This functionality is available via scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1), and KNeighborsClassifier accepts a callable as metric. I'm fine with either the current behaviour (not allowing p < 1) or at least throwing a warning.
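
A rough sketch of that route via `metric="precomputed"` (`X_train`, `X_test`, `y_train` are illustrative names):

```python
from scipy.spatial.distance import cdist
from sklearn.neighbors import KNeighborsClassifier

# precompute the pairwise (pseudo-)distances with scipy, then let the
# classifier consume them directly
neigh = KNeighborsClassifier(n_neighbors=2, metric="precomputed")
neigh.fit(cdist(X_train, X_train, "minkowski", p=0.1), y_train)
pred = neigh.predict(cdist(X_test, X_train, "minkowski", p=0.1))
```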

adrinjalali commented 2 years ago

So should we document this one instead of implementing it?

ogrisel commented 2 years ago

I am fine with allowing p < 1.0 in KNeighborsClassifier without a warning when using the brute-force method, as it's perfectly fine to return the data points with the lowest pseudo-metric values.

But we should still raise an exception when using ball-tree because then the triangle inequality is required to return correct results.

The DistanceMetric subclasses could expose an additional attribute, true_metric or something (with a boolean value), to make it explicit.
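
Something like this, as an illustrative sketch of the idea (not an existing attribute):

```python
# hypothetical sketch: a metric object advertises whether it satisfies the
# triangle inequality, so estimators can validate the algorithm choice
class PMinkowski:
    def __init__(self, p):
        self.p = p

    @property
    def true_metric(self) -> bool:
        # Minkowski satisfies the triangle inequality only for p >= 1
        return self.p >= 1
```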

ogrisel commented 2 years ago

> I am fine with allowing p < 1.0 in KNeighborsClassifier without a warning when using the brute-force method, as it's perfectly fine to return the data points with the lowest pseudo-metric values.

Actually, a pseudometric would still satisfy the triangle inequality. It's just that d(x, y) = 0 does not imply the identity x = y.

Still, the KNeighborsClassifier results would be correct if the neighbors are computed with the brute-force method. I would just document in the docstring that it's not a metric for p < 1.0.

jeremiedbb commented 2 years ago

> Actually, a pseudometric would still satisfy the triangle inequality.

Actually, Minkowski with p < 1 does not satisfy the triangle inequality. Take the points (0,0), (1,1) and (0,1) (example from Wikipedia): with p = 0.5, d((0,0),(1,1)) = (1 + 1)^2 = 4, while d((0,0),(0,1)) + d((0,1),(1,1)) = 1 + 1 = 2.

lorentzenchr commented 2 years ago

What's our conclusion?

  1. Do not allow p<1 and raise an error (current behaviour). Action: Improve error message and close this issue.
  2. Allow it and raise a warning. Action: PR to allow p<1 with warnings and tests.

I'll vote for 1, as I haven't seen a convincing use case for allowing it, and on top of that, it is available by passing scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1) to KNeighborsClassifier.

ogrisel commented 2 years ago

> Actually, Minkowski with p < 1 does not satisfy the triangle inequality (take the points (0,0), (1,1) and (0,1); example from Wikipedia).

Yes, I agree; I was correcting my previous comment, where I misused the word "pseudometric".

Still, brute-force kNN is well defined for p < 1, so I don't see why we should block it. But I agree that we should prevent running the ball-tree (and even more so the kd-tree) algorithms, which rely on the metric/metric_params parameters to specify a true metric in order to return correct results.

ogrisel commented 2 years ago

> I'll vote for 1, as I haven't seen a convincing use case for allowing it, and on top of that, it is available by passing scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1) to KNeighborsClassifier.

Option 1 is definitely simpler to implement and maintain, but it somewhat artificially limits the potential use cases for this class.

scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1) is a potential workaround but would not benefit from the optimized / chunked / parallel Cython implementation of the pairwise distance + reduction computation.

jeremiedbb commented 2 years ago

I would allow it for brute force and raise for KD/ball-tree, but no strong opinion.

lorentzenchr commented 2 years ago

@ogrisel What is your favorite option for which you would vote?

ogrisel commented 2 years ago

> I would allow it for brute force and raise for KD/ball-tree, but no strong opinion.

+1.

lorentzenchr commented 2 years ago

I could live with that. What do the others think? @raymondj-pace, @jjerphan, @adrinjalali, @eschibli?

jjerphan commented 2 years ago

Works for me. :+1:

adrinjalali commented 2 years ago

Sounds like a good resolution to me.

RudreshVeerkhare commented 2 years ago

Hi, if this still needs to be implemented, can I work on it? @adrinjalali @jjerphan @lorentzenchr @ogrisel

jjerphan commented 2 years ago

Hi @RudreshVeerkhare. Yes, this needs to be implemented and you can work on it!

RudreshVeerkhare commented 2 years ago

/take

RudreshVeerkhare commented 2 years ago

> Hi @RudreshVeerkhare. Yes, this needs to be implemented and you can work on it!

Thanks @jjerphan 😁, I've started with the setup. Will get in touch with further doubts or suggestions. 😄

RudreshVeerkhare commented 2 years ago

@jjerphan I'm done with the setup. So before I start writing code, I just want to make sure that I'm on the right track.

Basically, I need to allow values of 0 < p < 1 only when the algorithm is explicitly set to "brute", and also to raise a warning that, with p < 1, Minkowski is not a valid metric.

Is that correct? πŸ€”

jjerphan commented 2 years ago

Yes. Based on what was concluded in the discussion starting from https://github.com/scikit-learn/scikit-learn/issues/22811#issuecomment-1239351666, you are correct, up to one clarification about what to raise.

I think @lorentzenchr meant to raise an error when users set algo="ball_tree" or algo="kd_tree" in this case (and I do think this is a better approach than raising a warning).

The logic that sets `_fit_method` also needs to be adapted for the case where users set algo="auto"; it starts here: https://github.com/scikit-learn/scikit-learn/blob/60cc5b596f38d0d236dab34e02c05d98b5a72bad/sklearn/neighbors/_base.py#L585
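
A rough sketch of the shape the adapted check could take (illustrative only, not the final patch):

```python
# inside NeighborsBase._check_algorithm_metric, roughly:
if self.metric in ("wminkowski", "minkowski"):
    if effective_p <= 0:
        raise ValueError("p must be greater than 0 for minkowski metric")
    if effective_p < 1 and self.algorithm in ("ball_tree", "kd_tree"):
        # trees rely on the triangle inequality to prune correctly
        raise ValueError(
            "Minkowski with p < 1 is not a valid metric; "
            "only algorithm='brute' supports it"
        )
# and the algorithm='auto' branch should then fall back to 'brute'
# whenever 0 < p < 1
```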

RudreshVeerkhare commented 2 years ago

Thanks @jjerphan for the clarification. I'll start working on it and will create a WIP PR; we can discuss it further there... 😄