TheCodeHere opened this issue 4 years ago
I'm not sure quite what you mean in suggestion one -- can you explain a little further?
For option two there used to be an option that tried to detect passing in the training data as query data and just returned the neighbor graph -- we could instead initialize with the neighbor graph. Would that suffice, or did you have more complex plans in mind?
Let me explain myself. Currently I'm working with Large Margin Nearest Neighbor (LMNN); the algorithm needs to compute a priori, for each input sample, the k nearest neighbors that share its class label, as determined by Euclidean distance. I'm trying to speed up this step with NN-Descent. What would be needed is a function that takes the training data and class labels as inputs and returns, for each sample in the training data, its k nearest neighbors with the same class label (avoiding, of course, the distance of each sample from itself, i.e., Distance(x_i, x_j) == 0 where x_i == x_j).
Ah, I see what you want to do now. It isn't obvious how to integrate that into the code in a way that wouldn't require a lot of special-case handling. It would likely be easier to simply split the training (and test) data by label, build n_classes indices, and query the index corresponding to the label of the query point.
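Something along these lines, as a rough sketch (it assumes the usual `NNDescent` constructor and its `query` method; `same_class_neighbors` is just an illustrative name, not part of the library, and for simplicity it assumes the self-neighbor comes back first, which is not always safe, as discussed below):

```python
import numpy as np
from pynndescent import NNDescent

def same_class_neighbors(data, labels, k, **params):
    # Illustrative sketch: build one index per class label and query
    # each class's points against their own index.
    n = data.shape[0]
    nn_indices = np.zeros((n, k), dtype=np.int64)
    nn_distances = np.zeros((n, k), dtype=np.float64)
    for label in np.unique(labels):
        class_idx = np.flatnonzero(labels == label)  # global positions
        index = NNDescent(data[class_idx], **params)
        # Ask for k + 1 neighbors so the self-neighbor can be dropped.
        idx, dist = index.query(data[class_idx], k=k + 1)
        # Naive: drop the first column as the self-neighbor; with
        # duplicate points this assumption can fail (see below).
        nn_indices[class_idx] = class_idx[idx[:, 1:]]
        nn_distances[class_idx] = dist[:, 1:]
    return nn_indices, nn_distances
```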
Well, yes, but doing that only fixes half of the problem. The function will still compute the distance of each sample from itself. I mean, I could look for the k + 1 nearest neighbors, but in some odd cases I have noticed that the distances are not always sorted (the distance of the sample x_i from x_i is not always first in the vector). This would imply extra work for the user.
Also, I was thinking of a function that could compute the k most representative data points of each class, i.e., the k samples that minimize the total distance to the other data points of the same class. It could be interesting.
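To illustrate, something like this rough brute-force sketch (the function is just my own illustration, not an existing API); for each class it returns the k samples with the smallest summed Euclidean distance to the other points of that class:

```python
import numpy as np

def class_representatives(data, labels, k):
    # Illustrative only: a k-medoid-like selection per class,
    # computed by brute force for clarity.
    reps = {}
    for label in np.unique(labels):
        pts = data[labels == label]
        # Pairwise Euclidean distances within the class.
        diffs = pts[:, None, :] - pts[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        # Total distance from each sample to its classmates.
        totals = dists.sum(axis=1)
        reps[label] = pts[np.argsort(totals)[:k]]
    return reps
```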
I would be happy to see a pull request to implement such a thing, but given that a user can work around these issues on their end I don't see this as a high enough priority feature for me to spend time implementing it myself.
Adding my 2 cents here since I had to handle a similar problem. Suppose you build a KNN index on a set of points `data` as follows:

```python
index_knn = NNDescent(data, **params)
```

The nearest neighbors of `data` can be accessed from the KNN index as follows:

```python
nn_indices = index_knn._neighbor_graph[0]
nn_distances = index_knn._neighbor_graph[1]
```
Each row of `nn_indices` and `nn_distances` has an entry corresponding to the point itself, which we want to remove. Ideally, we would expect that `nn_indices[i, 0] == i` and that `nn_distances[i, 0] == 0`. However, if the data has some overlapping points this may not be true. Hence, without any assumptions, I wrote this little function that removes the "self neighbor" entries from `nn_indices` and `nn_distances`. It is numba-compiled to be fast.
```python
import numpy as np
from numba import njit, float64, int64
from numba.types import Tuple


@njit(Tuple((int64[:, :], float64[:, :]))(int64[:, :], float64[:, :]))
def remove_self_neighbors(index_neighbors_, distance_neighbors_):
    """
    Given the index and distances of k nearest neighbors of a list of query points, remove points from their
    own neighbor list.

    :param index_neighbors_: numpy array of the index of `k` neighbors for a list of points. Has shape `(n, k)`,
                             where `n` is the number of query points.
    :param distance_neighbors_: numpy array of the distance of `k` neighbors for a list of points.
                                Also has shape `(n, k)`.

    :return: (index_neighbors, distance_neighbors), where each of them has shape `(n, k - 1)`.
    """
    n, k = index_neighbors_.shape
    index_neighbors = np.zeros((n, k - 1), dtype=index_neighbors_.dtype)
    distance_neighbors = np.zeros((n, k - 1), dtype=distance_neighbors_.dtype)
    for i in range(n):
        # Two-pointer copy: j2 scans the original row, j1 writes the
        # output row, skipping the entry whose index equals i (the
        # query point itself), wherever it happens to appear.
        j1 = j2 = 0
        while j1 < (k - 1) and j2 < k:
            if index_neighbors_[i, j2] != i:
                index_neighbors[i, j1] = index_neighbors_[i, j2]
                distance_neighbors[i, j1] = distance_neighbors_[i, j2]
                j1 += 1
            j2 += 1

    return index_neighbors, distance_neighbors
```
Query the `k + 1` nearest neighbors of points in `data` and then call the above function as follows:

```python
nn_indices = index_knn._neighbor_graph[0]
nn_distances = index_knn._neighbor_graph[1]
nn_indices, nn_distances = remove_self_neighbors(nn_indices[:, :(k + 1)], nn_distances[:, :(k + 1)])
```
Hope this helps!
@lmcinnes Is it correct to assume that the output of the `query` method of `NNDescent` returns the neighbors sorted by increasing distance to the query point? Thanks.
Technically you can make that assumption, but in some cases, like mine, there are overlapping points that can appear in first place instead of the query point itself. So, if you have several overlapping points, neither the position nor even the presence of the query point in the result vector can be assumed.
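If you want to check whether this affects your data, here is a small illustrative snippet (continuing from the `index_knn` and `data` used earlier in this thread):

```python
import numpy as np

# Query k + 1 neighbors and see in which rows the query point itself
# does not come back in position 0 (can happen with duplicate points).
idx, dist = index_knn.query(data, k=k + 1)
self_first = idx[:, 0] == np.arange(data.shape[0])
print("rows where the query point is not first:", np.flatnonzero(~self_first))
```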
1. In the querying operation, an option to return the k-NN with the same class label as the query_data.
2. An option that allows you to run the querying operation over the same data (`query_data = data`) while avoiding computing the distance of each point from itself (distance = 0).

This would be of great help for finding target neighbors in LMNN (Large Margin Nearest Neighbors).