haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io
Other
5.99k stars 1.12k forks source link

Bug cutting down CoverTree KNN query results #686

Closed lshin-apl closed 3 years ago

lshin-apl commented 3 years ago

Describe the bug The CoverTree k-nearest neighbors query is not returning the k-nearest neighbors.

Expected behavior After creating a CoverTree with 15 data points (length 2 double arrays) with euclidean distance, I submit a query for the 12 nearest neighbors to a single data point. I expect the 12 nearest neighbors to be returned.

Actual behavior Instead, the 12 furthest neighbors are returned.

Input data

double[][]{
  {0., 4.}, 
  {4., 2.}, 
  {7., 2.}, 
  {0., -1.}, 
  {4., -1.}, 
  {8., 0.}, 
  {0., 3.}, 
  {2., 0.}, 
  {7., 4.}, 
  {4., 4.}, 
  {6., 0.}, 
  {8., -4.}, 
  {0., -4.}, 
  {4., 0.},
  {8., 4}
}

Code snippet

tree = CoverTree<>(data, euclideanMetric);
neighbors = tree.knn(new double[]{4., 0.}, 12);
// neighbors: [[8, 4], [0, -4], [8, -4], [0, 4], [7, 4], [0, 3], [0, -1], [4, 4], [8, 0], [7, 2], [6, 0], [2, 0]]

The above code returns neighbors that don't include the actual nearest neighbors to [4, 0] which are [4, 0], [4, -1], and [4, 2]. I believe this is because of the following lines of code in CoverTree.java:

(CoverTree.java lines 557-563)

Arrays.sort(neighbors);

MathEx.reverse(neighbors);

if (neighbors.length > k) {
   neighbors = Arrays.copyOf(neighbors, k);
}

Neighbors are sorted in increasing order of distance, then reversed, and then the first K elements are taken. If neighbors has greater than K elements, this returns the subset of neighbors with the K largest distances from the query point.

Additional context

haifengl commented 3 years ago

Fixed. Please try master branch. Thanks!

lshin-apl commented 3 years ago

@haifengl Thank you! Are you planning a release anytime soon? I have a project that uses Smile through the Maven Repo.

haifengl commented 3 years ago

We are a couple of months away from next release.