scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License
59.43k stars 25.26k forks source link

BallTree "query_radius" method giving inconsistent results when inputted radius is distances obtained from "query" method #18087

Closed carlosg-m closed 3 years ago

carlosg-m commented 4 years ago

Describe the bug

1) A BallTree is created using a set of points that include duplicate values, representing different objects in the same spatial position; 2) The tree is queried using a second set of points, to determine the distance to the first closest neighbor; 3) Using the distances as radius, the tree is queried to obtain all the indexes from the first set of points that are closest neighbours with the second set. In situations where 2 or more key points have the same distance and are closest to a query point, I want to obtain all key points indexes. 4) However probably due to rounding errors some key indexes will not appear in the query_radius output.

Steps/Code to Reproduce

keys = np.array([[0.55039766, 0.84536002],
                 [0.55039766, 0.84536002],
                 [0.61324973, 0.05385278],
                 [0.61324973, 0.05385278],
                 [0.61324973, 0.05385278],
                 [0.54551389, 0.9451975 ],
                 [0.54551389, 0.9451975 ],
                 [0.76230722, 0.83559768],
                 [0.76230722, 0.83559768],
                 [0.78700191, 0.61089515],
                 [0.78700191, 0.61089515]])

tree = BallTree(keys, metric='l2', leaf_size=5)    

query = np.array([[0.5488135 , 0.71518937],
                  [0.60276338, 0.54488318],
                  [0.4236548 , 0.64589411],
                  [0.43758721, 0.891773  ],
                  [0.96366276, 0.38344152],
                  [0.79172504, 0.52889492],
                  [0.56804456, 0.92559664],
                  [0.07103606, 0.0871293 ],
                  [0.0202184 , 0.83261985],
                  [0.77815675, 0.87001215]])

distance, index = tree.query(query, k=1)

result = tree.query_radius(query, r=distance.ravel())

Expected Results

array([array([1, 0]), 
       array([10,  9]), 
       array([1, 0]), 
       array([5, 6]),
       array([10,  9]), 
       array([10,  9]), 
       array([5, 6]), 
       array([2, 3, 4]),
       array([1, 0]), 
       array([7, 8])], dtype=object)

Actual Results

array([array([], dtype=int64), 
       array([10,  9]), 
       array([1, 0]),
       array([5, 6]), 
       array([10,  9]), 
       array([10,  9]),
       array([], dtype=int64), 
       array([2, 3, 4]), 
       array([1, 0]),
       array([], dtype=int64)], dtype=object)

Versions

System:
    python: 3.6.9 (default, Jul 17 2020, 12:50:27)  [GCC 8.4.0]
executable: /usr/bin/python3
   machine: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
       pip: 19.3.1
setuptools: 49.2.0
   sklearn: 0.22.2.post1
     numpy: 1.18.5
     scipy: 1.4.1
    Cython: 0.29.21
    pandas: 1.0.5
matplotlib: 3.2.2
    joblib: 0.16.0

Built with OpenMP: True
jnothman commented 4 years ago

I suspect there is no reasonable way we can solve this, due to the way floating point numbers work. Add a small value to the radius you query with.

carlosg-m commented 4 years ago

I suspect there is no reasonable way we can solve this, due to the way floating point numbers work. Add a small value to the radius you query with.

A possible solution would be to include an option in the query method to allow multiple indexes to be returned, even if k=1. I don't understand why this is not the default behaviour.

I'm currently using a rtree to index all my objects that have geographic coordinates, which means I have to project the points into a cartesian plane. Sklearn's BallTree is fast and supports Haversine distance meaning that I can use WGS84.

Another option would be to create the tree using only unique coordinate pairs, and then join all corresponding objects using latitude and longitude as keys in Pandas, which is turning out to be a slow operation that takes more time than the actual BallTree, even if use Lat/Lon as the DataFrame index.

carlosg-m commented 4 years ago

I haven't tested this out, but adding epsilon to the distance might do the trick:

result = tree.query_radius(query, r=distance.ravel()+np.finfo(np.float64).eps)
cmarmo commented 3 years ago

@carlosg-m, is the suggested solution fixing your issue? May I close it? Thanks!