stefankoegl / kdtree

A Python implementation of a kd-tree
ISC License
365 stars 118 forks source link

Correcting the kNN functionality. #19

Closed jhagege closed 8 years ago

jhagege commented 10 years ago

Abstract

Corrected the kNN functionality: Using a bounded priority queue (with bound=k) in order to save the k best results so far. I'm not sure but it looks to me a more elegant and more efficient solution (the whole logic of inserting and sorting is encapsulated inside the BoundedPriorityQueue class). It's your call if you think that you should keep the results logic as it was or use mine with the BoundedPriorityQueue logic. This has to do more with code elegance than pure functionality.

Functionality

Regarding functionality, the two most important edits are the following:

First edit

#get new best
- bestNode = next(iter(sorted(results, key=get_dist))) # It seems to me that you are fetching the best-node,
#which is having the smallest distance in the results list, and not habing the BIGGEST distance in the 
#results list. 

+ maxDist = results.max() # Instead we should fetch the MAX distance so far in the 
# results list and compare it.

(P.S: I'm having a hard time justifying you the correctness of the algorithm because there are a lot of details to follow along) so it would seriously help a great deal to go over this paper (less than half an hour, very clear and concise) in order to understand the logic here. (especially p.10-11)

Second Edit

+ if lineIntersects or results.size() < k 

The results.size() < k (BPQ isn't full) is critical because if we prune out parts of the tree before we have made at least k guesses, we might accidentally throw out one of the closest points.

Remarks

I'm a recently graduated computer scientist so I'm sure I could improve tons about coding style (for sure the tests don't look so great) and clearness but I tried to be as clear as possible.

I pushed my test files in order for you to test the functionality before/after the change but it's your choice if you want to include them in the new version. (You will need to pip install geopy to install dependencies) I discovered that the results I got were inconsistent when running kNN so I investigated it and I believe I identified a problem in the algorithm logic. Among others, it would be possible that I insert 15 nodes, searches for the 10 closest nodes, and receives only 5. I tried to describe above the reasons why I believe it did that.

I'd be happy to hear if I could make any improvements in my workflow for collaborating with open-source code, because it's my first time doing so and for sure I have tons to learn from you.

Maybe we could make a Skype sometimes.

According to the tests I ran, the functionality is now correct, but you should double-check that I didn't break anything in your code by making those changes..

I owe you a great deal anyway for this great kDD implementation, and I hope you will consider my pull request.

Best regards.

stefankoegl commented 10 years ago

Thanks for your pull request! As far as I could see at a first glance, the changes look good :) There a quite a few things that do not match the existing code style / best practices, but I've commented everything I could find so far.

Please have a look at my comments and let me know if you have any question.

stefankoegl commented 10 years ago

Also the travis tests did not pass (as you can see on https://travis-ci.org/stefankoegl/kdtree/builds/27852229). This was due to a print statement which failed on Python 3.

jhagege commented 10 years ago

Thank you very much for this thorough feedback. It will help a lot without a doubt to improve. I'll try to update the code ASAP to reflect your comments, but right now I have urgent deadlines so it will take some time..

I'll keep you updated!

stefankoegl commented 9 years ago

Did you have time to look at my comments so far?

jhagege commented 9 years ago

Hi Stefan, Unfortunately I am pretty busy at the moment and I won't have time soon to make a pull request anytime soon. This is definitely on my todo-list and I hope I can make it soon enough :-)

Good luck!

On Fri, Sep 26, 2014 at 6:12 PM, Stefan Kögl notifications@github.com wrote:

Did you have time to look at my comments so far?

— Reply to this email directly or view it on GitHub https://github.com/stefankoegl/kdtree/pull/19#issuecomment-56974965.

uozuAho commented 9 years ago

I think this is a good improvement, however there is a bug in _search_node(). 'absoluteDistance' needs to be squared before comparing it to 'maxDist', otherwise you end up searching the whole tree.

As it stands, @cyberjoac 's current kdtree.search_knn() outperforms this one by only a slim margin. I think this means the current search_knn() is also searching the whole tree unintentionally (and also sorting the best k repeatedly, which is bad for large k). Square 'absoluteDistance' in @cyberjoac 's _search_node, and search time decreases from (I think) N to log_N - a very nice speedup.