spotify / annoy-java

Approximate nearest neighbors in Java
Apache License 2.0
138 stars 46 forks source link

Euclidean Index and test against C++ version. #2

Closed nicolamontecchio closed 8 years ago

nicolamontecchio commented 8 years ago

Changes:

Note that the results do not exactly match between c++ and java versions, though they do largely overlap (at least 5 results out of 10 in common). I'll keep digging to figure out exactly what's different. However in the Angular index case, the code in this PR gives exactly the same results as before (i.e. results between java and c++ were already slightly different).

yonromai commented 8 years ago

@nicolamontecchio really cool stuff! Please see comments ;) Did you do any perf (speed) testing?

erikbern commented 8 years ago

Btw what are you using an Euclidean index for?

nicolamontecchio commented 8 years ago

i'll take a look at these tomorrow during a very long (thanks amtrak) train ride to boston

nicolamontecchio commented 8 years ago

See responses inline and code changes. I run IntelliJ's code formatter, should be properly formatted now

nicolamontecchio commented 8 years ago

on master: Avg. time per query: 0.076ms

this branch: Avg. time per query: 0.079ms

The dataset tested is 1M x 8, query times are averaged over 1M queries

a1k0n commented 8 years ago

Aha, so it is slower! :)

I was mainly concerned about the amount of garbage generated during searching, but maybe it's NBD since it's all virtually stack-allocated garbage. So I guess that's fine.

erikbern commented 8 years ago

@rohanag are you requesting a lot more than 10 neighbors because quality is better? you should really consider adding support for search_k in the java version

rohanag commented 8 years ago

@erikbern yup cos of quality. yeah will consider adding it

erikbern commented 8 years ago

actually probably should be less of a speedup in Java – most of the speedup in the Python/C++ version is that you avoid creating big expensive Python objects. Some of the speedup is caused by using a partial sort as well – instead of sorting 10000 elements, turns out STL has this nice std::partial_sort function: https://github.com/spotify/annoy/blob/master/src/annoylib.h#L574

nicolamontecchio commented 8 years ago

tested w/ 1k NN instead of 10; speed is again pretty much the same from master to this branch

yonromai commented 8 years ago

@nicolamontecchio pushed to maven central, be aware that it takes a while propage though.

nicolamontecchio commented 8 years ago

:100:

erikbern commented 8 years ago

:neckbeard: