spotify / annoy-java

Approximate nearest neighbors in Java
Apache License 2.0
138 stars 46 forks source link

casting bugs and minor optimization #11

Closed julsal closed 7 years ago

julsal commented 7 years ago

1- removing bugs related to cast (int -> long | long -> int); 2- optimizing the number of MappedByteBuffer.

julsal commented 7 years ago

The current test set doesn't catch potential casting problems. It was necessary to test with a real large file (>2GB), which was not included in PR because of its size.

erikbern commented 7 years ago

would be nice to add unit tests for this

julsal commented 7 years ago

@erikbern Yes, it would be nice to have better unit tests, but I didn't find a suitable way to test large indexes without a real large file. Regardless, making NODE_SIZE long type, makes all casts mandatory (otherwise, it does not compile).

julsal commented 7 years ago

@a1k0n, is everything ok to merge and publish a new maven version? :) By the way, we are using effectively this version and it is running gratefully so far.

a1k0n commented 7 years ago

Yes, looks good to me. But I'm not the maintainer anymore. I guess this version is somewhat abandoned since @yonromai moved on to a JNI version.

@erikbern or @yonromai could you merge this fix?

erikbern commented 7 years ago

there you go

julsal commented 7 years ago

@Yuanrui1994 I read about annoy's algorithm and found out that it is supported by random projections. @erikbern Could you tell us if random projections make annoy's output non-deterministic?

erikbern commented 7 years ago

yes non deterministic unless you set the seed explicitly

Yuanrui1994 commented 7 years ago

On Jul 24, 2017 1:48 PM, Juliano Efson Sales notifications@github.com wrote:

@Yuanrui1994https://github.com/yuanrui1994 I read about annoy's algorithm and found out that it is supported by random projections. @erikbernhttps://github.com/erikbern Could you tell us if random projections make annoy's output non-deterministic?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/spotify/annoy-java/pull/11#issuecomment-317549716, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AUTtrmNtq87duK8sIXbSgJ9v8gQGVP7Xks5sRQMIgaJpZM4OUCLc.

Hi,

Now I am confused. I read the annoy-java code and it seems to be determinstic. Why is it non determinstic and which part is non determinstic?

I might not be clear about my question. I first created a index tree using annoy-python and then loaded the index tree and did some searching using both old version annoy-java and the latest version. And I got different results.

Thanks for your explanation!

Yuanrui

ghost commented 7 years ago

Querying should be deterministic. Index building is random. So you may be encountering a bug.