nmonath closed 10 years ago
Apparently Python's hash function is great. I ran another test on the 3-million-word Google News embeddings model.
import Features
import Word2VecExecuter
model = Word2VecExecuter.Word2VecGetModel('/Users/nmonath/Downloads/GoogleNews-vectors-negative300.bin')
Features.NumberOfHashCollisions(model.vocab)
Out[4]: 0
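The body of `Features.NumberOfHashCollisions` isn't shown in this thread. For anyone who wants to reproduce the check, here is a minimal sketch of what such a counter might look like, assuming it counts distinct keys that map to a hash value already taken by a different key. The name `number_of_hash_collisions` and the `hash_fn` parameter are hypothetical and not taken from the repo.

```python
from collections import defaultdict


def number_of_hash_collisions(keys, hash_fn=hash):
    """Count distinct keys whose hash value is shared with an earlier distinct key.

    Hypothetical stand-in for Features.NumberOfHashCollisions; the real
    implementation may differ.
    """
    buckets = defaultdict(set)
    for key in keys:
        # Group distinct keys by hash value; repeated keys are not collisions.
        buckets[hash_fn(key)].add(key)
    # A bucket holding k distinct keys contributes k - 1 collisions.
    return sum(len(bucket) - 1 for bucket in buckets.values())


# Sanity check with a deliberately bad hash (string length):
# "a", "b", "c" all hash to 1, so two of them collide.
collisions_with_bad_hash = number_of_hash_collisions(["a", "b", "c"], hash_fn=len)
```

Passing the model's vocabulary (any iterable of words) with the default `hash_fn` would run the same kind of test as above. As a rough expectation, if `hash()` behaved like a uniform 64-bit function, the birthday estimate n²/2⁶⁵ for n = 3 million keys would be far below one collision, so a result of 0 is what you'd expect to see.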
So I ran a test checking how many collisions we get on a particular chunk of data. We should do more thorough testing for the final paper, but I think that these results show that we can be confident in using hashing. There were no collisions in 800k+ entries: