nmonath / NLPProject

Repository for our Final Project

Perform Hash Collisions Test #13

Closed nmonath closed 10 years ago

nmonath commented 10 years ago

I ran a test to count the hash collisions we get on one chunk of the data. We should test more thoroughly for the final paper, but I think these results show that we can be confident in using hashing. There were no collisions among the 823,639 units loaded from the Reuters-21578 test set:

import Features

all_units = Features.LoadAllUnitsFromFiles('../data_sets/reuters_21578/test/', funit=Features.FeatureUnits.BOTH)

Documents Processed: 03019

len(all_units)
Out[6]: 823639

Features.NumberOfHashCollisions(all_units)
Out[7]: 0
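
For reference, a minimal sketch of what a check like this could look like. The real `Features.NumberOfHashCollisions` is not shown in this thread, so `count_hash_collisions` and its `hash_fn` parameter are hypothetical names, and the definition of a collision used here (two distinct items sharing a hash value) is an assumption.

```python
from collections import defaultdict


def count_hash_collisions(items, hash_fn=hash):
    """Count collisions among the distinct items in `items`.

    Hypothetical helper, not the project's implementation. Repeated
    occurrences of the same item are not counted as collisions.
    """
    buckets = defaultdict(set)
    for item in items:
        # Group distinct items by their hash value.
        buckets[hash_fn(item)].add(item)
    # A bucket holding k distinct items contributes k - 1 collisions.
    return sum(len(bucket) - 1 for bucket in buckets.values())
```

Passing a deliberately weak function such as `len` as `hash_fn` is a quick way to confirm that the counter reports collisions when they do exist.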

nmonath commented 10 years ago

Apparently Python's built-in hash function holds up well. I ran another test on the vocabulary of the 3-million-word Google News embeddings model and again got no collisions:


import Features

import Word2VecExecuter

model = Word2VecExecuter.Word2VecGetModel('/Users/nmonath/Downloads/GoogleNews-vectors-negative300.bin')

Features.NumberOfHashCollisions(model.vocab)
Out[4]: 0
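
A rough sanity check on why zero collisions is plausible: the birthday approximation gives the expected number of colliding pairs as n(n-1)/2 divided by the number of possible hash values. The sketch below assumes the hash values are spread uniformly over the full hash width and that nothing reduces them to a smaller range afterwards. Neither assumption is confirmed for `Features`, and `expected_collisions` is a hypothetical helper.

```python
import sys


def expected_collisions(n_items, hash_bits=sys.hash_info.width):
    """Approximate expected number of colliding pairs among n_items keys.

    Assumes an idealized hash that is uniform over 2**hash_bits values.
    The default width is whatever the running interpreter reports.
    """
    pairs = n_items * (n_items - 1) / 2  # number of distinct pairs of keys
    return pairs / 2 ** hash_bits
```

With `hash_bits=64`, the 823,639 units from the first test give an expectation on the order of 1e-8 colliding pairs, so seeing none is what we would expect. With `hash_bits=32` the same count gives roughly 79, so the result depends heavily on the hash width of the Python build in use.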