giacbrd / ShallowLearn

An experiment in re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText), with some additional exclusive features and a nice API. Written in Python and fully compatible with Scikit-learn.
GNU Lesser General Public License v3.0

make hash function faster #13

Open giacbrd opened 7 years ago

giacbrd commented 7 years ago

Iterating over documents and hashing words is almost an order of magnitude slower than iterating without hashing. What can we do?
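For reference, a minimal micro-benchmark sketch of the kind of measurement behind this claim (hypothetical; it uses Python's built-in hash() as a stand-in for the hash function in utils.py):

```python
import timeit

docs = [["token%d" % i for i in range(1000)]] * 100  # fake corpus of 100 docs
buckets = 2 ** 20

def plain_iter():
    # baseline: just iterate over every word
    for doc in docs:
        for word in doc:
            pass

def hashed_iter():
    # same iteration, but hash every word into a fixed number of buckets
    for doc in docs:
        for word in doc:
            _ = hash(word) % buckets

print("plain :", timeit.timeit(plain_iter, number=10))
print("hashed:", timeit.timeit(hashed_iter, number=10))
```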

prakhar2b commented 7 years ago

Hi @giacbrd, I looked into the code for HashIter and hash in utils.py. I would like to work on this, and later integrate it into Gensim's LabeledWord2Vec code.

Please guide me a little on the right approach to do this. Also, what do you mean by

but also the fastest hash functions seem slow

Thanks

giacbrd commented 7 years ago

Hi, I think the best approach here would be to define a data structure that encapsulates hash buckets, which could replace the standard dictionary used as Word2Vec's vocab.

In Gensim's Word2Vec the vocabulary of words is the "main" data structure of the algorithm, but the choice of a dictionary is hard-coded. It would be preferable to allow setting a custom hash map, for example one that projects keys onto N (fixed-size) values.
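A minimal sketch of such a custom hash map, assuming nothing about Gensim's internals: a dict-like class whose keys are always projected onto a fixed number of buckets (the class name and parameters are hypothetical):

```python
class BucketDict(dict):
    """Dict whose keys are always reduced to hash(key) % buckets."""

    def __init__(self, buckets=2 ** 20):
        super(BucketDict, self).__init__()
        self.buckets = buckets

    def _key(self, key):
        return hash(key) % self.buckets

    def __setitem__(self, key, value):
        super(BucketDict, self).__setitem__(self._key(key), value)

    def __getitem__(self, key):
        return super(BucketDict, self).__getitem__(self._key(key))

    def __contains__(self, key):
        return super(BucketDict, self).__contains__(self._key(key))
```

Note that such a subclass only intercepts direct indexing; a real drop-in replacement for the vocab would also need to cover update(), get(), and the other dict methods.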

In ShallowLearn we transform words to hashes in advance (thus losing the actual words); these hashes replace the words in the algorithm.

Changing the Gensim approach would require a substantial refactoring, which should be part of the general improvement to the Word2Vec code architecture.

Answering the quote: I tried to define the hash function in Cython, but it was not faster than the one currently in the codebase.

prakhar2b commented 7 years ago

@giacbrd In Facebook's fastText, n-gram features are hashed into a fixed number of buckets in order to limit the memory usage of the model. What is the purpose of transforming words to hashes in advance in ShallowLearn?

Can you make it clearer please: regarding "thus losing the actual words", can't we access those words from the mapping later on?

giacbrd commented 7 years ago

The purpose is the same: when you iterate over words and immediately transform them into hash(word) % buckets, your vocabulary size will be <= buckets. The problem with this approach is that the vocabulary (a standard Python dictionary) will contain word buckets, so if you don't save a map of word_bucket -> word you lose the word embedding information. This information, in the case of the supervised models, can be pointless: the goal of these models is not to obtain word representations.
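A rough sketch of this pre-hashing step (hypothetical names, not the actual ShallowLearn code), with an optional reverse map so the original words are not lost:

```python
buckets = 2 ** 20
bucket_to_word = {}

def hash_document(doc, keep_words=False):
    """Replace every word of a tokenized document with hash(word) % buckets."""
    hashed = []
    for word in doc:
        bucket = hash(word) % buckets
        if keep_words:
            # on collisions only the first word seen is kept for that bucket
            bucket_to_word.setdefault(bucket, word)
        hashed.append(bucket)
    return hashed

print(hash_document(["the", "cat", "sat"], keep_words=True))
```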

This part is very clear in the code, e.g. https://github.com/giacbrd/ShallowLearn/blob/master/shallowlearn/word2vec.py#L218

The hash trick is necessary if you want to expand the feature space with character n-grams of words, because otherwise that space is going to explode!
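As an illustration, here is a rough sketch (not fastText's or ShallowLearn's exact scheme) that hashes the character n-grams of a word into a fixed number of buckets, so the feature space stays bounded no matter how many distinct n-grams the corpus contains:

```python
buckets = 2 ** 21

def char_ngrams(word, n_min=3, n_max=6):
    # pad the word with boundary markers, then yield all n-grams of length n_min..n_max
    padded = "<" + word + ">"
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

def ngram_buckets(word):
    # each n-gram is mapped onto one of a fixed number of buckets
    return [hash(gram) % buckets for gram in char_ngrams(word)]

print(ngram_buckets("where"))  # a handful of bucket ids, regardless of vocabulary size
```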