giacbrd opened this issue 7 years ago
Hi @giacbrd, I looked into the code for HashIter and hash in utils.py; I would like to work on this and later integrate it into the Gensim-based LabeledWord2Vec code.
Please guide me a little on the right approach to do this. Also, what do you mean by:
"but also the fastest hash functions seem slow"?
Thanks
Hi, I think the best approach here would be to define a data structure built around hash buckets, which could replace the standard dictionary used for Word2Vec's vocab.
In Gensim's Word2Vec the vocabulary of words is the "main" data structure of the algorithm, but the choice of a dictionary is hard-coded. It would be preferable to be able to set a custom hash map, for example one that projects keys onto a fixed number N of values.
In ShallowLearn we transform words into hashes in advance (thus losing the actual words); these hashes replace the words in the algorithm.
Changing the Gensim approach would require a substantial refactoring, which should be part of the general improvement of the Word2Vec code architecture.
As for the quote: I tried to define the hash function in Cython, but it was not faster than the one currently in the codebase.
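To give an idea of the first point, here is a minimal sketch of a bucketed vocabulary lookup; the class name and details are hypothetical, this is not the actual Gensim or ShallowLearn code:

```python
# Hypothetical sketch: a dict-like vocab where every key is projected onto a
# fixed number of buckets, so it could stand in for Word2Vec's plain dict.
class BucketVocab:
    def __init__(self, buckets=1000000):
        self.buckets = buckets   # fixed size of the key space
        self._data = {}          # bucket id -> vocab entry

    def _bucket(self, word):
        # project an arbitrary word onto [0, buckets)
        return hash(word) % self.buckets

    def __setitem__(self, word, entry):
        self._data[self._bucket(word)] = entry

    def __getitem__(self, word):
        return self._data[self._bucket(word)]

    def __contains__(self, word):
        return self._bucket(word) in self._data

    def __len__(self):
        # never exceeds the number of buckets
        return len(self._data)
```

Note that Python's built-in hash() is salted per process, so a real implementation would need a deterministic hash function (like the custom hash in utils.py) to get reproducible buckets.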
@giacbrd In Facebook's fastText, n-gram features are hashed into a fixed number of buckets in order to limit the memory usage of the model. What is the purpose of transforming words into hashes in advance in ShallowLearn?
Can you also clarify what you mean by "thus losing the actual words"?
Can't we access those words from the mapping later on?
The purpose is the same: when you iterate over words and immediately transform each of them into hash(word) % buckets, your vocabulary size will be <= buckets.
The problem with this approach is that the vocabulary (a standard Python dictionary) will contain word buckets, so if you don't save a map of word_bucket -> word you lose the word embedding information. In the case of the supervised models this information can be pointless: the goal of these models is not to obtain word representations.
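A rough illustration of that trade-off (function and variable names are just for this example, not library code):

```python
# Hash words in advance while keeping an optional reverse map,
# so the original words are not lost.
def hash_documents(documents, buckets=1000000):
    reverse_map = {}   # word bucket -> set of original words
    hashed_docs = []
    for doc in documents:
        hashed_doc = [hash(word) % buckets for word in doc]
        for word, bucket in zip(doc, hashed_doc):
            # without this map, there is no way back from bucket to word
            reverse_map.setdefault(bucket, set()).add(word)
        hashed_docs.append(hashed_doc)
    return hashed_docs, reverse_map

docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]
hashed, reverse_map = hash_documents(docs, buckets=100)
print(hashed)        # lists of bucket ids, not words
print(reverse_map)   # drop it if you don't care about the actual words
```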
This part is very clear in the code, e.g. https://github.com/giacbrd/ShallowLearn/blob/master/shallowlearn/word2vec.py#L218
The hash trick is necessary if you want to expand the feature space with character n-grams of words, because otherwise that space is going to explode!
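For example, something like this (a toy sketch of the idea, not fastText's actual code):

```python
# Once character n-grams enter the feature space, hashing keeps the ids in a
# fixed range no matter how many distinct n-grams the corpus produces.
def char_ngrams(word, n_min=3, n_max=6):
    padded = "<" + word + ">"            # mark word boundaries
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

def ngram_buckets(word, buckets=2000000):
    # every n-gram maps to one of `buckets` ids, so the model's parameter
    # matrix keeps a fixed size
    return [hash(gram) % buckets for gram in char_ngrams(word)]

print(list(char_ngrams("where")))        # ['<wh', 'whe', 'her', ..., 'where>']
print(ngram_buckets("where", buckets=100))
```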
Iterating over documents and hashing every word on the fly is almost an order of magnitude slower. What can we do: