medallia / Word2VecJava

Word2Vec Java Port
MIT License

Word2phrase #12

Open dirkgr opened 9 years ago

dirkgr commented 9 years ago

Do you have any intention of porting word2phrase as well?

wko27 commented 9 years ago

Hello!

We haven't actually used word2phrase at Medallia, but extending the current implementation doesn't look too difficult. We would just need to extend the vocabulary to include bigrams. This needs to be done in two places:

  1. Extend Word2VecTrainer.train, where we read all sentences into the vocabulary
  2. Extend NeuralNetworkTrainer.run, which needs to consider bigrams as well as unigrams
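
As a rough illustration of step 1, here is a minimal sketch of counting bigrams alongside unigrams while reading sentences. The class and method names are hypothetical and are not part of Word2VecJava; how the counts would be wired into Word2VecTrainer.train is left open.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Word2VecJava. */
final class BigramVocabularySketch {

    /**
     * Counts every token and every pair of adjacent tokens within a sentence.
     * Bigrams are keyed as "first_second" (the underscore is an assumed convention).
     */
    static Map<String, Integer> countUnigramsAndBigrams(List<List<String>> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i < sentence.size(); i++) {
                counts.merge(sentence.get(i), 1, Integer::sum);
                // Bigrams never span a sentence boundary.
                if (i + 1 < sentence.size()) {
                    counts.merge(sentence.get(i) + "_" + sentence.get(i + 1), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("new", "york", "city"),
                Arrays.asList("new", "york"));
        System.out.println(countUnigramsAndBigrams(sentences));
    }
}
```

Keeping unigrams and bigrams in one map is just the simplest option here; a real vocabulary would probably also need a minimum-count cutoff so that rare bigrams don't blow up its size.
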

dirkgr commented 9 years ago

The original word2phrase does something simpler. It just preprocesses the input, joining pairs of adjacent tokens with an underscore when the pair has a high score. The score looks vaguely like PMI. To get longer n-grams, you run it two or more times.
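
To make that concrete, here is a minimal sketch of such a preprocessing pass. The scoring formula, (count(ab) - minCount) / (count(a) * count(b)) * totalWords, is my recollection of what the original C tool computes, so treat it as an assumption; the class name, method names, and parameter values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a word2phrase-style pass; not part of Word2VecJava. */
final class Word2PhraseSketch {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final int minCount;
    private final double threshold;
    private long totalWords = 0;

    /** Collects unigram and adjacent-bigram counts from the training sentences. */
    Word2PhraseSketch(List<List<String>> sentences, int minCount, double threshold) {
        this.minCount = minCount;
        this.threshold = threshold;
        for (List<String> sentence : sentences) {
            for (int i = 0; i < sentence.size(); i++) {
                unigramCounts.merge(sentence.get(i), 1, Integer::sum);
                totalWords++;
                if (i > 0) {
                    bigramCounts.merge(sentence.get(i - 1) + "_" + sentence.get(i), 1, Integer::sum);
                }
            }
        }
    }

    /** PMI-like score (assumed formula); 0 for rare or unseen words and pairs. */
    double score(String a, String b) {
        int countA = unigramCounts.getOrDefault(a, 0);
        int countB = unigramCounts.getOrDefault(b, 0);
        int countAB = bigramCounts.getOrDefault(a + "_" + b, 0);
        if (countA < minCount || countB < minCount || countAB == 0) {
            return 0.0;
        }
        return (countAB - minCount) / ((double) countA * countB) * totalWords;
    }

    /** Joins adjacent tokens with '_' when their score exceeds the threshold. */
    List<String> mergePhrases(List<String> sentence) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sentence.size()) {
            if (i + 1 < sentence.size()
                    && score(sentence.get(i), sentence.get(i + 1)) > threshold) {
                out.add(sentence.get(i) + "_" + sentence.get(i + 1));
                i += 2; // a merged token does not start another merge in this pass
            } else {
                out.add(sentence.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("new", "york", "is", "big"),
                Arrays.asList("new", "york", "is", "old"),
                Arrays.asList("a", "big", "old", "dog"));
        // minCount and threshold are toy values chosen for this tiny corpus.
        Word2PhraseSketch pass = new Word2PhraseSketch(corpus, 1, 2.0);
        System.out.println(pass.mergePhrases(corpus.get(0)));
    }
}
```

To get trigrams and longer, you would run mergePhrases over the whole corpus, then build a second Word2PhraseSketch from the merged output and run it again, so that a token like new_york can pair with its neighbour.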