piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Topic worth investigating: 'vector rejection' #595

Open shirish93 opened 8 years ago

shirish93 commented 8 years ago

@benschmidt has written an interesting blog post on the use of a method he calls 'vector rejection' to separate words with ambiguous meanings.

While experimenting with a Nepali news corpus, I found his method more useful for discarding unwanted vectors than the existing most_similar method.

I have recreated his method (originally written in R) in this gist and have been working with it for the last few days. In my (admittedly limited) experiments it seems to be quite valuable. Yoav Goldberg also has a Twitter thread about the operation and the post here.
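For readers unfamiliar with the operation, here is a minimal sketch of vector rejection in the standard linear-algebra sense: subtract from `a` its projection onto `b`, leaving the part of `a` orthogonal to `b`. This is not the code from the gist or the blog post. The helper names `reject` and `cosine` and the toy 3-dimensional vectors are made up for illustration, and whether this matches the blog post's exact formulation is an assumption.

```python
import numpy as np


def reject(a, b):
    """Return the component of `a` orthogonal to `b` (rejection of a from b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.dot(b, b)
    if denom == 0.0:
        # Nothing to reject against a zero vector; return `a` unchanged.
        return a.copy()
    # Subtract the projection of `a` onto `b`.
    return a - (np.dot(a, b) / denom) * b


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical toy vectors, not real embeddings: "bank" mixes a
# "money" direction and a "river" direction.
bank = np.array([1.0, 1.0, 0.0])
river = np.array([0.0, 1.0, 0.0])
money = np.array([1.0, 0.0, 0.0])

# Remove the "river" sense from "bank".
bank_without_river = reject(bank, river)
```

With a trained gensim model, the resulting vector could presumably be passed to something like `similar_by_vector` to list neighbours of the disambiguated sense, though that usage is untested here.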

I bring this up in case someone wants to look it over and see whether it fits the project. Please close the issue if you think it does not.

Edit: corrected link.

piskvorky commented 8 years ago

This is very interesting, thanks for the tip, @shirish93!