piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.53k stars 4.37k forks

add 'Word Mover's Distance' implementation to gensim? #482

Closed gojomo closed 6 years ago

gojomo commented 8 years ago

"From Word Embeddings to Document Distances" (Kusner et al 2015: http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf) introduces the "Word Mover's Distance" (WMD), a novel distance-between-text-documents measure. It is an adaptation of another distance metric, "Earth Mover's Distance" (EMD), introduced(?) in "A Metric for Distributions with Applications to Image Databases" (Rubner et al 1998: http://www.cs.jhu.edu/~misha/Papers/Rubner98.pdf), that is useful for comparing images and can be calculated as a special case of much-older transportation problem optimizations. For text, the WMD leverages word2vec vectors of the documents' individual words, in a way that seems to outperform simple combinations (sum/mean) of those word vectors.

There's a blog post from OpenTable with some impressive examples of "similar sentences" using WMD on restaurant reviews: http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/

The paper reports strong kNN classifier results on document classification, and intuitively, the method seems amenable to the selection (or even synthesis) of 'canonical' or 'borderline/contrastive' text segments, perhaps to assist iterative classification tasks.

The author's code is available as a ZIP bundle (http://matthewkusner.com/#page2) – but its python wrappers depend on some older C EMD code of unclear licensing status. There's a bunch of other EMD code around for other projects (esp. image similarity measures) that also might be a good starting point.

An implementation (perhaps in optimized cython) for gensim might be widely useful.

(Further out: it might be interesting to use WMD to induce a document-embedding. That is, train doc embeddings with random draws of 3 documents, nudging the embeddings so that the relative which-two-are-closest is the same in the embedding space as in the WMD metric.)

piskvorky commented 8 years ago

CC @tmylk: backlog of open source tasks.

tmylk commented 8 years ago

Added to https://github.com/piskvorky/gensim/wiki/Ideas-&-Feature-proposals

piskvorky commented 8 years ago

@olavurmortensen has taken up this issue now and is working on the integration of WMD into gensim.

Regarding the "unclear license": let's clarify with @mkusner. We definitely don't want to include anything without a license / LGPL non-compatible.

gojomo commented 8 years ago

This may be a preferable C codebase to build on – http://www.ariel.ac.il/sites/ofirpele/FastEMD/code/ – as the authors of it and the related paper make a case for it (and their related "EMD-hat" metric) being faster/better than earlier methods, and it has a clear BSD license.

There's also an existing python wrapper for it: https://github.com/wmayner/pyemd

That pyemd project is missing interfaces to some of the interesting capabilities of the underlying code, but they might be pretty easy to add:

olavurmortensen commented 8 years ago

pyemd is able to handle sequences of differing lengths, which mkusner's version is not. At this point, though, I'm guessing what the correct inputs to pyemd's emd function are, because I can't find an explanation.

I found an odd thing in both implementations. If I add the same word to both sentences, the distance decreases; for example, "two three" and "four five" are more distant than "two three one" and "four five one". From my understanding of EMD, this does not make sense: the distance from "one" to "one" is zero, and thus should add nothing to the EMD.

I'm working on this here at the moment.

mkusner commented 8 years ago

Hey all! I would absolutely love to include WMD in gensim. It's such a great package!! In terms of an implementation, Vlad Niculae (https://github.com/vene) at Cornell has made a very awesome scikit_learn-friendly version of the Word Mover's Distance that is much more robust and well-documented than my original code release. He just released a blog post with this code here: http://vene.ro/blog/word-movers-distance-in-python.html Maybe @olavurmortensen can merge any code with this that isn't overlapping? It's very exciting to have all of these implementations!

As to the comment @olavurmortensen raised about the distance decreasing when "one" is added to both documents: this is because WMD currently normalizes the word counts by the total number of words in the document. Thus, you are right that the distance between "one" and "one" is zero, but the distance between the two documents without "one" is: D = 0.5*d("two") + 0.5*d("three")

(where d("two") is the Euclidean distance between the vector representing "two" and either the vector for "four" or "five", depending on the optimal EMD flow, and likewise for d("three")). And the distance with "one" is:

D = 0.33*d("two") + 0.33*d("three") + 0.33*0

So the distance shrinks. For us it made sense to think of document counts as a distribution over words and the WMD as the cost of morphing the distribution of one document into another. But if this isn't what you're looking for then you can remove this normalization and you'll get the behavior you're looking for. Let me know if you have any more questions about this.
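That arithmetic can be checked directly. The per-word move costs below are placeholders (set to 1.0; the real values would come from word-vector distances) just to show the effect of the normalization:

```python
# Hypothetical word-to-word move costs; the exact values don't matter,
# only that moving "one" onto "one" is free.
d_two, d_three = 1.0, 1.0

# Without the shared word: each word carries weight 1/2.
without_one = 0.5 * d_two + 0.5 * d_three

# With "one" added to both documents: weights drop to 1/3 each,
# and "one" -> "one" contributes 1/3 * 0 = 0.
with_one = (1 / 3) * d_two + (1 / 3) * d_three + (1 / 3) * 0.0

assert with_one < without_one  # normalization shrinks the distance
```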

Let us know about what would be next best to do to work WMD code into gensim!

gojomo commented 8 years ago

I see @vene's work also uses @wmayner's pyemd wrapper of Pele & Werman's code.

The treatment where adding 'one' to both distributions results in a smaller distance better fits my intuitive understanding of what we'd usually like WMD to do for text similarity. Two sentences that share more words should be closer: their meanings have greater overlap.

The reason I'd like to see pyemd (and gensim) offer an interface into the handling of unmatched 'mass' and the net 'flows' is that I have a hunch it could offer some interesting possibilities for further characterizing the difference between texts. (For example, the unmatched mass, or the flows between the most distant word pairs, may be useful as a more-than-just-scalar indication of which topics are unique to each side of a comparison.)
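As a concrete picture of what a flow-reporting interface could return, here is a pure-Python toy for the 1-D special case (ground distance |i − j|, equal total mass), where matching mass in index order is provably optimal for this convex cost. It is only an illustration of the flow matrix's shape, not pyemd's API:

```python
def emd_1d_with_flow(p, q):
    """EMD between two equal-mass 1-D histograms with ground distance
    |i - j|, plus the optimal flow matrix flow[i][j] = mass moved from
    bin i of p to bin j of q.  In 1-D, matching mass greedily in index
    order is optimal for this cost."""
    n = len(p)
    flow = [[0.0] * n for _ in range(n)]
    p, q = list(p), list(q)          # work on copies
    i = j = 0
    cost = 0.0
    while i < n and j < n:
        if p[i] == 0:
            i += 1
            continue
        if q[j] == 0:
            j += 1
            continue
        m = min(p[i], q[j])          # move as much mass as both bins allow
        flow[i][j] += m
        cost += m * abs(i - j)
        p[i] -= m
        q[j] -= m
    return cost, flow

cost, flow = emd_1d_with_flow([1.0, 0.0], [0.0, 1.0])
# all mass moves from bin 0 to bin 1, a distance of 1
```

The returned `flow` is exactly the "more-than-just-scalar" object discussed above: it records which source mass ended up where, not just the total cost.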

piskvorky commented 8 years ago

Great to have a working reference implementation! Thanks so much for all the info and tips @mkusner :)

@vene are you OK with using parts of your code for gensim? I haven't checked your blog post in detail yet (in a minute!), but I'm thinking some parts may be directly transferable.

I'm not excited about the duplication of effort obviously, but WMD seems like a very useful functionality. We definitely want this in "core gensim", rather than relying on external code / external lib release cycles or APIs.

vene commented 8 years ago

Sure, I'd be happy if my code could help, and I could also lend a hand maybe.

@gojomo I really like your thoughts on extracting the actual flows/matchings! WMD, as it is, is great for KNN classification, but I tried using it for feature extraction between pairs of texts, and I didn't get results as exciting as it might have been with access to the flow.

About the penalty for unmatched mass, I'm not sure what the intuition is; I'd love to hear your thoughts!

This is all really exciting and it would be great to collaborate on this!

olavurmortensen commented 8 years ago

Here's a plan that I worked out with @tmylk:

I will be drawing inspiration from @vene's blog during the process.

wmayner commented 8 years ago

Hey folks,

Really cool to see that pyemd might be useful for your project! Because of the interest, I finally got around to merging the PR that @vbraun submitted to expose the extra_mass_penalty argument. I just pushed v0.2.0 with that addition.

Getting the flows from the algorithm is something others have expressed interest in as well. Unfortunately I'm super busy at the moment and I won't be able to work on that anytime soon, but it is possible to pass a NumPy array to a C++ function via Cython so that the C++ function can then fill it with data—see, for example, this StackOverflow question—so implementing it shouldn't be too difficult. I'd be happy to help with a PR if someone wants to give it a shot.

@olavurmortensen, check the docstring for the emd function—it tells you what the correct input is (you can use help(emd), or if you're using ipython, you can do emd?).

Cheers, Will

piskvorky commented 8 years ago

Thanks @wmayner !

What are your future plans regarding pyemd? @tmylk is considering subsuming (some of) the pyemd code inside gensim. That would make release cycles, custom adjustments, and maintenance easier for us, but adds the extra burden of syncing potential changes from your upstream repo. So I'm not sure.

olavurmortensen commented 8 years ago

I'm working on this on a branch of my own fork of Gensim, and have submitted a PR.

wmayner commented 8 years ago

@piskvorky, it's unlikely that our use of the EMD will change significantly in the foreseeable future, so right now I'm not planning on doing much more work on pyemd. I'd say forking it and merging with your codebase so you have more control makes more sense, since there won't be many upstream changes to incorporate, if any.

Cheers, Will

ddofer commented 8 years ago

Hi, I heard yesterday from Gensim's community manager in Israel that you guys have added an implementation of WMD (atop word vectors) in Gensim. From what I see here, it seems it's not actually implemented in gensim itself?

olavurmortensen commented 8 years ago

@ddofer This is partly true. Gensim prepares the text data to be input to a solver in the pyemd library. This solver computes the EMD (which inspired WMD), essentially solving an optimization problem known as the "transportation problem".
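A rough sketch of that division of labour (the helper below is illustrative, not gensim's actual internals): gensim builds normalized bag-of-words histograms over the two documents' joint vocabulary, plus a matrix of word-vector distances, and pyemd's solver does the rest.

```python
from collections import Counter

def nbow(doc, vocab):
    """Normalized bag-of-words: word counts divided by document length,
    laid out over a shared vocabulary so both histograms align."""
    counts = Counter(doc)
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

doc1 = ["two", "three"]
doc2 = ["four", "five", "three"]
vocab = sorted(set(doc1) | set(doc2))   # shared vocabulary of both docs

h1 = nbow(doc1, vocab)
h2 = nbow(doc2, vocab)
# Each histogram sums to 1.  To actually call the solver, one would also
# build a |vocab| x |vocab| matrix of Euclidean distances between word
# vectors (omitted here) and pass (h1, h2, distance_matrix) to pyemd.
```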

To my knowledge there is no plan to implement an EMD solver directly in Gensim.

Did this answer your question @ddofer?

gojomo commented 8 years ago

Noticed a small error in WMD_tutorial.ipynb: it currently suggests about the GoogleNews vectors: "It just so happens that the vectors we have downloaded are already normalized, so it won't do any difference in this case." In fact they're not natively normed – only a similarity-operation or explicit init_sims() creates the normed version.

tmylk commented 7 years ago

@RishabGoel Could you please update the ipynb tutorial with this correction?

RishabGoel commented 7 years ago

@tmylk Sure!

piskvorky commented 7 years ago

@RishabGoel @tmylk any progress?

tmylk commented 7 years ago

By the way, @gojomo the flows are available in PyEMD 0.4.1 (Reply to https://github.com/RaRe-Technologies/gensim/issues/482#issuecomment-154752979)

wmayner commented 7 years ago

@tmylk, @gojomo, @piskvorky, I just updated the requirements and pushed a new version—PyEMD 0.4.2 now works with Python 3.3 and NumPy 1.9, so you should be able to get the flows in gensim now 👍

tmylk commented 7 years ago

@wmayner Thanks. Confirm it works. Version unpinned. Looking forward to using the flows.

menshikh-iv commented 6 years ago

Implemented by @olavurmortensen in #521