mkusner / wmd

Word Mover's Distance from Matthew J Kusner's paper "From Word Embeddings to Document Distances"

Current wmd implementation does not match GenSim #7

Open MSardelich opened 7 years ago

MSardelich commented 7 years ago

This is not really a bug report, but a question about compatibility with the GenSim library.

Using the first two texts from the Twitter corpus, i.e.

now all apple has to do is get swype on the iphone and it will be crack iphone that is

and

apple will be adding more carrier support to the iphone 4s just announced,

I get a distance of 0.99 using the GenSim WMD implementation and 2.6625 using this implementation (the original one, from the paper's author).
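For reference, the paper defines WMD as the optimum of the following transport problem, where $d$ and $d'$ are the normalized bag-of-words (nBOW) vectors of the two texts, $x_i$ is the embedding of word $i$, and $T_{ij}$ is the amount of word $i$ moved to word $j$:

$$
\mathrm{WMD}(d, d') \;=\; \min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij}\, \lVert x_i - x_j \rVert_2
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j .
$$

Because the cost is a Euclidean distance between embeddings, the result depends on which embeddings are used, on their scale (for example, raw versus length-normalized vectors), and on how the nBOW weights are built. Whether any of these explains the gap reported here is an assumption and has not been confirmed in this thread.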

At first I thought it was related to your stop-word list. Debugging your code, I see that after stop-word removal the first and second texts become:

apple swype iphone iphone crack

and

apple adding carrier support iphone 4s announced

However, even when I run GenSim on the filtered words above, I still get a completely different result: with your stop words removed, GenSim gives a WMD of 0.96.
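To make the preprocessing step concrete, here is a minimal sketch of stop-word filtering followed by the nBOW weighting that WMD operates on. The stop-word list and the names `STOP_WORDS`, `tokenize`, and `nbow` are hypothetical and chosen only so that the two example texts reduce to the filtered words quoted above. They are not taken from this repository or from GenSim.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption): just large enough to reduce
# the two example texts to the filtered words quoted in this comment.
# It is NOT the stop-word list shipped with this repository.
STOP_WORDS = {
    "now", "all", "has", "to", "do", "is", "get", "on", "the", "and",
    "it", "will", "be", "that", "more", "just",
}


def tokenize(text):
    """Lowercase the text, keep alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def nbow(tokens):
    """Normalized bag-of-words: each word maps to count / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


text1 = ("now all apple has to do is get swype on the iphone "
         "and it will be crack iphone that is")
text2 = ("apple will be adding more carrier support to the iphone 4s "
         "just announced,")

doc1 = tokenize(text1)
doc2 = tokenize(text2)

print(doc1)
print(doc2)
print(nbow(doc1))
print(nbow(doc2))
```

Token order may differ from the filtered line quoted above, but that does not matter for WMD, which only sees the nBOW weights. Comparing these weights (and the embedding vectors) between the two implementations is one way to narrow down where the numbers start to diverge.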

Is this compatibility discussed anywhere? Could anybody please confirm whether the different implementations return the same numbers?

This strongly affects how useful the GenSim implementation is for finding semantically close texts.

loretoparisi commented 7 years ago

@MSardelich Could you please describe how you compute the distance for each row once you have the output matrix? I generated a CSV file from the pickle output, but I cannot figure out how the data is structured. See https://github.com/mkusner/wmd/issues/4

Thanks!