arashsa opened 7 years ago

I am testing the fastText wrapper. Then I am trying the Word Mover's Distance function. In comparison with doc2vec, it is very slow. Is there a way to make this function faster?
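A minimal sketch of this kind of comparison (model paths and the two test sentences are assumptions, not from the original report; assumes native gensim models saved with `.save()`):

```python
# Hypothetical minimal comparison of WMD timing across vector sources.
import time

from gensim.models import FastText, Word2Vec

ft = FastText.load("my_fasttext.model")      # assumed path
w2v = Word2Vec.load("my_word2vec.model")     # assumed path

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

for name, model in (("fasttext", ft), ("word2vec", w2v)):
    start = time.time()
    d = model.wv.wmdistance(doc1, doc2)
    print("%s: wmd=%.4f in %.3fs" % (name, d, time.time() - start))
```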
Word Mover's Distance inherently requires a lot of computation. There might still be possible optimizations in the library gensim uses (PyEMD) or what gensim does – but there are no easy switches to throw for extra speed.
My only suggestions are to ensure you're caching reusable values in your application, and to figure out ways to only calculate pairwise WMDistances among likely candidates – for example with a cheap prefilter like the sketch below.
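A minimal prefilter sketch, assuming tokenized documents and any gensim `KeyedVectors`; names like `top_candidates` are hypothetical, not gensim API. Rank candidates by mean-vector cosine similarity first, then run full WMD only on the survivors:

```python
import numpy as np

def top_candidates(query_tokens, docs_tokens, kv, k=10):
    """Cheap shortlist: cosine similarity of mean word vectors."""
    def unit_mean(tokens):
        vecs = [kv[t] for t in tokens if t in kv]  # skip OOV tokens
        v = np.mean(vecs, axis=0)
        return v / np.linalg.norm(v)

    q = unit_mean(query_tokens)
    sims = [(i, float(np.dot(q, unit_mean(tokens))))
            for i, tokens in enumerate(docs_tokens)]
    sims.sort(key=lambda pair: -pair[1])
    return sims[:k]

# Full, expensive WMD only on the shortlist:
# for i, _ in top_candidates(query, corpus, model.wv):
#     distance = model.wv.wmdistance(query, corpus[i])
```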
@gojomo
Yes, I am aware of the computational complexity. What I don't understand is why it is slower for fasttext than for word2vec. I assume it might have something to do with fasttext's n-grams, and that the word mover algorithm is computed over each n-gram. However, that alone shouldn't account for the overall slowdown compared to the word2vec class. These are only assumptions.
Do both classes share the same underlying implementation?
How much slower are you seeing with FastText vs non-FastText word-vector sources?
Yes, native FastText word-vectors need to be composed from the full-word-token and all subgrams – so each access will be a bit slower. Also, looking at the gensim WMD code (https://github.com/RaRe-Technologies/gensim/blob/7127eeef8304852da589465b338e2bba14e881f1/gensim/models/keyedvectors.py#L414), I see that each word-vector is re-accessed many times during calculation – if that's a notable drag on FastText models, it could be optimized to use a local cache of the vectors.
If you try this optimization, please share the change & let us know what speedup you see.
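An untested sketch of that optimization, assuming pyemd is installed (gensim's `wmdistance` already requires it) and that every token is representable by the model; `cached_wmdistance` is a hypothetical helper, not gensim API. Each vector is composed from its n-grams exactly once, and all further accesses hit a plain dict:

```python
from collections import Counter

import numpy as np
from pyemd import emd

def cached_wmdistance(tokens1, tokens2, kv):
    """WMD where each (n-gram-composed) FastText vector is fetched once."""
    vocab = sorted(set(tokens1) | set(tokens2))
    idx = {w: i for i, w in enumerate(vocab)}
    cache = {w: np.asarray(kv[w], dtype=np.float64) for w in vocab}

    # Pairwise euclidean distances over the cached vectors
    n = len(vocab)
    distance_matrix = np.zeros((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(cache[vocab[i]] - cache[vocab[j]])
            distance_matrix[i, j] = distance_matrix[j, i] = d

    def nbow(tokens):
        weights = np.zeros(n, dtype=np.float64)
        for w, c in Counter(tokens).items():
            weights[idx[w]] = c
        return weights / weights.sum()

    return emd(nbow(tokens1), nbow(tokens2), distance_matrix)
```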
@gojomo Unfortunately I did not time the functions, but the slowdown was substantial. I will time them later and report back.
Will examine the code and see if a speedup is possible.
RWMD (Relaxed WMD) should be much faster than WMD while giving similar results; for example, it is used in the R text2vec package. Previous attempts at RWMD in gensim have not been performant, but it's on the roadmap.
Yes, I think this is unrelated to fasttext. WMD is slow no matter the embedding algorithm. There's an open PR about using RWMD and WCD (Word Centroid Distance) as cheaper bounds, but it looks like it has stalled: https://github.com/RaRe-Technologies/gensim/pull/800.
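For context, the relaxed bound itself is only a few lines; a sketch under the assumption that `vecs1`/`vecs2` are the two documents' word-vector matrices and `w1`/`w2` their normalized nBOW weights (all names hypothetical):

```python
import numpy as np

def rwmd(vecs1, w1, vecs2, w2):
    """Relaxed WMD: lower-bounds true WMD by letting each word move
    entirely to its nearest counterpart, skipping the transport solver."""
    dist = np.linalg.norm(vecs1[:, None, :] - vecs2[None, :, :], axis=-1)
    lower1 = np.dot(w1, dist.min(axis=1))  # doc1 -> doc2 relaxation
    lower2 = np.dot(w2, dist.min(axis=0))  # doc2 -> doc1 relaxation
    return max(lower1, lower2)

# WCD is cheaper still: the distance between weighted centroids,
# np.linalg.norm(w1 @ vecs1 - w2 @ vecs2)
```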
Confirmed, that is the current status. Closing as resolved.
As implied by the original reporter, and noted in my comment above, the access pattern used by gensim's WMD code is particularly costly for FastText vectors, which need to be repeatedly re-composed from character n-grams.
While the reporter hasn't quantified the magnitude of the FastText-specific slowdown, measuring it shouldn't be hard, and the fix is easy. (This would be a good test & fix for a student or first-time contributor.)
If the speed of WMD is a big issue, then you can try this out: https://github.com/src-d/wmd-relax. I didn't compare the speed results myself, but would be interested if someone does for general settings.
Also, you can try soft-cosine similarity; it is a pretty fast and nice alternative to WMD – see the benchmark results. A sketch follows below.
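A sketch of the soft-cosine route, assuming gensim 3.x (where `KeyedVectors.similarity_matrix` and `gensim.matutils.softcossim` exist) and an already-trained `model`:

```python
from gensim.corpora import Dictionary
from gensim.matutils import softcossim

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

dictionary = Dictionary([doc1, doc2])
# Sparse term-similarity matrix derived from the word vectors
similarity_matrix = model.wv.similarity_matrix(dictionary)

sim = softcossim(dictionary.doc2bow(doc1), dictionary.doc2bow(doc2),
                 similarity_matrix)
print(sim)
```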
@Bachstelze Thanks for the pointer to wmd-relax!
Do you happen to know if it has options to return the "flows" chosen to minimize the WMD value, or "remainder" (if ever doing a partial-match of unequal 'weights')? I have a hunch WMD might work well to check when one text conveys a superset of another (shorter) text's meaning – and then also find third texts that best add exactly what's missing to the shorter text.