piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Ability to weight context words by distance from target word for `*2vec` models #2114

Open zkurtz opened 6 years ago

zkurtz commented 6 years ago

The window parameter in word2vec controls how far apart two words can be and still directly influence each other's resulting embedding. The current setup is binary: a given word either is in another word's window or it isn't. By analogy to kernel regression, word2vec uses a uniform (or boxcar) kernel to predict target words from context words. So, again by analogy to kernel regression, could we allow the user to specify a kernel other than the uniform one?

For example, when computing the loss, I'd like to be able to assign a weight to each context word, something like w = exp(-beta*k), where k is how far the context word is from the target word and beta is nonnegative. Allowing the user to select beta would be a start. Alternatively, they could directly provide their own function of k.
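To make the idea concrete, here is a minimal sketch of such a kernel (the function name and the default beta are purely illustrative, not part of any existing gensim API):

```python
import numpy as np

def exponential_kernel(k, beta=0.5):
    """Weight for a context word at positional distance k from the target.

    beta = 0 recovers the current uniform (boxcar) behavior inside the window;
    larger beta discounts distant context words more aggressively.
    """
    return np.exp(-beta * np.abs(k))

# Weights for context positions 1..5 with the default beta:
print(exponential_kernel(np.arange(1, 6)))
# -> [0.6065 0.3679 0.2231 0.1353 0.0821] (approximately)
```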

I've googled a bit but haven't found anything related to this proposal for word2vec, yet it seems common-sense enough that surely someone has tried it?

piskvorky commented 6 years ago

What I've seen is people using completely arbitrary distances between words, representing e.g. temporal gaps between click events, or physical distances between objects (~"words").

If we add weighting, I'd prefer it to be something really flexible, not just one fixed function family like exp(-beta*k) or the current linear random-context-shortening.
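For instance, one way to keep it fully flexible would be to accept an arbitrary callable of the distance and precompute the weights once up front, so no user Python runs inside the hot training loop. A sketch only; make_window_weights is a hypothetical helper, not an existing gensim function:

```python
def make_window_weights(weight_fn, window):
    """Precompute weights for positional distances 1..window once, up front,
    so the training loop only does a cheap array lookup."""
    return [weight_fn(k) for k in range(1, window + 1)]

# Any function of distance works: exponential, harmonic, step, custom, ...
harmonic = make_window_weights(lambda k: 1.0 / k, window=5)
# -> [1.0, 0.5, 0.333..., 0.25, 0.2]
```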

Doing this in a way that's both easy to use and doesn't introduce performance regressions is probably non-trivial.

zkurtz commented 6 years ago

I agree that exp(-beta*k) is restrictive. Another way is to create a window_weights parameter that is None by default (preserving current behavior and performance). A valid non-null value for window_weights is any list of length 2*window consisting of numeric values. This puts some burden on the user to know what they are doing when defining weights, but it is totally flexible, including asymmetric weighting for words-after-target versus words-before-target.
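Concretely, a hypothetical window_weights value for window=2 might look like this (the parameter does not exist in gensim; it only illustrates the proposed interface):

```python
window = 2

# Hypothetical weights for relative positions [-2, -1, +1, +2] around the target.
# None (the default) would preserve the current behavior and performance.
window_weights = [0.5, 1.0, 0.8, 0.4]  # asymmetric: pre-target words weighted higher

# The only validation needed: a numeric list of length 2*window.
assert len(window_weights) == 2 * window
```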

Then the question is how much overhead this adds. Presumably all word pairs used for training already have their positional distance pre-computed (to identify which pairs are not negative samples), which is all that's needed to look up the relevant weight. Then the weight must be multiplied into each context input (in the CBOW sum) or into each output loss (for skip-gram). I don't understand the codebase well enough to see where exactly this would occur.
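In toy numpy form, the CBOW side of the change would be roughly the following (purely illustrative, not the actual gensim internals; for skip-gram, the same weight would instead scale each pair's loss):

```python
import numpy as np

def weighted_cbow_input(context_vectors, distances, weights):
    """Combine context word vectors into the CBOW input, scaling each vector
    by the weight associated with its positional distance from the target."""
    w = np.array([weights[d - 1] for d in distances], dtype=np.float64)
    # Weighted mean instead of the plain mean of the context vectors.
    return (w[:, None] * context_vectors).sum(axis=0) / w.sum()

# Toy example: three context words of dimension 4, at distances 1, 2, 3.
vectors = np.random.rand(3, 4)
print(weighted_cbow_input(vectors, distances=[1, 2, 3], weights=[1.0, 0.5, 0.25]))
```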

I think I understand the distances you mentioned, but I'm not sure how people would use such distances. Is this done with gensim?

gojomo commented 6 years ago

FYI, the existing Word2Vec/Doc2Vec doesn't treat all distances equally. Rather, it follows the original word2vec.c implementation: window is treated as a maximum distance, and for each training pass focused on a particular word, a random effective window value from 1 to window is chosen. Thus, immediate neighbors are always part of the training, but more-distant words sometimes are and sometimes aren't, approximating a sort of distance weighting with a minimum of calculations.
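A quick way to see the effective weighting this produces: if the effective window is drawn uniformly from 1 to window, a word at distance k is included with probability (window - k + 1) / window, i.e. a linearly decaying weight in expectation. A small simulation (plain numpy here purely for illustration; the real logic lives in the Cython training routines):

```python
import numpy as np

window = 5
trials = 100_000

# For each "training pass", draw an effective window uniformly from 1..window,
# then check whether a word at distance k falls inside it.
effective = np.random.randint(1, window + 1, size=trials)
for k in range(1, window + 1):
    print(k, round((effective >= k).mean(), 3))  # ~ (window - k + 1) / window
```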

zkurtz commented 6 years ago

Good to know! Maybe this is worth adding to the standard documentation. Is this the relevant snippet?

gojomo commented 6 years ago

Yes, and the other places where reduced_windows appears (such as in doc2vec_inner.pyx). (I believe the analogous effective-window variable was just b in the original word2vec.c code.)