piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Is this a bug in the CBOW code or my misunderstanding? #1873

Closed l11x0m7 closed 6 years ago

l11x0m7 commented 6 years ago

In gensim/gensim/models/word2vec.py, lines 394 and 401:

    if learn_vectors:
        # learn input -> hidden, here for all words in the window separately
        if is_ft:
            if not model.cbow_mean and input_word_indices:
                neu1e /= (len(input_word_indices[0]) + len(input_word_indices[1]))
            for i in input_word_indices[0]:
                context_vectors_vocab[i] += neu1e * context_locks_vocab[i]
            for i in input_word_indices[1]:
                context_vectors_ngrams[i] += neu1e * context_locks_ngrams[i]
        else:
            if not model.cbow_mean and input_word_indices:
                neu1e /= len(input_word_indices)
            for i in input_word_indices:
                context_vectors[i] += neu1e * context_locks[i]

    return neu1e

Shouldn't this be `if model.cbow_mean and input_word_indices` rather than `if not model.cbow_mean and input_word_indices`?

menshikh-iv commented 6 years ago

CC @manneshiva @jayantj @gojomo

gojomo commented 6 years ago

No bug. If you've already averaged all the vectors together (cbow_mean true mode), then the error on the (average-magnitude) averaged vector can also be applied at full magnitude to the constituent vectors. If instead you've summed them together (the rarer cbow_mean false mode), then the error on the (possibly high-magnitude) summed vector must be divided equally among the constituent vectors.
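
To make that concrete, here's a tiny numeric sketch of the same reasoning (toy vectors and made-up names, not gensim code): in mean mode the full correction fans out to every constituent and the next average moves by exactly that correction; in sum mode the correction has to be split first.

```python
import numpy as np

# Toy context vectors and a correction ("neu1e") computed against the combined input.
context = np.array([[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]])
neu1e = np.array([1.0, 0.0])
count = len(context)

# cbow_mean=1: the forward pass averaged the context vectors, so the full
# correction is applied to each constituent ...
mean_updated = context + neu1e
# ... and the next forward-prop's average has moved by exactly neu1e.
assert np.allclose(mean_updated.mean(axis=0), context.mean(axis=0) + neu1e)

# cbow_mean=0: the forward pass summed the context vectors, so the correction
# is divided among the constituents (the `neu1e /= len(...)` branch above) ...
sum_updated = context + neu1e / count
# ... and the next forward-prop's sum has likewise moved by exactly neu1e.
assert np.allclose(sum_updated.sum(axis=0), context.sum(axis=0) + neu1e)
```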

See also discussion/example at https://groups.google.com/forum/#!searchin/gensim/cbow_mean|sort:date/gensim/BtN7uB1vgpc/tlvkLXqzJwAJ and prior issue #697.

piskvorky commented 6 years ago

Worth a code comment, if it keeps tripping people up? It is rather non-obvious.

gojomo commented 6 years ago

Sure, both the Python and Cython lines where this conditional division is done could include an extra clarification that this is intentional, and even an excerpt from, or link to, these longer explanations.

l11x0m7 commented 6 years ago

Yeah, that will be great!

piskvorky commented 6 years ago

@gojomo @l11x0m7 can you open a PR for this?

thegrimreaper1992 commented 6 years ago

I also think this is wrong, and I'm not convinced by @gojomo's logic. Backprop is basically the chain rule, nothing more. So if neu1e is the derivative w.r.t. l1, and l1 is an average of some weights, then the derivative w.r.t. those weights should carry a 1/(number of words) factor, so I also think it's a bug. If you want to convince me, you have to convince me with an application of the chain rule, not the argument you mentioned, because I guess what you said is also wrong.
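
For reference, spelling out the chain rule being invoked here (notation is mine, not from the thread): with C context vectors v_1, ..., v_C, combined input h, and loss L,

```latex
% Average composition (cbow_mean = 1):
h = \frac{1}{C}\sum_{c=1}^{C} v_c
\quad\Rightarrow\quad
\frac{\partial L}{\partial v_c}
  = \frac{\partial L}{\partial h}\cdot\frac{\partial h}{\partial v_c}
  = \frac{1}{C}\,\frac{\partial L}{\partial h}

% Sum composition (cbow_mean = 0):
h = \sum_{c=1}^{C} v_c
\quad\Rightarrow\quad
\frac{\partial L}{\partial v_c} = \frac{\partial L}{\partial h}
```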

gojomo commented 6 years ago

@thegrimreaper1992 The current code works well, and matches the behavior (in both code & rough results) of the original word2vec.c implementation from the authors of the original Word2Vec paper. If neither my explanation nor that concordance convinces you, go ahead and try it your way, and see if anything improves.

thegrimreaper1992 commented 6 years ago

Thanks for your response @gojomo. Honestly, I'm mostly trying to understand whether I'm mistaken or this is just different from what is explained in the paper; it's not really about getting better results. And if this is a slight modification that they haven't mentioned, I was wondering why it makes sense.

So, in other words, maybe I'm trying to understand why the original code was implemented that way, because by applying the chain rule I'd guess it should be the other way around. Do you get my point here?

gojomo commented 6 years ago

Sure, but note it's not just simple backprop, but backprop through this algorithm-specific "input-prep operation", which (in this particular CBOW-average mode) composes the input from the average of all context-window vectors. So at this level we've essentially left the standard NN model, from which we have an error at the "input" layer. But that input layer was composed in a certain way from all the other word-vectors, and that error then needs to be meaningfully applied back to the constituent word-vectors.

I think my forum post – https://groups.google.com/forum/#!searchin/gensim/cbow_mean%7Csort:date/gensim/BtN7uB1vgpc/tlvkLXqzJwAJ – outlines why this form of error-divvying makes sense for the average-forward, fanout-backward case. If we were summing all the context-words, then yes, the correction would be divided equally over all sum-constituents. But since we're averaging, the same correction goes back to each – because on the next forward-prop it'll again be divided by the number-of-participating-words.

thegrimreaper1992 commented 6 years ago

Thanks for your response. I read your post in the Google group. There's something I don't understand about your logic there: in practice we're looking for the parameters of the model, the (input->hidden) weights, that optimize our loss function, so we have to find the optimal values for them via gradient descent. Your logic in that post (the way I understand it, and correct me if I'm wrong) is that you want the hidden layer to be optimized, and so you optimize your weights to match the gradients of the hidden layer, which doesn't make sense to me. In other words, your assumption in that post, the part where we want the hidden layer to be adjusted to 10, seems wrong to me, because we don't want that. We just want the (input->hidden) weights to be at their optimal values, and thus we should find the gradients of those weights; the change in the hidden layer can be anything, it shouldn't be 5, it might be 25. (In other words, we're not trying to adjust that thing at all, but rather the weights themselves.) Does this make any sense?

Plus, the writers haven't mentioned anything about a special form of backprop anywhere in their paper. And even more specifically, what I'm saying is backed by and explained in this paper: https://arxiv.org/pdf/1411.2738.pdf (look at equation 23 and the 1/C part in it).

I believe there might have been some reason for implementing it this way (maybe cache efficiency or something like that), but I don't understand what that is. Please tell me if I'm mistaken at all. Also, do you guys know how it might be possible to ask the original authors about this? I guess it's really confusing and needs some explanation.

Thanks

gojomo commented 6 years ago

My analogy of using a backprop-like adjustment to correct an average to "10" doesn't involve any sort of hidden layer. It's just mathematical reasoning that a correction-to-an-average of magnitude N requires a full correction-to-each-constituent-value of magnitude N – not a correction divided by the number of summands like N/count.

It's not a special form of backprop; it's the CBOW-with-average algorithm they describe, which uses an NN as part of its process, but not as its entire process. The preparation of the "input layer" is either via picking one context word-vector (skip-gram) or an average of context word-vectors (CBOW). After the NN has run its course and corrected the input layer, that correction then has to be applied to the constituent word-vectors. That's trivial in the skip-gram case (apply the whole correction to the one input word), but varies in the CBOW case. In the rarer CBOW-with-summation mode, the error needs to be divided among all input words. In the more common CBOW-with-average mode, the error can be applied in full to each word; it will still have the desired effect on the next forward-prop because of the averaging that occurs.

I trust the source code from the original word2vec authors more than any reasoning in the 3rd-party Rong "word2vec Parameter Learning Explained" paper. Rong could easily have made the same error here that's confused others.

I can't make it any more clear. Trust me that this matches the authors' original intent, works well, and (further) is more consistent in its internal operation across multiple window values. (Any other policy would mean that the magnitude of effective backprop-nudges would vary a lot with different 'window' values. You could approximate the current, 'good' way of doing it by making big, window-related changes to the learning-rate alpha, but that'd be a bigger mess.) You can of course try to contact the original word2vec authors if you'd like!

spinkney commented 3 years ago

This paper suggests that the original implementation and, subsequently, gensim's is wrong. https://arxiv.org/pdf/2012.15332.pdf

piskvorky commented 3 years ago

That is interesting, thanks for the link @spinkney ! Any appetite for trying this out in Gensim & benchmarking?

CC also @Witiko , @gojomo

I'm less interested in what is "right / wrong" and more in what works better:

> We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks while being more than three times as fast to train.

spinkney commented 3 years ago

Yeah, maybe "wrong" is not the right word, as it clearly works, just not as well.

Witiko commented 3 years ago

@piskvorky @spinkney The preprint looks interesting, but it makes some dubious claims:

> However, the CBOW negative sampling implementations in […] Gensim incorrectly update each context vector by Eq. (1), without normalizing by the number of context words.

Gensim does normalize the gradient by the number of averaged input word and n-gram vectors.

> [In Gensim,] normalization by number of context words is only done for the ‘sum variant’ of CBOW.

That's not what I see in the code: Gensim normalizes the gradient when cbow_mean=1, which is the default.
In the sum variant (cbow_mean=0), we do not normalize, because the loss function L lacks the 1 / C factor.

Witiko commented 3 years ago

@piskvorky

> Any appetite for trying this out in Gensim & benchmarking?
> […]
> I'm less interested in what is "right / wrong" and more in what works better.

I am not yet convinced by the accuracy part of the preprint: Due to the issues I discuss above, I suspect the author used the non-default cbow_mean=0, which will bias the results. However, Figure 1 confirms my observations: Gensim's CBOW virtually does not scale with the number of workers (unless something changed since 3.8.3, which I am currently using as my workhorse).


If the measurements are to be trusted, then this is a big issue with large corpora (am I done training in a week or in two months?) and one of the reasons why #2905 has been moving at a snail's pace. I have no idea what the cause could be, but since SG is unaffected, it seems that the issue is in the low-level Cython CBOW functions from the *_inner.pyx files, which narrows things down.

gojomo commented 3 years ago

I've given the paper & their code a quick look, after seeing it on HN. You can see some of my comments there.

I'm not yet convinced their analysis is correct or their change offers any net benefits. Their own benchmarks show mixed results, with small differences either way. It'd be interesting if @tmikolov had any comments on their interpretation.

I'm not sure their method of always testing Gensim with a 0.025 starting alpha, while testing theirs with a 0.075 starting alpha, is fair/reasonable. (Instead of picking one value based on 'average' performance over many tasks, as long as they were grid-searching many values, why not let Gensim-at-its-best-alpha-each-task compare vs Koan-at-its-best-alpha-each-task?)

Given that they observed that the (seldom-used & I believe even removed from word2vec.c) cbow_mean=0 mode behaves more like their preference, I'm also surprised they didn't try benchmarking that mode (which also implies a different optimal alpha) against their variant. I have a hunch that any marginal benefits on any evaluation from their approach could be replicated in the 'standard' interpretation with some combo of alternate alpha / cbow_mean=0 / etc.

Certainly, a koan_cbow option in Gensim would be easy to add, if it's useful for testing this variant or getting other benefits.

Looking at their code, it also appears they've completely eliminated the reduced_window method for stochastic weighting of neighboring words (which is used by word2vec.c, Gensim, & Facebook FastText). I'm not sure of the motivation, but it could be a contributor to their different (sometimes-better, sometimes-worse) results on various benchmarks. In addition to options like @Witiko's learned/position-variable weighting, it'd be easy to add a flag to turn off the usual reduced_windows, allowing testing of that variant independently. (Sometimes people want "all tokens in the context window with no weighting", because their underlying data source doesn't have a true ordering, and I usually just recommend something like window=1000000 in such cases, as it means that nearly always the effective window will be the full text. But this hypothetical new flag could help that situation, and others, instead.)
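
To illustrate the reduced_window mechanism being discussed (a sketch of the effect, not the Cython code): each center word draws a random shrink amount, so a neighbor at distance d within the nominal window is included with probability roughly (window - d + 1) / window, which is what gives nearer words more weight.

```python
import numpy as np

rng = np.random.default_rng(0)
window = 5
trials = 100_000

# Simulate the per-center-word window shrinking and record which
# neighbor distances survive.
included = np.zeros(window)
for _ in range(trials):
    effective = window - rng.integers(window)   # effective window in 1..window
    included[:effective] += 1                   # distances 1..effective are kept

for d, frac in enumerate(included / trials, start=1):
    print(f"distance {d}: kept {frac:.2f} of the time "
          f"(expected ~{(window - d + 1) / window:.2f})")
```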

Their use of an 'alias method' for negative-sampling looks like a separate, functionally-neutral but performance-enhancing change. It could be a nice separate optimization. I don't fully understand it yet – I think it'd use more memory than our current 'binary-search-over-list-of-split-points' approach, but remains O(1) (ignoring cache-locality effects) for arbitrarily large vocabularies. (The original word2vec.c approach of pre-filling a giant array was O(1) at a massive cost in memory that then also degraded its real performance due to cache misses, vs the binary-search that I replaced it with a long while back.)
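
For comparison, a rough sketch of both sampling strategies over a toy distribution (illustrative code only, not gensim's cum_table internals or their implementation): drawing by binary search over cumulative split points, and a generic Vose alias sampler with O(1) draws after an O(n) table build.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unigram counts, smoothed with the usual 0.75 exponent used for negative sampling.
counts = np.array([50, 20, 10, 5, 1], dtype=np.float64)
weights = counts ** 0.75

# Binary search over cumulative split points (the approach described above).
cum = np.cumsum(weights)

def draw_binary_search():
    return int(np.searchsorted(cum, rng.random() * cum[-1], side="right"))

# Alias method (Vose): O(1) draws after an O(n) build, at the cost of
# storing two extra arrays of vocabulary size.
def build_alias(w):
    p = np.asarray(w, dtype=np.float64)
    p = p / p.sum() * len(p)                     # scale so the average cell mass is 1
    prob, alias = np.ones(len(p)), np.arange(len(p))
    small = [i for i, x in enumerate(p) if x < 1.0]
    large = [i for i, x in enumerate(p) if x >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = p[s], l              # cell s keeps mass p[s]; overflow goes to l
        p[l] -= 1.0 - p[s]
        (small if p[l] < 1.0 else large).append(l)
    return prob, alias

prob, alias = build_alias(weights)

def draw_alias():
    i = int(rng.integers(len(prob)))
    return i if rng.random() < prob[i] else int(alias[i])

# Both samplers approximate the same distribution.
n = 20000
print(np.bincount([draw_binary_search() for _ in range(n)], minlength=len(weights)) / n)
print(np.bincount([draw_alias() for _ in range(n)], minlength=len(weights)) / n)
print(weights / weights.sum())
```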

(There are likely other still low-hanging-fruit performance optimizations around some of the negative-sampling, reduced-windows, & frequent-word-downsampling code - such as using a faster RNG or simply reusing a smaller number of random-draws repeatedly. EG: just permute the values {1..window} once per epoch, then cycle through them for each center-word - it's not as random, but probably good enough. Something similar might work for negative-sampling or downsampling.)
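
A toy sketch of the 'permute once per epoch' idea above (purely illustrative, with made-up names; not a proposed patch):

```python
import numpy as np

rng = np.random.default_rng(0)
window = 5
sentence = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# One shuffle per epoch instead of a fresh random draw per center word.
reductions = rng.permutation(window)            # values 0..window-1, shuffled once

for pos, center in enumerate(sentence):
    reduced = int(reductions[pos % window])     # cycle through the permutation
    effective = window - reduced                # effective window in 1..window
    context = sentence[max(0, pos - effective):pos] + sentence[pos + 1:pos + 1 + effective]
    print(center, "->", context)
```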

Their graph showing Gensim performance gains for more threads in SG but not CBOW doesn't quite match my expectations: I thought they both petered out in gains (& started getting worse) past about a dozen threads, unless using corpus_file mode. It's unclear if they're using corpus_file mode in their tests; if so, their CBOW benchmarks are a surprise, and if not, the SG results are a surprise. The stark difference suggests there might be a simple fix in the CBOW code.

gojomo commented 3 years ago

A possibly-updated (later 2021?) version of the "Corrected CBOW..." paper is at: https://aclanthology.org/2021.insights-1.1.pdf. After another quick read, I remain unconvinced that this is either a 'correction' or even a 'consistent improvement'.

Firstly, as far as I'm concerned, the Mikolov (et al) implementations in word2vec.c & later Fasttext are, definitionally, the 'correct' implementation of the algorithms as originally specified. (To the extent the papers left any details underspecified, the Mikolov code, as a contemporaneous reference release, is authoritative and superseding.) Other improvements/optimizations are of course always possible, but calling the original choice 'incorrect' or 'faulty' seems misguided.

As above & per the tables in the paper, note that their results still show Gensim performing better than their approach on a couple of their evaluations!

Their improvements on other evaluations, with the same training data, aren't that big – & (as is sadly rampant in papers) it's not crystal clear that they've given preexisting implementations (in this case Gensim) the same breadth of meta-parameter optimization as their method. Before claiming their approach is better, they should be sure to search Gensim's already-supported parameter options for values that may closely approximate the benefits of their changes (a sketch of such a sweep follows the list below), in particular:

  1. Changing the apportionment of updates across CBOW words has an effect very similar to a different alpha schedule, so the proper test is "best alpha in new implementation" vs "best alpha in reference implementation" (which might be higher than typical). Also, a finer-step-schedule of values than just [0.025, 0.05, 0.075, 0.1] could be checked – in case 0.025 is a lucky draw close to the true optimum for their method, while the optimum for the classic approach falls between the steps. Note original word2vec.c would default to a higher alpha in CBOW mode, suggesting the original experiments recognized each mode benefited from different update-scaling.
  2. Testing both the cbow_mean (average) & sum modes, as supported by Gensim, which affects the training updates being considered.
  3. Testing the new shrink_windows option (#3169) both on (the default) and off; it matches a difference between their code & Mikolov-style window treatment that's evident in their codebase (but unmentioned in their paper) and might also be partially responsible for any improved end results.
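
For concreteness, a sketch of what such a sweep might look like with Gensim's existing knobs (assuming Gensim 4.1+, where the constructor takes vector_size, epochs, cbow_mean, and the shrink_windows flag from #3169; the corpus and the evaluation step are placeholders):

```python
from itertools import product

from gensim.models import Word2Vec

corpus = [["this", "is", "a", "placeholder", "sentence"]] * 1000  # stand-in corpus

results = {}
for alpha, cbow_mean, shrink_windows in product(
        [0.025, 0.05, 0.075, 0.1],   # coarse alpha grid; a finer one may be warranted
        [1, 0],                      # average vs. sum composition
        [True, False]):              # stochastic window shrinking on/off
    model = Word2Vec(
        sentences=corpus,
        sg=0,                        # CBOW
        vector_size=100,
        window=5,
        negative=5,
        alpha=alpha,
        cbow_mean=cbow_mean,
        shrink_windows=shrink_windows,
        epochs=5,
        workers=4,
        seed=1,
    )
    # Plug in whatever intrinsic/extrinsic evaluation is being compared here.
    results[(alpha, cbow_mean, shrink_windows)] = model
```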

Their speed claims are also clouded by the inclusion of another, independent optimization of the negative-sampling which doesn't change the distribution.

As per my prior comment, it'd be easy enough to prepare a patch with an optional cbow_koan mode that narrowly changes the gradient calculation according to their main idea, then test that extensively to see where (if anywhere) it uniquely outperforms other parameter mixes.
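
For reference, the narrow change such a patch would make is roughly the following (a hypothetical sketch restating the pure-Python update quoted at the top of this issue, non-FastText path only; `cbow_koan` is a made-up flag name, not an existing Gensim option):

```python
import numpy as np

def apply_cbow_update(context_vectors, input_word_indices, neu1e,
                      cbow_mean=True, cbow_koan=False):
    """Toy restatement of the update quoted at the top of this issue, with a
    hypothetical `cbow_koan` switch: divide the error across the context
    words even in mean mode, as the paper advocates."""
    if (cbow_koan or not cbow_mean) and input_word_indices:
        neu1e = neu1e / len(input_word_indices)
    for i in input_word_indices:
        context_vectors[i] += neu1e          # lock factors omitted for brevity
    return context_vectors

# Tiny usage example with made-up numbers.
vecs = np.zeros((4, 2))
apply_cbow_update(vecs, [0, 2, 3], np.array([0.3, -0.3]), cbow_mean=True, cbow_koan=True)
print(vecs)
```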