piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Implement position-dependent weighting to fastText #2840

Open Witiko opened 4 years ago

Witiko commented 4 years ago

In Section 2.2 of the 2017 “Advances” paper by Mikolov et al., position-dependent weighting is introduced into the context vector computation of the fastText CBOW model. On Common Crawl, this extension led to a 5-point improvement in total accuracy on the English word analogy task.

Despite its usefulness, the position-dependent weighting extension has not been implemented in Facebook's code base yet. I propose adding this extension to Gensim's fastText, which would make it the first publicly available implementation of the extension.

Table 4, row CBOW of the 2018 “Learning” paper by Grave et al. shows the word analogy task results of CBOW fastText models for 10 languages trained on Wikipedia:

Cs De Es Fi Fr Hi It Pl Pt Zh
63.9 71.7 64.4 42.8 71.6 14.1 66.2 56.0 60.6 51.5

The CBOW fastText models use the position-dependent weighting extension and the default parameters described in Section 4.3 of the 2017 “Enriching” paper by Bojanowski et al.: hash table bucket size 2·10^6, 5 epochs, 300 vector dimensions, negative sampling loss with 5 negative samples, initial learning rate 0.05, subsampling threshold 10^-4, and window size 5. The correctness of our implementation can be checked by comparing against the results of Grave et al. above.
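For illustration, training such a baseline (without the extension) through Gensim's FastText API might look roughly as follows; parameter names are as in the current (post-refactor) API and the corpus path is a placeholder:

from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Placeholder path to a tokenized Wikipedia dump, one sentence per line.
corpus = LineSentence("wiki.cs.tokenized.txt")

# Defaults from Section 4.3 of Bojanowski et al. (2017); no position-dependent
# weighting yet -- that is the extension proposed in this issue.
model = FastText(
    corpus,
    vector_size=300,     # 300-dimensional vectors
    window=5,            # context window size c = 5
    sg=0,                # CBOW architecture
    negative=5,          # negative sampling with 5 negative samples
    sample=1e-4,         # subsampling threshold 10^-4
    alpha=0.05,          # initial learning rate
    epochs=5,
    bucket=2_000_000,    # hash table bucket size 2 * 10^6
    min_n=3, max_n=6,    # character n-gram lengths
)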

piskvorky commented 4 years ago

Sure, why not. The motivation is clear, and it shouldn't disrupt users.

I haven't looked into these papers (or perhaps forgot already) but what is the expected impact on performance (speed) and code complexity? Why wouldn't you want to use this weighting, as a user?

Witiko commented 4 years ago

What is the expected impact on performance (speed) and code complexity?

For every position in the context window ([−c, ..., −1, 1, ..., c]), you need to maintain a D-dimensional position vector, i.e. a 2c × D weight matrix. The weight matrix needs to be stored with the model. In each forward pass of the CBOW model, we reweight the context word vectors with the position vectors; in each backward pass, we also compute and apply the gradient to the position vectors. Since 2c is small (the default is 10), the impact on speed and storage size should be negligible.
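As a rough NumPy sketch of that computation (array names are illustrative, not Gensim's; the real code would live in the Cython CBOW kernel):

import numpy as np

D, c = 300, 5                              # vector dimensionality, window size
rng = np.random.default_rng(0)

positions = rng.normal(size=(2 * c, D))    # one D-dimensional vector per position
context = rng.normal(size=(2 * c, D))      # context word vectors around the target

# Forward pass: each context vector is reweighted element-wise (Hadamard product)
# by the vector of its position before the vectors are averaged into the hidden layer.
hidden = (positions * context).mean(axis=0)

# Backward pass (sketch): given the gradient g flowing back into the hidden layer,
# both the context vectors and the position vectors receive a gradient.
g = rng.normal(size=D)
grad_context = positions * g / (2 * c)     # gradient w.r.t. each context word vector
grad_positions = context * g / (2 * c)     # gradient w.r.t. each position vector

The per-context-word overhead is essentially one extra element-wise multiplication in the forward pass, plus the gradient update for the position vectors in the backward pass.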

Why wouldn't you want to use this weighting, as a user?

According to the papers, the weighting should be a net benefit.

piskvorky commented 4 years ago

There are some tight loops in the optimized code, so "negligible" will have to be carefully measured.

Otherwise I'm +1; @mpenkov @gojomo WDYT?

@Witiko are you volunteering to implement this optimization? :)

Witiko commented 4 years ago

I am. Hopefully, I can get to it by the end of May.

gojomo commented 4 years ago

If it's truly a benefit for many/all uses, and optional/negligible-cost for others, it seems a valuable option.

I'm always a bit wary of papers' claimed marginal advantages: they often cherry-pick some set of configs/evals, & ignore other tradeoffs or possible-avenues to the same improvement, to make their novelty seem stronger. (So for example: better on analogies doesn't always mean better for other IR/classification tasks. Or sometimes a "5 point improvement" glosses over any extra time/memory required for the technique, or the fact that some other existing meta-parameter optimization, if you're willing to spend more time/memory, can get the same improvement.)

So I would hope any contribution would include some demos of its advantages, and proof of its negligible effect when unused.

If I understand correctly, the positional weights are learned along with the vectors. Do they vary per target word? And, to the extent this replaces the crude reduced_window approach (dating back to word2vec), which served a dual purpose of simulating weighting and proportionately eliding further-words as an optimization, I'd be surprised if it didn't have some noticeable drag on runtime.
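(For reference, the reduced_window trick amounts to roughly the following sketch, with illustrative variable names: a per-target random shrink of the window, so nearer context words are used more often than distant ones.)

import random

window = 5
sentence = "the quick brown fox jumps over the lazy dog".split()

for pos, word in enumerate(sentence):
    # Draw a per-target "reduced window"; the effective window size is 1..window.
    reduced = random.randint(0, window - 1)
    effective = window - reduced
    start = max(0, pos - effective)
    context = [w for i, w in enumerate(sentence[start:pos + effective + 1])
               if i + start != pos]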

Note that the FT classes have gotten a major reworking (and a lot of head-scratching-nonsense has been removed) in my work-in-progress big KeyedVectors refactor (PR #2698), which is almost stable enough for wider review. (But, at the moment, will still get temp commits, rebases & force-pushes.) So you'd likely want to base any changes on that, rather than develop, until #2698 lands in develop.

And: doing a bunch of cleanup/maintenance has left a bad taste in my mouth for one-off contributions of narrow utility, with varied code styles, and crude if feature_enabled control-flow with lots of slightly-different cut&pasted code. When other fixes/features/refactorings become necessary, such contributions become a giant headache to maintain, especially if the original contributor isn't still involved. So personally, my desired standards for end-user utility and code-quality/maintenance-debt are a bit higher for new functionality than in the past.

Witiko commented 4 years ago

@gojomo

So I would hope any contribution would include some demos of its advantages, and proof of its negligible effect when unused.

My plan is to measure speed and word analogy task accuracy for all 10 languages four times:

  1. with the develop branch (before #2698 has been merged),
  2. with the develop branch (after #2698 has been merged),
  3. with position-dependent weighting implemented on top of develop and switched off, and
  4. with position-dependent weighting implemented on top of develop and switched on.

The expected scenario is that the first three tests yield the same results, with insignificant differences in both accuracy and speed (otherwise, either my implementation or #2698 is incorrect), and that the fourth test reproduces the accuracies of Grave et al. (above) with an insignificant speed loss.
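A sketch of one such measurement, with placeholder file names and the bundled English analogy set standing in for the per-language analogy sets:

import time
from gensim.models import FastText
from gensim.test.utils import datapath

# Placeholder corpus file; each of the four configurations above would be timed like this.
start = time.perf_counter()
model = FastText(corpus_file="wiki.en.tokenized.txt", vector_size=300, window=5,
                 sg=0, negative=5, sample=1e-4, alpha=0.05, epochs=5, bucket=2_000_000)
training_seconds = time.perf_counter() - start

# Overall analogy accuracy plus a per-section breakdown; questions-words.txt is the
# bundled English set, standing in here for the per-language sets of Grave et al.
accuracy, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"trained in {training_seconds:.0f} s, analogy accuracy {accuracy:.1%}")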

If I understand correctly, the positional weights are learned along with the vectors.
Do they vary per target word?

They do not; they vary only with position (see Mnih et al. [2013, Section 3] and Mikolov et al. [2017, Section 2.2]).

And: doing a bunch of cleanup/maintenance has left a bad taste in my mouth for one-off contributions of narrow utility, with varied code styles, and crude if feature_enabled control-flow with lots of slightly-different cut&pasted code.

My plan is to add a new method fasttext_fast_sentence_cbow_pdw_neg to fasttext_inner.pxd and fasttext_inner.pyx, which will implement the new functionality without touching the current code. The positional weights could perhaps be stored at the tail of syn0_vocab for now, since they have the same dimensionality as the word vectors. The feature would be enabled through a new keyword argument pdw={0,1} of the FastText class constructor, as sketched below. What do you think?
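From the user's perspective, enabling the feature would then look roughly like this (pdw is the proposed, not yet existing, keyword argument):

from gensim.models import FastText

# `pdw` is the keyword argument proposed above; it does not exist in Gensim yet.
model = FastText(corpus_file="wiki.en.tokenized.txt",
                 vector_size=300, window=5, sg=0, negative=5,
                 pdw=1)  # 1 = train with position-dependent weighting, 0 = current behaviour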

gojomo commented 4 years ago

That's a pretty good & reassuring plan. Unless you've already got an implementation, be aware there's enough refactoring in #2698 that doing it on both develop and #2698 may require a lot of little tweaks. But to the extent it's possible for you, the extra verification of similar/improved quality/performance across all changes, both the position-specific weighting & the #2698 refactoring, would be much appreciated!

Adding a separate fasttext_fast_sentence_cbow_pwd_neg path is nice in some respects for isolation of changes, but to the extent such a method might be mostly repeated setup/shared boilerplate, it can also increase maintenance issues. So please keep an eye out for whether there could be a clean/minimal way to diverge at an even more-focused couple of places inside the current cbow path, even though that isn't strictly required. (You've only mentioned the _neg mode; I presume the optimization is possible/beneficial in the _hs mode as well?)

Appending the positional weights to the same array as the full-word vectors might work, but might add complications to the extent those are shared with the lookup-centric (and independently usable/savable/mutable) [Fasttext]KeyedVectors object: appearing in iterations over the vectors, not being truly analogous in all per-vector metadata (like word counts), needing to be handled specially when that structure grows. So beware of complications there – perhaps storing separately, or making them full pseudo-words with actual synthetic keys, would help. (As you've likely seen, gensim doesn't store both full-word vectors and ngram-vectors in one contiguous shared array.)
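(For instance – names purely illustrative – the "store separately" option could be a small dedicated array kept on the model itself, saved and loaded with it but never exposed through the KeyedVectors lookups:)

import numpy as np
from gensim.models import FastText

model = FastText(vector_size=300, window=5)   # untrained model, just for illustration

# Illustrative attribute name only: one row per context position, initialised to ones
# so that switching the feature off reduces to the current unweighted behaviour.
model.vectors_positions = np.ones((2 * model.window, model.vector_size), dtype=np.float32)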

(Also: do you mean pdw to match the 'position-dependent weighting' term in the lit, rather than pwd?)

Thinking more generally about the range of possibilities suggested by this optimization – & so, more 'blue sky' thoughts than any sort of direct constraints on your contribution – it strikes me that maybe gensim ultimately wants to make its Word2Vec model a constraining wrapper around FastText, rather than FastText an enhancing-specialization of Word2Vec.

Again: more researchy-thoughts than anything else, but wanted to dump them here where they've been triggered.

piskvorky commented 4 years ago

maybe gensim ultimately wants to make its Word2Vec model a constraining wrapper around FastText, rather than FastText an enhancing-specialization of Word2Vec

My understanding is that the NLP world has moved on to unified "deep NN" architectures (e.g. the transformer), and word embeddings are now used primarily in that context.

So questions around internal *2vec trade-offs and APIs must be considered from that perspective. What makes the production and use of trained vectors easier?

The advantage of gensim's *2vec is its unique scalability, performance, cost (free) and Python. So we don't want to compromise on that, but other than that I have no preference for what-wraps-what.

I'd like to see the Gensim pipeline evolve with more complex document & similarity modeling (gensim's mission) of course. But seeing how the external contributions went over the past 4 years (one disaster after another in my estimation), I know gensim cannot possibly survive more of the same.

That leaves major changes to me and perhaps 1-2 other contributors whom I trust could hold the course. But who are all quite busy, so… not holding my breath. Good SW engineers are hard to come by, while "AI researchers" and "data scientists" are a dime a dozen.

Witiko commented 4 years ago

@gojomo I see #2698 has been merged. Perhaps now is a good time for me to start working on this.

gojomo commented 4 years ago

@Witiko - definitely!

Witiko commented 4 years ago

@gojomo Sanity checks for the new architecture seem to be passing:

[screenshot: sanity-check results for the new architecture]

gojomo commented 4 years ago

You may also notice that the memory use & saved-size of the model is much, much smaller.

Witiko commented 4 years ago

You may also notice that the memory use & saved-size of the model is much, much smaller.

Indeed. Seems like we're no longer storing the *_lockf arrays.

These are the model files (15G) stored via FastText.save before #2698:

$ du -hc data/models/*_ref=2360459_*
154M    data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32
4.0K    data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.accuracy
4.0K    data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.duration
2.6G    data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.trainables.syn1neg.npy
2.3G    data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.trainables.vectors_ngrams_lockf.npy
2.6G    data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.trainables.vectors_vocab_lockf.npy
2.3G    data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.wv.vectors_ngrams.npy
2.6G    data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.wv.vectors.npy
2.6G    data/models/fasttext-model_alpha=0.05_bucket=2000000_iter=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=2360459_sample=0.0001_sg=0_size=300_window=5_workers=32.wv.vectors_vocab.npy
15G     total

These are the model files (11G) stored via FastText.save after #2698:

$ du -hc data/models/*_ref=c0e0169_*
343M    data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32
4.0K    data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.accuracy
4.0K    data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.duration
2.6G    data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.syn1neg.npy
2.3G    data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.wv.vectors_ngrams.npy
2.6G    data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.wv.vectors.npy
2.6G    data/models/fasttext-model_alpha=0.05_bucket=2000000_epochs=5_max-n=6_min-alpha=0_min-count=5_min-n=3_negative=5_ref=c0e0169_sample=0.0001_sg=0_vector-size=300_window=5_workers=32.wv.vectors_vocab.npy
11G     total

gojomo commented 4 years ago

Another forthcoming PR, #2892, will avoid saving the .vectors array – another 2.6GB in your model – which is redundant because it is completely recalculable from .vectors_vocab & .vectors_ngrams. (And, I think the bloat of the root file, from 154M to 343M, is something inadvertent & fixable.)

Witiko commented 4 years ago

@gojomo Those will be welcome improvements.

Meanwhile, I have produced a first implementation of the position-dependent weighting and am currently training a model. There is currently no BLAS: we need a Hadamard product followed by a dot product ((position vector ∘ context word vector)ᵀ ⋅ center word vector) without intermediary arrays (new1 and work are already occupied), which will need some more thinking over, so for now I did the simplest thing that works. If the implementation works, I will open a PR next week.
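(For clarity, the operation in question sketched in plain NumPy/Python, with illustrative variable names; the fused loop is roughly what the Cython kernel has to do to avoid allocating a temporary:)

import numpy as np

rng = np.random.default_rng(0)
position = rng.normal(size=300)   # position vector for one context slot
context = rng.normal(size=300)    # context word vector
center = rng.normal(size=300)     # output vector of the sampled (center/negative) word

# Easy to express with BLAS, but needs a temporary array for the Hadamard product:
score_blas = np.dot(position * context, center)

# Fused loop with no intermediary array:
score_loop = 0.0
for k in range(300):
    score_loop += position[k] * context[k] * center[k]

assert np.isclose(score_blas, score_loop)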

Witiko commented 4 years ago

Our results show that positionally weighting just a few (10 out of 300) dimensions of the word vectors is enough to receive the benefit of full position-dependent weighting (rightmost, 72%* accuracy). This makes our implementation of position-dependent weighting fast in practice:

[figure: word analogy accuracy for the compared position-dependent weighting configurations]

More details in https://github.com/RaRe-Technologies/gensim/pull/2905#issuecomment-687100878.

* All accuracies are currently reported on a small dev set (2.4B words) over one epoch and do not necessarily reflect the performance on a larger dataset.

Witiko commented 3 years ago

We held a seminar session yesterday on the use of word embeddings in information retrieval applications. The implementation of position-dependent weighting in Gensim was a major topic; see 00:49:00–02:27:00 in the linked video.

Witiko commented 2 years ago

The work from draft #2905 has been accepted as a journal paper, co-authored by @piskvorky, which is to appear in J.UCS 28:2. To conduct the experiments for the paper, I produced the high-level PInE library, which uses a fork of Gensim 3.8.3 for model training. If there is interest, we can rebase the fork on top of the current develop branch.