piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim

add functions to reproduce preprocessing matching `GoogleNews`, `GloVe`, etc. pretrained word-vectors #3485

Open gojomo opened 1 year ago

gojomo commented 1 year ago

Suggested on project discussion list (https://groups.google.com/g/gensim/c/CsER2XBs8P4/m/f2EntuXRAgAJ):

> Having discovered the undocumented feature that common words like `I'm`, `we're`, `don't`, etc. are OOV in the common GloVe pretrained models (while words like `o'clock` are in, so you can't just split on apostrophes/single quotes), and seeing no docs except some vague references that the Stanford parser with undocumented switches MIGHT have been used to generate the common pretrained GloVe models, and finding ZERO comments from Google about how they preprocessed the text used for Word2Vec's Google News pretrained model, it seems to me that Gensim would do people a lot of good by making tokenizers matching each of its most popular included pretrained models, so that users are writing NLP programs that speak the same language as their models rather than comparing apples to oranges.
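For context, the OOV behaviour described above is easy to check directly with gensim's downloader. A minimal sketch, assuming the `glove-wiki-gigaword-100` release from gensim-data; the sample tokens are illustrative, and which forms turn out in- or out-of-vocabulary depends on the particular release you load:

```python
import gensim.downloader as api

# Load one of the pretrained GloVe releases shipped via gensim-data
# (downloads on first use; substitute whichever release you care about).
kv = api.load("glove-wiki-gigaword-100")

# Check which surface forms the release actually contains.
for token in ["i'm", "we're", "don't", "o'clock"]:
    print(f"{token!r:10} in vocab: {token in kv.key_to_index}")
```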

gojomo commented 1 year ago

My thoughts:

A desire for help here has come up a lot – & at times I've shared my observations about what can be deduced from the limited statements, & observable contents, of pre-trained vector sets like the 'GoogleNews' release.

However, without disclosures (or better yet code) from the original researchers who prepared such pretrained vectors, all such efforts will only ever gradually approximate their practices, with lingering exceptions & caveats generating more questions.

Also: it often seems to be beginner & small-data projects that are most eager to re-use pretrained vectors from elsewhere, under the assumption that those must be the "right" thing, or better than what they'd achieve on their own. But many times that's not the case.

For example, GoogleNews was trained on an internal Google corpus of news articles 11+ years ago. It used a statistical model for creating multiword-tokens whose exact parameters/word-frequencies/multigram-frequencies have never been disclosed. For many current projects, word-vectors trained on more-recent domain-specific data via understood & consciously-chosen preprocessing – even much less data! – will likely give better vocabulary & relevant-word-sense coverage than Google's old work.
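If the goal is understood & consciously-chosen preprocessing on your own corpus, rather than guessing Google's, gensim already has the relevant pieces. A rough sketch using `gensim.models.phrases.Phrases` to build multiword tokens before training; the toy corpus and the `min_count`/`threshold` values are placeholders to tune, not an attempt to reproduce Google's undisclosed settings:

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, FrozenPhrases
from gensim.utils import simple_preprocess

# Toy corpus standing in for your own domain-specific text.
raw_docs = [
    "new york city is in new york state",
    "vectors trained on recent domain specific data often beat reused vectors",
]
sentences = [simple_preprocess(doc) for doc in raw_docs]

# Learn multiword tokens (e.g. "new_york") from *your* corpus.
# min_count/threshold are placeholders; Google's values are unknown.
phrases = Phrases(sentences, min_count=1, threshold=1, delimiter="_")
frozen = FrozenPhrases(phrases)
phrased = [frozen[s] for s in sentences]

# Train word-vectors on the phrased corpus, with preprocessing you fully control.
model = Word2Vec(phrased, vector_size=100, window=5, min_count=1, epochs=10)
print(model.wv.index_to_key[:10])
```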

So while I'd see some value in a "best guess" function to mimic the tokenizing choices of those commonly-used pretrained sets – as a research effort, or contribution – I'd also prefer it prominently disclaimed as unofficial, & not necessarily an endorsement of preferring those vectors, or that tokenization, for anyone's particular purpose.
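As a sketch of what such a "best guess" helper might look like (emphatically not an official reproduction of anyone's pipeline), one option is to tokenize however you like, then greedily merge adjacent words into underscore-joined multigrams only when the merged form actually exists in the loaded pretrained vectors. The helper name `tokenize_to_vocab`, the greedy longest-match strategy, and the local file path are illustrative assumptions, not recovered behaviour of the original tools:

```python
from gensim.models import KeyedVectors

def tokenize_to_vocab(text, kv, max_ngram=3):
    """Best-guess tokenizer: split on whitespace, then greedily join up to
    `max_ngram` adjacent words with '_' whenever the joined form is a key in
    the pretrained KeyedVectors `kv`. This only approximates GoogleNews-style
    multiword tokens; it cannot recover the original, undisclosed pipeline."""
    words = text.split()
    out, i = [], 0
    while i < len(words):
        match, span = words[i], 1
        # Prefer the longest underscore-joined n-gram present in the vocabulary.
        for n in range(min(max_ngram, len(words) - i), 1, -1):
            candidate = "_".join(words[i:i + n])
            if candidate in kv.key_to_index:
                match, span = candidate, n
                break
        out.append(match)
        i += span
    return out

# Usage (assumes a local copy of the GoogleNews vectors; path is illustrative):
# kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
# print(tokenize_to_vocab("I visited New York last week", kv))
```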

At this point, devising such helpers would be a sort of software-archeology/mystery project, and I'd not see it as any sort of urgent priority. But, it might make a good new-contributor, student, or hackathon project – especially if eventual integration includes good surrounding docs/discussion/demos of the limits/considerations involved in reusing another project's vectors/preprocessing choices.