piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Word2Vec Wikipedia Corpus 2017 no vocabulary #2737

Closed JackStillwell closed 4 years ago

JackStillwell commented 4 years ago

Problem description

When following the example code here, I receive a "word not in vocabulary" error. Opening this issue at the request of Radim: https://groups.google.com/forum/#!topic/gensim/ULW_OKrPtqE

Steps/code/corpus to reproduce

from gensim.models.word2vec import Word2Vec
import gensim.downloader as gensim_download_api

# Stream the 2017-10-01 English Wikipedia dump via the gensim-data downloader.
wikipedia_corpus = gensim_download_api.load('wiki-english-20171001')
model_with_wikipedia = Word2Vec(wikipedia_corpus)
model_with_wikipedia.wv.most_similar('cat')

This raises:

KeyError: "word 'cat' not in vocabulary"

Logging output (truncated; repeated progress lines omitted):

INFO:gensim.models.word2vec:PROGRESS: at sentence #4920000, processed 14760000 words, keeping 3 word types
INFO:gensim.models.word2vec:collected 3 word types from a corpus of 14774682 raw words and 4924894 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 3 unique words (100% of original 3, drops 0)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14774682 word corpus (100% of original 14774682, drops 0)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 3 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 3 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 853566 word corpus (5.8% of prior 14774682)
INFO:gensim.models.base_any2vec:estimated required memory for 3 words and 100 dimensions: 3900 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5

Versions

Linux-4.15.0-74-generic-x86_64-with-debian-buster-sid
Python 3.7.4 (default, Sep 5 2019, 19:15:53) [GCC 7.4.0]
NumPy 1.18.1
SciPy 1.4.1
gensim 3.8.1
FAST_VERSION 1

piskvorky commented 4 years ago

Thanks for posting the log. That "collected 3 word types from a corpus of 14774682 raw words and 4924894 sentences" is extremely fishy: only three distinct word types in the entire corpus is impossible for real Wikipedia text. (Note also that 14774682 is exactly 3 × 4924894, i.e. every "sentence" contributed exactly the same 3 "words".)

Can you try:

from itertools import islice

# Print the trained model's vocabulary.
print(list(model_with_wikipedia.wv.vocab.keys()))

# Print the first two documents. For word2vec, each document should be a list of
# words (strings).
print(list(islice(wikipedia_corpus, 2)))
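
For context, given what the rest of the thread establishes about this dataset, those two prints would be expected to show something like the following (field names per the gensim-data release notes; article contents elided, not actual output):

['title', 'section_titles', 'section_texts']
[{'title': '...', 'section_titles': ['...', ...], 'section_texts': ['...', ...]}, ...]
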
gojomo commented 4 years ago

I'm not sure that gensim.downloader.load('wiki-english-20171001') actually returns a sequence with the sort of items (lists-of-words) that Word2Vec expects. (The user code showing the problem isn't literally from any documentation example; it stitches together things from multiple places.)

What exactly does the string 'wiki-english-20171001', when given to downloader.load(), do and return?

I'm not sure it's documented anywhere, and it's inherently hard to analyze given the sketchy behavior of the downloading API: it potentially downloads source code that lives outside the main project and executes it, so the return value could be literally anything. (For example, that Dataset class the text8 example shows via print(inspect.getsource(corpus.__class__))? I don't think that source comes from anywhere in https://github.com/RaRe-Technologies/gensim! What an arcane mess to present to beginners.)
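
One way to see this for yourself is to inspect the object the downloader hands back; a minimal sketch, assuming the download succeeds (the class and its source file belong to the downloaded package, not to gensim proper):

import inspect
import gensim.downloader as api

# First use downloads both the data and the loader code bundled with it.
corpus = api.load('wiki-english-20171001')

# The corpus class is defined in the downloaded package's __init__.py,
# not anywhere in the gensim source tree.
print(type(corpus))
print(inspect.getsourcefile(corpus.__class__))
print(inspect.getsource(corpus.__class__))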

piskvorky commented 4 years ago

@gojomo Your aversion to downloader is known.

The documentation is clear though: https://github.com/RaRe-Technologies/gensim-data/releases/tag/wiki-english-20171001

Once @JackStillwell tries my two commands above, the issue will become apparent :)

Namely, that the iteration yields dicts, and the model's "vocabulary" consists of the three dict keys: title, section_titles and section_texts.

@JackStillwell word2vec needs each document as a list of words, not a dict. So you want an extra step that takes the titles and texts from each Wikipedia article and presents each one to word2vec as a list of word strings (see the sketch at the end of this comment). Check out Generators, iterators, iterables if unsure how streaming works in Python.

I guess it even makes sense to send each article section as a separate document to word2vec (so each Wikipedia article becomes several documents). But try it and see what works better for you.
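
A minimal sketch of that extra step, assuming each item yielded by the downloader is a dict with a 'section_texts' field as described on the release page, that the underlying dataset object can be iterated more than once, and using gensim.utils.simple_preprocess as one reasonable tokenizer:

from gensim.models.word2vec import Word2Vec
from gensim.utils import simple_preprocess

class WikiSectionCorpus:
    # Wrap the dict-per-article stream, yielding one tokenized document
    # per article section. A class with __iter__ (rather than a one-shot
    # generator) because Word2Vec iterates over the corpus several times:
    # once to build the vocabulary, then once per training epoch.
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for article in self.dataset:
            for section_text in article['section_texts']:
                yield simple_preprocess(section_text)

model_with_wikipedia = Word2Vec(WikiSectionCorpus(wikipedia_corpus))
model_with_wikipedia.wv.most_similar('cat')  # no more KeyError

To fold the section titles in as well, tokenize article['section_titles'] the same way and prepend those tokens to each document.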

gojomo commented 4 years ago

That page is helpful about the data contents of 'wiki-english-20171001', but I haven't noticed it linked from the docs/examples that encourage use of .load('wiki-english-20171001'). (I suppose it might be found via Google?)

And the page (like the gensim docs for gensim.downloader) doesn't describe key aspects of what api.load('wiki-english-20171001') does: neither its return type is declared, nor is there any disclosure that "loading a dataset" will actually run arbitrary additional Python code downloaded from elsewhere.

That __init__.py code being run on end-user machines isn't in the gensim-data or gensim source trees. It's not even easily web-browsable on GitHub, which just prompts a download of the asset.

Who are this source code's authors, and what's its history? (Impossible to tell for such 'assets', a GitHub feature that's not designed for source code.)

If it were buggy, how would someone contribute a fix? (I see no way to open a reviewable PR against it.)

If it were maliciously changed, who would even notice? (AFAICT, changes to the 'assets' area of an existing GitHub project release can happen at any time and generate no public logs/notifications. Perhaps they're visible to maintainers?)