Closed JackStillwell closed 4 years ago
Thanks for posting the log. That "collected 3 word types from a corpus of 14774682 raw words and 4924894 sentences" line is extremely fishy.
It means there were only three distinct word types in the entire corpus (impossible).
Can you try:
from itertools import islice

# Print the trained model's vocabulary (gensim 3.x keeps it in the wv.vocab dict).
print(list(model_with_wikipedia.wv.vocab.keys()))

# Print the first two documents. For word2vec, each document should be a list of
# words (strings).
print(list(islice(corpus, 2)))
I'm not sure that gensim.downloader.load('wiki-english-20171001') actually returns a sequence with the sort of items (lists-of-words) that Word2Vec expects. (This user code showing the problem isn't literally any documentation example, but stitches together things from multiple places.)
What exactly does the string 'wiki-english-20171001', when given to downloader.load(), do & return?
I'm not sure it's documented anywhere, and it's inherently hard to analyze given the sketchy behavior of the downloading API: it potentially downloads not-in-the-main-project source code & executes it, so the return value could be literally anything. (For example, that Dataset class the example text8 code shows via print(inspect.getsource(corpus.__class__))? I don't think that source comes from anywhere in https://github.com/RaRe-Technologies/gensim! What an arcane mess to present to beginners.)
@gojomo Your aversion to downloader is known.
The documentation is clear though: https://github.com/RaRe-Technologies/gensim-data/releases/tag/wiki-english-20171001
Once @JackStillwell tries my two commands above, the issue will become apparent :)
Namely, that the iteration yields dicts, and the model "vocabulary" consists of the three dict keys, such as section_titles and section_texts.
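To make the failure mode concrete, here is a minimal sketch of why dict-shaped documents collapse to a three-word vocabulary. It does not use gensim at all; the plain Counter loop is only a stand-in for a vocabulary scan, and the sample articles (including the assumed third key, title) are made up for illustration. Iterating a dict yields its keys, so every article contributes exactly three "words". That also matches the log: 3 × 4924894 sentences = 14774682 raw words.

```python
from collections import Counter

# Hypothetical stand-ins for two Wikipedia articles as yielded by the corpus.
docs = [
    {"title": "A", "section_titles": ["Intro"], "section_texts": ["some text"]},
    {"title": "B", "section_titles": ["Intro"], "section_texts": ["other text"]},
]

vocab = Counter()
for doc in docs:
    # Treating the dict as a "list of words": iteration yields only its keys.
    for word in doc:
        vocab[word] += 1

print(sorted(vocab))        # the distinct "word types" seen
print(sum(vocab.values()))  # the "raw words" count: 3 per document
```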
@JackStillwell word2vec needs each document as a list of words, not a dict. So you want an extra step that takes the titles and texts from each Wikipedia article, and presents them as a list of strings to word2vec. Check out Generators, iterators, iterables if unsure how streaming works in Python.
I guess it even makes sense to send each article section as a separate document to word2vec (so each Wikipedia article becomes several documents). But try it and see what works better for you.
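The extra step described above could look roughly like the sketch below. It assumes each article is a dict with parallel section_titles and section_texts lists; the class name SectionSentences, the regex tokenizer, and the sample article are all hypothetical (a real pipeline might prefer a proper tokenizer such as gensim.utils.simple_preprocess). It is written as a class with __iter__ rather than a one-shot generator because word2vec training makes several passes over the corpus.

```python
import re

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a stand-in for a real tokenizer)."""
    return TOKEN_RE.findall(text.lower())


class SectionSentences:
    """Restartable iterable yielding one list of tokens per article section."""

    def __init__(self, articles):
        self.articles = articles  # any re-iterable of article dicts

    def __iter__(self):
        for article in self.articles:
            sections = zip(article["section_titles"], article["section_texts"])
            for heading, text in sections:
                tokens = tokenize(heading) + tokenize(text)
                if tokens:  # skip empty sections
                    yield tokens


# Tiny in-memory stand-in for the downloaded corpus (hypothetical data).
sample_articles = [
    {
        "title": "Anarchism",
        "section_titles": ["Introduction", "History"],
        "section_texts": ["Anarchism is a political philosophy.", "Early origins."],
    },
]

sentences = SectionSentences(sample_articles)
for tokens in sentences:
    print(tokens)
```

With the real data, something like Word2Vec(sentences=SectionSentences(corpus)) should then see lists of strings instead of dicts. Whether per-section or per-article documents work better is something to test on your own task.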
That page is helpful about the data-contents of 'wiki-english-20171001', but I haven't noticed that page linked from docs/examples that encourage use of .load('wiki-english-20171001'). (I suppose it might be found via Google?)
And, the page (like gensim docs of gensim.downloader) doesn't describe key aspects of what gensim.downloader.load('wiki-english-20171001') does. Its return type isn't declared, nor is there any disclosure that "loading a dataset" will actually run arbitrary additional Python code loaded from elsewhere.
That __init__.py code that's being run on end-user machines isn't in the gensim-data or gensim source trees. It's not even easily web-browsable at GitHub, which prompts a download.
Who are this source code's authors, and what is its history? (Impossible to tell for such 'assets', using a GitHub feature that's not designed for source code.)
If it were buggy, how would someone contribute a fix? (I see no way to open a reviewable PR against it.)
If it were maliciously changed, who would even notice? (AFAICT, changes to the 'assets' area of an existing Github project release can happen anytime, and generate no public logs/notifications. Perhaps it's visible to maintainers?)
Problem description
When following the example code here I receive a "Word not in vocabulary" error. Opening at the request of Radim: https://groups.google.com/forum/#!topic/gensim/ULW_OKrPtqE
Steps/code/corpus to reproduce
Logging output (truncated to non-repeat / progress):
Versions
Linux-4.15.0-74-generic-x86_64-with-debian-buster-sid
Python 3.7.4 (default, Sep 5 2019, 19:15:53) [GCC 7.4.0]
NumPy 1.18.1
SciPy 1.4.1
gensim 3.8.1
FAST_VERSION 1