piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

load_facebook_model() perturbs model with lower quality #2862

Open ldmtwo opened 4 years ago

ldmtwo commented 4 years ago

How can I load the original model without the quality reduction shown in the log below? I plan to continue training, but loading automatically downsamples the model. Calling model.build_vocab_from_freq() also shrinks the vocabulary below what I intended.

    from gensim.models.fasttext import *

    # model_file is set elsewhere (not shown in this report)
    model_path = datapath(model_file)
    model = load_facebook_model(model_path)

INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from /home/me/data/external_models/cc.en.300.bin
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 6996 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 390315457935 word corpus (70.7% of prior 552001338161)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from /home/me/data/external_models/cc.en.300.bin

The data I'm using is crawl-300d-2M-subword.zip: 2 million word vectors trained with subword information on Common Crawl (600B tokens).

gojomo commented 4 years ago

I don't believe load_facebook_model() alone would generate that log output. Is that the full code for generating that output, or were you also calling something else after the load? (Loading a model with 2M words wouldn't typically cause any reports of 4M words.)

Do you have any tangible evidence that any words have been lost? (Even when that log output appears in normal use, it only declares what effect sampling will have on further text training; nothing is actually discarded.)
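To illustrate why a "downsamples N most-common words" line does not mean words were removed, here is a minimal pure-Python sketch of word2vec-style frequent-word subsampling. The keep-probability formula is an assumption modeled on how gensim is understood to compute it, and `subsampling_report` and the toy counts are hypothetical names and data, not part of gensim or of the reporter's model.

```python
from math import sqrt


def subsampling_report(raw_counts, sample):
    """Estimate the effect of frequent-word subsampling on future training.

    Only per-word keep probabilities are computed; raw_counts is never
    modified and no word is dropped from it.
    """
    total = sum(raw_counts.values())
    threshold = sample * total  # reference count that scales with corpus size
    keep_prob = {}
    downsampled_words = 0
    expected_total = 0.0
    for word, count in raw_counts.items():
        # Assumed formula: very frequent words get a probability below 1.0
        p = (sqrt(count / threshold) + 1) * (threshold / count)
        if p < 1.0:
            downsampled_words += 1
        else:
            p = 1.0  # rarer words are always kept during training
        keep_prob[word] = p
        expected_total += p * count
    return keep_prob, downsampled_words, expected_total, total


# Hypothetical toy counts, not taken from cc.en.300.bin.
counts = {"the": 1000, "cat": 10, "sat": 10}
keep_prob, n_down, est_total, prior_total = subsampling_report(counts, sample=0.1)

print("downsamples %d most-common words" % n_down)
print("downsampling leaves estimated %.0f word corpus (%.1f%% of prior %d)"
      % (est_total, 100.0 * est_total / prior_total, prior_total))
print("vocabulary unchanged:", set(keep_prob) == set(counts))
```

In this sketch every word stays in the vocabulary. The report only describes how often frequent words would be randomly skipped if training continued on new text.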

Any effects of build_vocab_from_freq() would be a separate matter – your code doesn't show if/how/when you're calling it – and I don't believe it would ever work to modify an existing model's known vocabulary. (It's an option used on new models instead of a typical corpus-survey.)