ldmtwo opened 4 years ago
I don't believe `load_facebook_model()` alone would generate that log output. Is that the full code for generating that output, or were you also calling something else after the load? (Loading a model with 2M words wouldn't typically cause any reports of 4M words.)
Do you have any tangible evidence any words have been lost? (Even when that log output normally happens, it's just declaring what the effect of sampling will be on further text training - nothing is actually discarded.)
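To make the "nothing is actually discarded" point concrete, here is a minimal sketch of the standard word2vec-style frequency-downsampling rule (the same rule gensim applies, controlled by the `sample` parameter). The counts below are illustrative, not taken from the actual model: downsampling only reduces the probability that occurrences of very frequent words are used during training; the words and their vectors stay in the vocabulary.

```python
import math

def keep_probability(word_count, total_words, sample=1e-3):
    """Probability that one occurrence of a word is kept during training
    under word2vec-style frequency downsampling (illustrative sketch)."""
    freq = word_count / total_words
    if freq <= sample:
        return 1.0  # words at or below the threshold are never downsampled
    p = (math.sqrt(freq / sample) + 1) * (sample / freq)
    return min(p, 1.0)

# A hypothetical very frequent word (50M occurrences in a 600M-token corpus)
# is sampled aggressively, but still appears in training with this probability,
# and remains in the vocabulary regardless:
print(round(keep_probability(50_000_000, 600_000_000), 4))  # → 0.1215

# A rare word is never downsampled:
print(keep_probability(10, 600_000_000))  # → 1.0
```

Setting `sample=0` before further training disables this subsampling entirely, if the goal is to train on every occurrence.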
Any effects of `build_vocab_from_freq()` would be a separate matter – your code doesn't show if/how/when you're calling it – and I don't believe it would ever work to modify an existing model's known vocabulary. (It's an option used on new models in place of a typical corpus survey.)
How can I get this to load the original model without reducing the quality, as shown in the log? I plan to continue training, and it automatically downsamples. When I call `model.build_vocab_from_freq()`, it also reduces the intended vocab. The data I'm using is:
crawl-300d-2M-subword.zip: 2 million word vectors trained with subword information on Common Crawl (600B tokens).