piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Load full native fastText Facebook model is partial #2969

Open aviclu opened 3 years ago

aviclu commented 3 years ago

Problem description

The hidden-layer vectors are not loaded correctly. I'm using the gensim.models.fasttext.load_facebook_model function to load the .bin file, but syn1 fails to load, and trainables.syn1neg is full of zeros.

'FastTextTrainables' object has no attribute 'syn1'

Steps/code/corpus to reproduce

Simply call ft = gensim.models.fasttext.load_facebook_model(fname) on Facebook's model, then access ft.syn1 (which fails with the error above) or ft.trainables.syn1neg (which returns an all-zero array).

Versions

Windows-2012ServerR2-6.3.9600-SP0
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
Bits 64
NumPy 1.18.3
SciPy 1.4.1
gensim 3.8.3
FAST_VERSION 0

gojomo commented 3 years ago

Are you using a particular public model, and if so, which one? Alternatively, if using a private model, with what parameters was it trained?

aviclu commented 3 years ago

@gojomo I'm using the official crawl-300d-2M-subword.bin file, which I downloaded from https://fasttext.cc/docs/en/english-vectors.html

gojomo commented 3 years ago

Thanks. There would only be one of either syn1 or syn1neg - but whichever loads, should have non-zero values.
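Since only one of the two arrays is expected to exist, and the thread shows it under both `model.syn1neg` and `model.trainables.syn1neg`, a small lookup helper can make checks less dependent on the gensim version. The following is a minimal sketch: `find_hidden_layer` is a hypothetical name, and the attribute locations it probes are assumptions taken from the names used in this thread, not a documented gensim API.

```python
from types import SimpleNamespace


def find_hidden_layer(model):
    """Return (location, matrix) for the first hidden-layer array found.

    Assumption: the array lives under `syn1neg` or `syn1`, either on the
    model itself or on `model.trainables`, as the names in this thread suggest.
    """
    holders = (
        ("model", model),
        ("model.trainables", getattr(model, "trainables", None)),
    )
    for prefix, holder in holders:
        if holder is None:
            continue
        for name in ("syn1neg", "syn1"):
            try:
                matrix = getattr(holder, name)
            except AttributeError:
                continue
            if matrix is not None:
                return f"{prefix}.{name}", matrix
    raise AttributeError("no syn1neg or syn1 array found on this model")


# Stand-in objects that only mimic the two attribute layouts; not real models.
old_style = SimpleNamespace(trainables=SimpleNamespace(syn1neg=[[0.0, 0.0]]))
new_style = SimpleNamespace(syn1neg=[[0.1, -0.2]])

print(find_hidden_layer(old_style)[0])  # model.trainables.syn1neg
print(find_hidden_layer(new_style)[0])  # model.syn1neg
```

With a real model, the returned matrix could then be inspected for non-zero values regardless of where the array is stored.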

Was anything anomalous displayed during load, especially if setting global logging level to DEBUG?
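For reference, one way to raise the global logging level before loading is sketched below. It relies only on the standard `logging` module; `enable_debug_logging` and `load_with_debug_logging` are hypothetical helper names, and gensim is imported lazily so the logging setup can be tried on its own.

```python
import logging


def enable_debug_logging():
    """Route log records of every level, including DEBUG, to stderr."""
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(name)s : %(message)s",
    )
    # basicConfig is a no-op if handlers already exist, so set the level explicitly.
    logging.getLogger().setLevel(logging.DEBUG)


def load_with_debug_logging(path):
    """Load a native fastText .bin file with DEBUG logging switched on."""
    enable_debug_logging()
    # Imported here so the logging setup above does not require gensim.
    from gensim.models.fasttext import load_facebook_model
    return load_facebook_model(path)
```

Calling `load_with_debug_logging(fname)` on the file in question should then print whatever the loader logs while reading it.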

gojomo commented 3 years ago

I've confirmed that even in our pre-4.0.0 develop branch, which has a lot of FastText fixes (but nothing specifically touching load_facebook_model), loading that model results in an all-zeros syn1neg.

It looks like @mpenkov added the load_facebook_model entry point in #2376, but it wholly depends on the earlier _load_fasttext_format function also by @menshikh-iv.

Are we sure this ever worked? Is there a chance the file itself has zeros? (Trying load_facebook_model(datapath('lee_fasttext_new.bin')), a toy-sized testing file checked in unit tests, does show non-zeros in the model's syn1neg.)

gojomo commented 3 years ago

From a quick scan of the tests in test_fasttext.py, I don't see anything that meaningfully tests the results of load_facebook_model() beyond the loaded vectors. (That is: nothing tests what makes load_facebook_model different from load_facebook_vectors.)

There is one attempted roundtrip test, if the native FT_HOME directory is available, in SaveFacebookByteIdentityTest. But that directory isn't usually available, so I'm unsure whether this ever worked.

It's likely load_facebook_model doesn't work at all for its intended purpose.
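A test that targets the hidden layer itself could assert that it is not entirely zero. The following is a minimal sketch of such a check, not an existing gensim test; `zero_row_fraction` is a hypothetical helper written in plain Python so that it accepts both nested lists and NumPy arrays.

```python
def zero_row_fraction(matrix):
    """Return the fraction of rows in a 2-D matrix whose entries are all zero."""
    total = 0
    zero_rows = 0
    for row in matrix:
        total += 1
        if not any(value != 0 for value in row):
            zero_rows += 1
    if total == 0:
        raise ValueError("matrix has no rows")
    return zero_rows / total


# Toy matrices standing in for a loaded hidden layer.
print(zero_row_fraction([[0.0, 0.0], [0.0, 1.5]]))  # 0.5
print(zero_row_fraction([[0.0, 0.0], [0.0, 0.0]]))  # 1.0
```

A unit test could require this value to stay below 1.0 for the hidden layer of a model loaded with load_facebook_model. On a full-size model a vectorized NumPy check would be much faster than this loop.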

piskvorky commented 3 years ago

Marking this as blocking for 4.0.0 – CC @mpenkov can you check?

mpenkov commented 3 years ago
import gensim.models.fasttext
import gensim.test.utils
path = gensim.test.utils.datapath('lee_fasttext_new.bin')
model = gensim.models.fasttext.load_facebook_model(path)
print(model.syn1neg)

Gives:

array([[ 0.27832156,  0.15093271, -0.05810147, ...,  0.20399494,
         0.10794587, -0.17611295],
       [ 0.04015477,  0.2320431 , -0.31041363, ...,  0.07040029,
         0.17735204, -0.23731148],
       [ 0.33127972, -0.08667868, -0.1704444 , ...,  0.20603168,
         0.11391634, -0.15840392],
       ...,
       [ 0.17141579,  0.02448652, -0.14411658, ..., -0.07036947,
         0.4076898 , -0.33286095],
       [ 0.09963796,  0.09554827, -0.1726573 , ..., -0.11196624,
         0.25655633, -0.24722196],
       [ 0.16295125, -0.02737397, -0.12545614, ..., -0.00165336,
         0.31274942, -0.20620131]], dtype=float32)

gojomo commented 3 years ago

Indeed, that load (from a tiny file in the test directory of unclear vintage) gives a syn1neg value that looks correct, as noted in my comment of 2020-09-30.

The report is of zeros when loading a large full model from Facebook - specifically crawl-300d-2M-subword.bin.

mpenkov commented 3 years ago

Yeah, I had to leave it loading overnight. And yes, I get the same results as you. So now we're on the same page.

import sys
import gensim.models.fasttext
path = sys.argv[1]
model = gensim.models.fasttext.load_facebook_model(path)
print(model.syn1neg)

$ time python repr_real.py ~/Downloads/crawl-300d-2M-subword/crawl-300d-2M-subword.bin
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

real    30m43.069s
user    3m28.906s
sys     3m2.842s

mpenkov commented 3 years ago

I had a closer look at that file (crawl-300d-2M-subword.bin). At the end of the file, where we expect the hidden layer to be, there's a bunch of zeros.

import collections
import gensim.models._fasttext_bin

path = '/Users/misha/Downloads/crawl-300d-2M-subword/crawl-300d-2M-subword.bin'
seek_pos = 4835845135  # start of the hidden-layer matrix, obtained via pdb
with open(path, 'rb') as fin:
    # Read everything from the hidden layer to the end of the file as raw bytes.
    fin.seek(seek_pos)
    matrix_bytes = fin.read()
    # Rewind and parse the same region with gensim's own matrix loader.
    fin.seek(seek_pos)
    matrix = gensim.models._fasttext_bin._load_matrix(fin, new_format=True)

print(matrix)

# Tally how often each byte value occurs in the raw region.
counter = collections.Counter(matrix_bytes)
print(counter)

I got the seek position by inserting a breakpoint into the loading code here.

$ time python repr_readmatrix.py 
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Counter({0: 2400000012, 128: 1, 132: 1, 30: 1, 44: 1, 1: 1})

real    2m59.471s
user    2m43.491s
sys     0m7.710s
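
The script above pulls the whole tail of the file (about 2.4 GB) into memory with a single fin.read(). A chunked variant of the same byte count keeps memory use flat. This is a minimal sketch; `count_bytes_from` and its `chunk_size` parameter are hypothetical names, not part of gensim.

```python
import collections


def count_bytes_from(path, offset, chunk_size=1 << 20):
    """Count byte values from `offset` to the end of the file, chunk by chunk."""
    counts = collections.Counter()
    with open(path, "rb") as fin:
        fin.seek(offset)
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            counts.update(chunk)  # iterating bytes yields integers 0..255
    return counts
```

Called as `count_bytes_from(path, seek_pos)`, it should produce the same Counter as the script while holding at most one chunk in memory at a time.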

Our code correctly interprets that as a (2M x 300) matrix of zeros.
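
The five non-zero bytes in the Counter output can be accounted for by the matrix header alone. The sketch below assumes the header is a one-byte bool flag followed by two 64-bit integers (rows, columns), that the integers are little-endian, and that the payload is float32. Those layout details are assumptions about what _load_matrix reads with new_format=True, not something stated in this thread.

```python
import collections
import struct

ROWS, COLS = 2_000_000, 300  # matrix shape reported above
FLOAT_SIZE = 4               # bytes per float32 value (assumed dtype)

# Assumed header layout: one bool flag, then two little-endian int64 values.
header = struct.pack("<?", False) + struct.pack("<2q", ROWS, COLS)

header_counts = collections.Counter(header)
payload_bytes = ROWS * COLS * FLOAT_SIZE

nonzero_header_bytes = sorted(b for b in header_counts if b != 0)
zeros_if_payload_all_zero = header_counts[0] + payload_bytes

print(len(header))                # 17
print(nonzero_header_bytes)       # [1, 30, 44, 128, 132]
print(zeros_if_payload_all_zero)  # 2400000012
```

Under these assumptions, the bytes 128, 132 and 30 encode 2,000,000 and the bytes 44 and 1 encode 300. The 12 zero bytes of the header plus a payload of 2,400,000,000 zero bytes give exactly the reported count of 2400000012, which would mean that every byte of the float data is zero.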

I can think of two explanations for this.

  1. Something changed in the model format and we haven't been keeping up. I think we may need to revisit the format of their model files.
  2. Their model is buggy. It's unlikely we (or anybody else) can extract anything other than zeros from a slab of bytes that is 99.99% zero.

@gojomo @piskvorky Which do you think is the more likely explanation? Could there be another?

gojomo commented 3 years ago

That's suspicious, as I wouldn't expect any large ranges of zero vectors in a properly saved model. Maybe point out the oddity and ask in the FacebookResearch fastText project's issues? Or devise a differential test that would work well with a real syn1neg but poorly, or not at all, with an uninitialized layer? (I'm having a hard time thinking of a stark, compact test. Any effect might be most evident in a -supervised mode model, which that file isn't; files in that mode might be saving the 'right' things even if this file isn't.)

mpenkov commented 3 years ago

Should we still treat this as a blocker for 4.0.0?

mpenkov commented 3 years ago

It doesn't look like the FB guys will examine this anytime soon, so I suggest we remove this from the milestone and move on with the release.

mpenkov commented 3 years ago

@piskvorky Removing this from the milestone as discussed during our last meeting. Please let me know if I've misunderstood.

piskvorky commented 3 years ago

Yes, thanks. If it's really a bug with the FB model, not much we can do about it.