oborchers / Fast_Sentence_Embeddings

Compute Sentence Embeddings Fast!
GNU General Public License v3.0

save sif_model.sv.vectors.npy file is very large? #14

Closed RyanHuangNLP closed 4 years ago

RyanHuangNLP commented 4 years ago

I found that my sif_model.sv.vectors.npy file holds just a (758194, 100) matrix, yet the file is 15G. When I save a (800000, 100) matrix to an npy file myself, it is only about 600MB. Is this normal? I trained the SIF model on 30 million sentences.

-rw-r--r-- 1 ke ke  43M oct 11 19:09 sif_model
-rw-r--r-- 1 ke ke  15G oct 11 19:09 sif_model.sv.vectors.npy  <<----- this file very large
-rw-r--r-- 1 ke ke 290M oct 11 19:07 sif_model.wv.vectors.npy

oborchers commented 4 years ago

Hi! If you train the model on 30 million sentences, you should end up with an array of shape (30*10^6, 100).

The formula to determine the approx size of the array is: sentences * vector_size * np.dtype(np.float32).itemsize

For your purpose that'd be equal to: 30e6*100*np.dtype(np.float32).itemsize / 1024**3, which is roughly 11 GiB. Thus 15G is a bit higher than expected.
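A quick sketch of that size estimate in NumPy (the sentence count and vector size are taken from this thread; the variable names are just for illustration):

```python
import numpy as np

# Estimated on-disk size of the sentence-vector array:
# rows * columns * bytes per float32 element
sentences = 30_000_000   # sentences used for training, per the thread
vector_size = 100        # embedding dimensionality

size_bytes = sentences * vector_size * np.dtype(np.float32).itemsize
size_gib = size_bytes / 1024**3

print(f"expected: {size_gib:.2f} GiB")  # ~11.18 GiB, well short of 15G
```

If the file is substantially larger than this estimate, it may be worth checking the array's actual dtype and shape with `np.load(..., mmap_mode='r')`, which inspects the header without reading the whole file into memory.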