slow speed for SIF model for large corpus

oborchers / Fast_Sentence_Embeddings

Compute Sentence Embeddings Fast!

GNU General Public License v3.0

616 stars 83 forks

slow speed for SIF model for large corpus #32

Closed: aydv closed this issue 3 years ago

aydv commented 4 years ago

Hi, I have been experimenting with fse. For a small dataset of 200-300k sentences, embedding generation was very fast. But now I am training on a large corpus of 50 million sentences. I am using 12 workers and the training is still very slow: according to the logs, it runs at roughly 700 sentences/sec. I am using gensim.models.FastText. I also got a user warning, "C extension not loaded, training/inferring will be slow.", on Ubuntu 16.04. Is there any way to increase the speed? Thank you

oborchers commented 3 years ago

Hi, if you get the warning that the C extension is not loaded, then something went wrong during the Cython installation. In that case, fse falls back to the existing numpy implementation only (slow as frick).

Try setting up a new environment and check whether your Gensim installation also throws the same warning. If not, please report the issue again, because then something seems to have gone wrong in the setup process.
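A quick way to run that check from Python is sketched below. It relies on the Gensim convention that `gensim.models.word2vec.FAST_VERSION` is `-1` when the compiled routines are missing. The fse module path `fse.models.average` and the helper name `c_extension_status` are assumptions for illustration, so adjust them to your installed versions.

```python
import importlib


def c_extension_status(module_name, attr="FAST_VERSION"):
    """Report whether a module exposes a compiled fast path.

    Assumes the module follows the Gensim convention of a FAST_VERSION
    attribute that is -1 when only the pure Python/numpy code is available.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return "not installed"
    version = getattr(module, attr, None)
    if version is None:
        # The module exists but does not expose the attribute we look for.
        return "unknown"
    return "compiled" if version > -1 else "missing (slow fallback)"


if __name__ == "__main__":
    # "fse.models.average" is an assumed location, not a confirmed one.
    for name in ("gensim.models.word2vec", "fse.models.average"):
        print(name, "->", c_extension_status(name))
```

If Gensim reports "compiled" but fse does not, the problem is most likely limited to the fse build in that environment.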

Additionally: Try using 1-2 workers and only then increase the number.
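To see how throughput changes with the worker count, a small timing harness like the following can help. It is a generic sketch: `throughput_by_workers` and `train_fn` are hypothetical names, and `train_fn` stands for whatever function builds and trains your model with a given number of workers on a fixed sample.

```python
import time


def throughput_by_workers(train_fn, n_sentences, worker_counts=(1, 2, 4, 8, 12)):
    """Time train_fn(workers) for each worker count.

    train_fn is a user-supplied (hypothetical) callable that trains on a
    fixed sample of n_sentences sentences. Returns sentences/sec per count.
    """
    results = {}
    for workers in worker_counts:
        start = time.perf_counter()
        train_fn(workers)
        elapsed = time.perf_counter() - start
        # Guard against a zero-length interval on very fast dummy runs.
        results[workers] = n_sentences / elapsed if elapsed > 0 else float("inf")
    return results
```

Running it on a sample of a few hundred thousand sentences, starting from 1-2 workers, should show at which point adding workers stops helping.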