oborchers / Fast_Sentence_Embeddings

Compute Sentence Embeddings Fast!
GNU General Public License v3.0

Best way to save a fine-tuned vectorizer object for later use #71

Closed adkinsty closed 1 year ago

adkinsty commented 1 year ago

Thanks for creating this package! I just have one quick question.

After fine-tuning the vectorizer on my text:

from fse import Vectors, Average, IndexedList

vecs = Vectors.from_pretrained(model_name)  # here, "fasttext-crawl-subwords-300"
vectorizer = Average(vecs)
vectorizer.train(IndexedList(sent_train))

what is the best way to save the vectorizer object for later use? Currently I am trying to use pickle, like so:

import pickle

with open(f'{path}/vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

The resulting pickle file has a size of 6.9 GB.
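For completeness, here is the gensim-style alternative I am considering. This is only a sketch: it assumes the Average object inherits gensim's SaveLoad utilities and therefore exposes .save() and Average.load(), which I have not verified for fse; path and sent_train are the same placeholders as in the snippet above.

from fse import Vectors, Average, IndexedList

vecs = Vectors.from_pretrained("fasttext-crawl-subwords-300")
vectorizer = Average(vecs)
vectorizer.train(IndexedList(sent_train))

# Assumed API: gensim-style persistence inherited from utils.SaveLoad.
vectorizer.save(f'{path}/vectorizer.model')

# Later, in a fresh session:
vectorizer = Average.load(f'{path}/vectorizer.model')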

Thanks for your time.

adkinsty commented 1 year ago

Ah, actually, perhaps I was confused. I had assumed that the .train() method does some sort of fitting/fine-tuning on the text, whereas .infer() merely transforms it. If .train() does not actually fine-tune anything, then there is no need to save the vectorizer for re-use: I can simply initialize a new vectorizer and use it to transform new text data.

P.S. The pre-trained model I'm using here is fasttext-crawl-subwords-300.
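In other words, something like the following should be enough. This is a sketch only: sent_train and sent_new are placeholder lists of sentences, and it assumes .infer() accepts the same indexed input that .train() does.

from fse import Vectors, Average, IndexedList

# Re-create the vectorizer from the pre-trained vectors; if .train() only
# builds the averaged sentence vectors and does no fine-tuning, nothing
# learned here needs to be persisted.
vecs = Vectors.from_pretrained("fasttext-crawl-subwords-300")
vectorizer = Average(vecs)
vectorizer.train(IndexedList(sent_train))

# Transform previously unseen sentences without pickling the object.
new_embeddings = vectorizer.infer(IndexedList(sent_new))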