pdrm83 / sent2vec

How to encode sentences in a high-dimensional vector space, a.k.a., sentence embedding.
MIT License

glove path error #7

Open yassmine-lam opened 3 years ago

yassmine-lam commented 3 years ago

Hi,

I tried to use the word2vec code with the GloVe embeddings file glove.6B.300d.txt, but I got this error:

ValueError: invalid literal for int() with base 10: 'the'

Could someone help, please?

Thank you.

almarengo commented 2 years ago

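The error occurs because GloVe text files lack the header line (vocabulary size and vector dimension) that the word2vec text format starts with, so the loader tries to parse the first word, 'the', as an integer. gensim can convert a GloVe file into word2vec format, after which it loads normally.

Pretrained GloVe weights come in several sizes and are available from the official GloVe page.

Download the file from that page, then set glove_file to the path of the file you downloaded.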

This is how you would implement it within sent2vec:

from sent2vec.vectorizer import Vectorizer
from sent2vec.splitter import Splitter

from gensim.test.utils import get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

sentences = [
    "Alice is in the Wonderland.",
    "Alice is not in the Wonderland.",
]

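# Convert the GloVe file to word2vec text format (adds the header line gensim expects).
glove_file = 'glove.6B.300d.txt'  # path to the downloaded GloVe file
word2vec_glove_file = get_tmpfile("glove.6B.300d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)

# Split the sentences into words; 'not' is removed from the stop-word list so it is kept.
splitter = Splitter()
splitter.sent2words(sentences=sentences, remove_stop_words=['not'], add_stop_words=[])

# Embed the sentences using the converted vectors.
vectorizer = Vectorizer()
vectorizer.word2vec(splitter.words, pretrained_vectors_path=word2vec_glove_file)
vectors = vectorizer.vectors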
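
For context on why the conversion step is needed, here is a minimal standard-library sketch of the difference between the two file formats. It is not gensim's implementation: the helper names `read_word2vec_header` and `add_word2vec_header` and the two-word toy file are made up for illustration. The word2vec text format starts with a `<vocab_size> <vector_size>` line, while a GloVe file starts directly with a word, which is what triggers the `ValueError` above.

```python
import os
import tempfile


def read_word2vec_header(path):
    """Parse the first line of a word2vec text file as '<vocab_size> <vector_size>'."""
    with open(path, encoding="utf-8") as f:
        first = f.readline().split()
    return int(first[0]), int(first[1])


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prepending the word2vec header line.

    Hypothetical helper: reads the whole file into memory, which is fine for a
    toy file but not how you would handle a multi-gigabyte embedding file.
    """
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    vocab_size = len(lines)
    vector_size = len(lines[0].split()) - 1  # the first token on each line is the word
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{vocab_size} {vector_size}\n")
        f.writelines(lines)
    return vocab_size, vector_size


# Toy stand-in for glove.6B.300d.txt: two words with 3-dimensional vectors.
tmp_dir = tempfile.mkdtemp()
glove_path = os.path.join(tmp_dir, "toy_glove.txt")
with open(glove_path, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\n")
    f.write("alice 0.4 0.5 0.6\n")

# Reading the raw GloVe file as word2vec fails, because 'the' is not an integer.
try:
    read_word2vec_header(glove_path)
    header_error = None
except ValueError as exc:
    header_error = str(exc)

# After prepending the header, the same parse succeeds.
converted_path = os.path.join(tmp_dir, "toy_glove.word2vec.txt")
converted_shape = add_word2vec_header(glove_path, converted_path)
parsed_header = read_word2vec_header(converted_path)

print(header_error)
print(parsed_header)
```

For the real file, the `glove2word2vec` call shown earlier handles the conversion; this sketch only illustrates what changes in the file.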

I hope it helps.