jhlau / doc2vec

Python scripts for training/testing paragraph vectors
Apache License 2.0

Pretrained Embedding, TypeError: don't know how to handle uri #24

Closed gg2572 closed 6 years ago

gg2572 commented 6 years ago

@jhlau Hey Jey, I hope you're doing well. This is Gan. I'm trying to use your forked gensim and load a pre-trained word vector file, wiki-news-300d-1M.vec, from https://fasttext.cc/docs/en/english-vectors.html; however, I'm getting TypeError: don't know how to handle uri, which I believe comes from the smart_open function. My training corpus is very small, so I think it's better to initialize with pre-trained vectors. Here is the code:

import io
import gensim

vector_size = 100
window_size = 10
min_count = 1
sampling_threshold = 1e-5
negative_size = 5
train_epoch = 50
dm = 0  # 0 = dbow; 1 = dmpv
worker_count = 4  # number of parallel processes

wordvec = "/Users/ggao/Downloads/wiki-news-300d-1M.vec"

def load_vectors(fname):
    # Read fastText .vec format: first line is "vocab_size dim", then
    # one token per line followed by its vector components.
    data = {}
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        n, d = map(int, fin.readline().split())
        for line in fin:
            tokens = line.rstrip().split(' ')
            data[tokens[0]] = list(map(float, tokens[1:]))
    return data

pretrained_emb = load_vectors(wordvec)

# This call raises: TypeError: don't know how to handle uri
pre_model = gensim.models.doc2vec.Doc2Vec(
    documents=train_corpus, dm=dm, size=vector_size, window=window_size,
    min_count=min_count, sample=sampling_threshold, negative=negative_size,
    workers=worker_count, pretrained_emb=pretrained_emb, iter=train_epoch)

Do you have any idea what's wrong here? Thank you so much, and I look forward to your reply.

Best, Gan

jhlau commented 6 years ago

It's likely that the pre-trained word vector was trained with a newer version of gensim, and so you can't load it with my (very old) forked version of gensim.
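
For reference, a minimal check of which gensim build is actually being imported:

import gensim

# The fork is based on an old gensim release; if this prints a recent
# version, you are not running the fork.
print(gensim.__version__)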

gg2572 commented 6 years ago

Hi @jhlau, thank you for your quick reply. Can I use the pre-trained word2vec model from your repo https://github.com/jhlau/doc2vec? Does the model contain the word embeddings? (I can't find them in the model's values.)

Best, Gan

jhlau commented 6 years ago

You can find the pre-trained word2vec models in the README: https://github.com/jhlau/doc2vec/blob/master/README.md

We released pre-trained English Wikipedia and AP-NEWS word embeddings.
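
A minimal sketch of loading one of those released models with the forked gensim; the file name below is a placeholder for whichever archive you download, and binary=True assumes the file is in word2vec binary format:

import gensim

# Placeholder path: substitute the model file extracted from the release.
# In the old forked gensim, load_word2vec_format is a method on Word2Vec
# itself; newer gensim moved it to KeyedVectors.
word_model = gensim.models.Word2Vec.load_word2vec_format(
    "apnews_sg/word2vec.bin", binary=True)

If the model was instead saved with gensim's native save(), gensim.models.Word2Vec.load() is the call to use.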

gg2572 commented 6 years ago

@jhlau Thank you. I loaded the word2vec model, used save_word2vec_format to save the trained vectors, and then passed the saved file to the doc2vec model. It's working now. Thanks again for your help!
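
For anyone who lands here later, a rough sketch of that workflow under the forked gensim. The paths are placeholders; the key point is that the fork's pretrained_emb argument expects a path to a word2vec-format file, so passing an in-memory dict, as in the original snippet, is what produces the smart_open error:

import gensim

# 1. Load the released word2vec model (placeholder path; use
#    load_word2vec_format instead if the file is in word2vec format).
word_model = gensim.models.Word2Vec.load("word2vec_model/model.bin")

# 2. Export it to plain word2vec text format. In the old fork,
#    save_word2vec_format is a method on the model; newer gensim
#    exposes it as model.wv.save_word2vec_format.
word_model.save_word2vec_format("pretrained_word_embeddings.txt", binary=False)

# 3. Hand the saved file, not a dict, to Doc2Vec. size should match
#    the dimensionality of the pre-trained vectors.
pre_model = gensim.models.doc2vec.Doc2Vec(
    documents=train_corpus, dm=0, size=300, window=10, min_count=1,
    sample=1e-5, negative=5, workers=4, iter=50,
    pretrained_emb="pretrained_word_embeddings.txt")

Note that the model size needs to match the embedding dimension, which is also why the original vector_size = 100 would have clashed with the 300-dimensional fastText file.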