jhlau / doc2vec

Python scripts for training/testing paragraph vectors
Apache License 2.0

save model in non-binary format #11

Closed: michaelwiegand82 closed this issue 7 years ago

michaelwiegand82 commented 7 years ago

How can I save the model in non-binary format? Thank you.

jhlau commented 7 years ago

I don't think the doc2vec code provides a native function for saving a doc2vec model in non-binary format. You can of course manually pull out the weights and save them yourself.

michaelwiegand82 commented 7 years ago

I do not really understand what you mean by "pulling out the weights". What I have done for now is use the training documents as test documents (since we are doing unsupervised classification here, that should not be a problem) and then run the infer_test.py script. Is that what you had in mind?

jhlau commented 7 years ago

Ah, I see. You can do what you are doing now, but the vectors might be a little different when you re-infer them (the inference procedure is basically a pseudo-training step with a randomly initialised document vector).

If all you're looking for is the training document vectors, they are saved in the model, and you can get them with something like the following:

import gensim.models as g  # assumes the forked gensim linked in this thread (needed for pretrained_emb)

model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, dbow_words=1, dm_concat=1, pretrained_emb=pretrained_emb, iter=train_epoch)

vector = model.docvecs[0]  # document vector for the first training document
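Once you have the vectors, "saving in non-binary format" can be as simple as writing them to a text file yourself. Below is a minimal sketch, not part of the doc2vec scripts: the helper names save_vectors_as_text and load_vectors_from_text, the one-vector-per-line layout, and the precision parameter are all assumptions chosen for illustration. It uses only the standard library and accepts any sequence of numeric vectors.

```python
def save_vectors_as_text(vectors, path, precision=6):
    """Write one vector per line as '<index> <v1> <v2> ...'.

    Hypothetical helper: `vectors` is any iterable of numeric sequences.
    Returns the number of vectors written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for index, vector in enumerate(vectors):
            # Format each component as fixed-point text, space-separated.
            values = " ".join(f"{float(v):.{precision}f}" for v in vector)
            out.write(f"{index} {values}\n")
            count += 1
    return count


def load_vectors_from_text(path):
    """Read the file written above back into a list of float lists."""
    vectors = []
    with open(path, encoding="utf-8") as src:
        for line in src:
            parts = line.split()
            if parts:
                # parts[0] is the document index; the rest are components.
                vectors.append([float(v) for v in parts[1:]])
    return vectors
```

With a trained model you could then call something like save_vectors_as_text([model.docvecs[i] for i in range(len(model.docvecs))], "doc_vectors.txt"), assuming docvecs supports len() and integer indexing in your gensim version. Note that fixed-point text rounds the values, so the reloaded vectors match the originals only up to the chosen precision.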

jhlau commented 7 years ago

For more information, you can refer to the code: https://github.com/jhlau/gensim/blob/develop/gensim/models/doc2vec.py#L261

michaelwiegand82 commented 7 years ago

Thank you for this helpful information!