ryankiros / skip-thoughts

Sent2Vec encoder and training code from the paper "Skip-Thought Vectors"

Are sentence vectors accessible from the trained model? #48

Open kuhanw opened 7 years ago

kuhanw commented 7 years ago

Dear experts,

I recently stumbled upon the Skip-Thought paper and found it extremely interesting. I have managed to train a small model on some 2.7 million sentences for testing purposes. My primary interest is sentence-to-sentence similarity, computed by comparing distances between the vector embeddings.

My question is: after training, can the vector representations of the training sentences be accessed from the model? I know I can encode the sentences afterward using tools.encode(), but for a large number of sentences this takes quite some time, on top of the training time itself.

Naively, I thought that, analogously to doc2vec models, there would be a dictionary of sentences (like a dictionary of paragraphs/documents) along with their vector embeddings.

Is this the case? Perhaps I misunderstood parts of the paper. I can certainly find the token-level embeddings in the OrderedDict called model['table'].
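For concreteness, here is roughly how I am poking at that table (a sketch only; it assumes the paths at the top of tools.py are configured as described in the training README, and that 'the' is in my vocabulary):

```python
import tools  # training-side tools.py from this repo

# Load the trained model; paths are set at the top of tools.py.
# embed_map enables the vocabulary expansion described in the paper.
embed_map = tools.load_googlenews_vectors()
model = tools.load_model(embed_map)

# Token-level embeddings: an OrderedDict mapping token -> vector.
table = model['table']
print(len(table))          # vocabulary size
print(table['the'].shape)  # dimensionality of one word embedding
```

But I see nothing comparable stored for the sentences themselves.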

Thank you, and keep up the good work!

Kuhan

csiki commented 7 years ago

Not that I know of. Sentence vectors are built on the fly from the stored word vectors; they are not cached in the model.
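If the concern is re-encoding on every run, one workaround is to encode the corpus once and cache the vectors to disk. A minimal sketch (assumes the tools.py paths are configured per the training README; sentences.txt is a hypothetical file with one sentence per line):

```python
import numpy as np
import tools  # training-side tools.py from this repo

# Load the trained model once; paths are set at the top of tools.py.
embed_map = tools.load_googlenews_vectors()
model = tools.load_model(embed_map)

# 'sentences.txt' is a hypothetical file, one training sentence per line.
with open('sentences.txt') as f:
    sentences = [line.strip() for line in f]

# Pay the encoding cost a single time, then cache the result.
vectors = tools.encode(model, sentences)
np.save('sentence_vectors.npy', vectors)

# Later sessions reload instantly instead of re-encoding:
vectors = np.load('sentence_vectors.npy')

# Sentence-to-sentence similarity via cosine:
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(vectors[0], vectors[1]))
```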
