Closed: Tomeu7 closed this issue 5 years ago.
The document embedding indexes should be in the same order as your input dataframe. Just make sure you pass in the idxs as a numpy array of indexes of documents you'd like to check. Let me know if you come across issues with this function...it may have a couple bugs.
"""
Args:
idxs - numpy array of indexes to check similarity to
in_type - string denoting what kind of embedding to check
similarity to. Options are "word", "doc", and "topic"
out_type - same as above, except it will be what we are comparing the
in indexes to.
k - Number of closest examples to get
idx_to_word - index to word dictionary mapping. If passed, it will translate the indexes.
NOTE: Acceptable pairs include
word - word
word - topic
topic - word
doc - doc
"""
Thanks, I've just done that.
Does it make sense that the results are very bad?
I used the original hyperparameters with my own dataset, training for 10 epochs (no Google News pretrained embeddings).
I can't understand the printed loss. Initially it is something like: loss: 5.035, word2vec: 5.035, lda: 3805.
The word2vec and loss values are the same and go down, but lda goes up.
Please train longer. After >20 epochs you will start to see results.
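For context on why the printed terms can move in opposite directions: in Moody's lda2vec paper the objective is a skip-gram negative-sampling loss plus a Dirichlet log-likelihood term on the document-topic proportions, scaled by a constant and often only switched on after some warm-up epochs. A rough sketch of that composition (the names and the switch behavior are illustrative, not necessarily this repo's exact code):

```python
# Illustrative composition of the lda2vec objective (Moody, 2016).
# Names are hypothetical; check the repo for the actual variables.
LAMBDA = 200.0        # strength of the Dirichlet (lda) term
SWITCH_EPOCH = 5      # epoch after which the lda term is included

def total_loss(word2vec_loss, lda_loss, epoch):
    # word2vec_loss: skip-gram negative-sampling loss
    # lda_loss: negative Dirichlet log-likelihood of doc-topic weights
    if epoch < SWITCH_EPOCH:
        # Before the switch, only word2vec trains, which would explain
        # "loss" printing the same value as "word2vec" early on.
        return word2vec_loss
    return word2vec_loss + LAMBDA * lda_loss
```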
Hello nateraw, nice implementation of lda2vec!
I am investigating the use of lda2vec to get the most similar phrases in a text. Initially I have a dataframe with M rows, where M is the number of phrases.
I train lda2vec for some epochs and then use the function get_k_closest(idx of the phrase I want to check, in_type='doc', vs_type='doc', k=2). Would this be the idea?
The thing is, I am not sure whether the indices of the docs are the same as the ones in my dataframe at the beginning.
Thanks!
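Following up on the index question, a small sketch of how you might keep the dataframe rows and the doc-embedding rows aligned, assuming (per the answer above) that embedding row i corresponds to input row i. The file path and `model` object are illustrative, not from the repo:

```python
import numpy as np
import pandas as pd

# Illustrative input; reset the index so row positions are unambiguous.
df = pd.read_csv("phrases.csv").reset_index(drop=True)

query_pos = 42            # positional row of the phrase to check
print(df.loc[query_pos])  # the phrase being queried

# Hypothetical trained model, as in the sketch further up the thread.
closest = model.get_k_closest(
    np.array([query_pos]), in_type="doc", vs_type="doc", k=2
)
# Assuming `closest` contains positional doc indexes, they can be mapped
# back to phrases with df.loc on those positions.
```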