nateraw / Lda2vec-Tensorflow

TensorFlow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License

Question about get_k_closest #16

Closed Tomeu7 closed 5 years ago

Tomeu7 commented 5 years ago

Hello nateraw, nice implementation of lda2vec!

I am investigating the use of lda2vec to get the most similar phrases in a text. Initially I have a dataframe with M rows, where M is the number of phrases.

I train lda2vec for some epochs and then call get_k_closest(idx of the phrase I want to check, in_type='doc', vs_type='doc', k=2). Is this the right idea?

What I am not sure about is whether the doc indices are the same as the row indices that my dataframe had at the beginning.

Thanks!

nateraw commented 5 years ago

The document embedding indexes should be in the same order as your input dataframe. Just make sure you pass the idxs in as a numpy array of the document indexes you'd like to check. Let me know if you come across issues with this function... it may have a couple of bugs.

        """
        Args:
        idxs - numpy array of indexes to check similarity to
        in_type - string denoting what kind of embedding to check
                  similarity to. Options are "word", "doc", and "topic"
        out_type - same as above, except it will be what we are comparing the
                   in indexes to.
        k - Number of closest examples to get
        idx_to_word - index to word dictionary mapping. If passed, it will translate the indexes.

        NOTE: Acceptable pairs include
        word - word
        word - topic
        topic - word
        doc - doc
        """
Tomeu7 commented 5 years ago

Thanks, I have just done that.

Does it make sense that the results are very bad?

I used the original hyperparameters with my own dataset, training for 10 epochs (no Google News pretrained embeddings).

I can't understand the printed loss. Initially it is something like this: loss: 5.035, word2vec: 5.035, lda: 3805

The total loss and the word2vec loss are the same and both go down, but the lda loss goes up.
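
For reference, a minimal sketch of how these three numbers typically relate in lda2vec. This assumes the standard lda2vec objective (word2vec negative-sampling loss plus a weighted Dirichlet loss, often enabled only after a "switch" step) and is not a verbatim excerpt from this repo:

    # Sketch of the usual lda2vec objective (an assumption, not this repo's code):
    # total loss = skip-gram loss + weight * Dirichlet loss. If the Dirichlet
    # term is weighted in only after a "switch" step, the printed total loss
    # equals the word2vec loss early in training, as in the numbers above.
    word2vec_loss = 5.035    # negative-sampling loss for this batch
    lda_loss = 3805.0        # Dirichlet loss on document-topic proportions
    switch_reached = False   # hypothetical flag: lda term not yet active
    lda_weight = 1.0 if switch_reached else 0.0
    total_loss = word2vec_loss + lda_weight * lda_loss
    print(f"loss: {total_loss:.3f}, word2vec: {word2vec_loss:.3f}, lda: {lda_loss:.0f}")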

nateraw commented 5 years ago

Please train longer. After 20+ epochs you will start to see results.