nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License
107 stars, 40 forks

Same result for all generated topics #4

Closed XiaofengZHOU closed 6 years ago

XiaofengZHOU commented 6 years ago

To my understanding, the topic matrix that gets generated lives in the same space as the word vectors. We use the topic matrix to find the most similar words by cosine similarity, and the words we find then act as a representation of the topic.

But the results I got show that all the generated topic vectors are almost the same, and I can't figure out why. Here are some of the results:

    print(topic[4])
    [-0.9622485 0.8895183 0.8651555 -0.9276399 -0.9396336 0.93779755 -0.9743131 0.94305694 -0.92948157 -1.0672562 0.946625 -0.99164987 0.8959647 0.95344895 0.9274684 -0.97949797 0.97142816 0.947076 -1.0015502 0.96531034 0.8757545 0.94082266 0.954677 -0.97633624 0.87975 0.9366757 -0.93371624 -0.85707355 0.98357856 -0.93866247 0.9577415 0.94209754 -0.97033393 -0.9504832 0.9234292 0.9165397 -0.9694142 0.91393214 0.9972066 -0.9942078 -0.9907095 -0.9176958 0.93074447 -0.8706515 -0.92425114 -1.0101646 0.95657563 -1.0012354 0.95422584 -0.764645 0.9863512 -0.99371105 0.9823682 0.64269054 -0.9487983 -0.56981754 -1.0187954 0.9872439 0.67288846 0.92767256 -0.95255184 0.7126149 -0.92712885 0.9122812 0.8112471 -1.0150576 0.80759007 0.9772657 0.974533 -0.89622474 0.96457 -0.94705147 0.9997022 -0.9722624 0.9418657 0.9430709 -0.96311724 -0.97360986 -0.8987086 0.9817178 0.8594237 -0.93254995 -0.87266266 0.98293287 -0.6322944 -0.9245911 0.95225286 -1.0082532 -0.9219543 -0.9784668 -0.9714366 -0.9701755 0.9802913 -0.94296515 -0.89987594 -0.9654876 -0.92532563 -0.9081519 0.7952786 0.9535129 ]
    print(topic[5])
    [-0.9545226 0.8836221 0.8790286 -0.9436467 -0.9647125 0.95075834 -0.9890084 0.9377537 -0.94952726 -1.0689101 0.980626 -0.9908181 0.8998709 0.94127303 0.9263142 -0.96562505 0.99156046 0.95024383 -1.0077744 0.99384195 0.8860567 0.92229956 0.9736233 -0.96262467 0.89396423 0.9315409 -0.9482396 -0.85639435 0.9852119 -0.9602194 0.95691586 0.94624454 -0.98274666 -0.9827932 0.9232413 0.93340456 -0.97113854 0.93778706 1.0019037 -0.9843718 -1.0034899 -0.92478126 0.95473534 -0.8701034 -0.9313964 -0.9949094 0.97523534 -1.0191345 0.9864202 -0.73943955 1.0138143 -0.9930289 0.9773597 0.6448753 -0.94340485 -0.55352324 -1.004822 0.99961305 0.68788236 0.9397265 -0.9823522 0.75456184 -0.9445327 0.9221488 0.8499458 -1.0050296 0.8211724 0.9643316 0.98302233 -0.8961856 0.9766408 -0.9336542 1.0224456 -0.982251 0.9577986 0.97083366 -0.94915915 -0.9802646 -0.9033424 0.97875696 0.8598247 -0.91498125 -0.8607036 0.98732114 -0.643369 -0.93571526 0.96445656 -1.0014955 -0.94695365 -0.9552077 -0.98248726 -0.99457294 0.9754661 -0.9417462 -0.87800306 -0.9567253 -0.94087964 -0.9052637 0.78514445 0.94565785]
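
For reference, the nearest-word lookup described above comes down to something like this minimal sketch (the names topic_matrix, word_vectors, and idx_to_word are illustrative placeholders, not the repo's actual API):

    import numpy as np

    def nearest_words(topic_matrix, word_vectors, idx_to_word, top_k=10):
        # Normalize rows so a plain dot product equals cosine similarity
        t = topic_matrix / np.linalg.norm(topic_matrix, axis=1, keepdims=True)
        w = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
        sims = t @ w.T                              # shape: (n_topics, vocab_size)
        top = np.argsort(-sims, axis=1)[:, :top_k]  # indices of the closest words per topic
        return [[idx_to_word[i] for i in row] for row in top]

    # If all topic vectors are nearly identical (as in the printout above),
    # every topic comes back with essentially the same nearest words.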

nateraw commented 6 years ago

How long did you train it for?

XiaofengZHOU commented 6 years ago

In fact, I found a small problem in your code in the preprocessing part.

    self.T = Tokenizer(num_words=self.vocab_size)

    # Fit the tokenizer with the texts
    self.T.fit_on_texts(text_clean)

    # Turns our input text into sequences of index numbers
    self.data = self.T.texts_to_sequences(text_clean)

    # Subtract 1 from each index
    self.data = [[col+1 for col in row] for row in self.data]

    # Delete to reduce memory
    del text_clean

Regarding this line: self.data = [[col+1 for col in row] for row in self.data]. As I understand it, you want the indices to start from 0, but Keras starts indexing from 1, so this should subtract 1 rather than add 1.

So the code below makes sense:

    self.word_to_idx = {k: v-1 for k, v in self.T.word_index.items()}

    # Flip word_to_idx to get idx_to_word
    self.idx_to_word = {v: k for k, v in self.word_to_idx.items()}
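
Put together, the suggested fix would look roughly like this standalone sketch using the Keras Tokenizer (not the repo's exact class; vocab_size and text_clean are placeholder inputs):

    from keras.preprocessing.text import Tokenizer

    # Placeholder inputs, for illustration only
    vocab_size = 5000
    text_clean = ["first cleaned document", "second cleaned document"]

    T = Tokenizer(num_words=vocab_size)
    T.fit_on_texts(text_clean)

    # Keras indexes words starting at 1, so shift every index down by 1 to get 0-based ids
    data = [[col - 1 for col in row] for row in T.texts_to_sequences(text_clean)]

    # Keep the lookup dictionaries consistent with the 0-based ids above
    word_to_idx = {word: idx - 1 for word, idx in T.word_index.items()}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}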

And this small mistake leads to an indexing problem when loading the pretrained GloVe word vectors. Hope you can check it out.

nateraw commented 6 years ago

Yeah, this is a known issue. I'm new to managing issues on GitHub, but when I get the chance I'll figure out how to consolidate the duplicate issues and address this bug with an update.

In the new version, which I promise I will post today, the preprocessing file will be completely different. I will use spaCy as the original author intended. For now, my suggestion would be to do your own preprocessing if you can. Sorry for the hassle!

nateraw commented 6 years ago

In response to the topics becoming very similar, this is puzzling me. The math seems to be right, but I am experiencing the same thing. In the original repo, there is an issue that mirrors this one:

https://github.com/cemoody/lda2vec/issues/37

Apparently, some users say you have to tune the hyperparameters correctly and also let it train for a very long time in order for it to work. However, even after many experiments, I have yet to see this work.

nateraw commented 6 years ago

Closing this, as the most recent update fixes this problem. Use the original loss function, not the negative one.

gveni commented 5 years ago

I am running into the same issue even after using the updated code. I ran it on the twenty_newsgroups data as well as on our own in-house data (about 4000 data points altogether), for 250 epochs with the default parameters.

nateraw commented 5 years ago

Thank you for commenting on an already existing issue and not making a new one!!

Default parameters are actually not the best, as far as I've seen. Perhaps more tweaks should be made and pushed. Believe me, I know how frustrating it is to wait that long and see nothing; it's happened to me countless times.

If you're willing to dig into the preprocessing code a bit to get better results (until I push some), you can try:

  1. Use spaCy's en_core_web_lg model instead of en_core_web_sm.
  2. Try token_type=lemma instead of lower (this is what Moody supposedly used in his paper, though in his online example he used lower). If you do this, remove "tagger" from the list of disabled pipelines (https://github.com/nateraw/Lda2vec-Tensorflow/blob/76edee49c9f33164a0e4fc63fd12da8de0594fcc/lda2vec/nlppipe.py#L41), because otherwise it won't lemmatize correctly (as per this spaCy issue: https://github.com/explosion/spaCy/issues/1901). A rough sketch of these settings is shown below.

Preprocessing will be much slower with these settings, but it helps.
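
Roughly, those settings amount to something like the sketch below. This is illustrative only, not the repo's nlppipe code; which pipeline components to disable and the stop-word/punctuation filtering are assumptions.

    import spacy

    # Load the large model; keep the tagger enabled so lemmas come out correctly
    nlp = spacy.load("en_core_web_lg", disable=["parser", "ner"])

    def lemmatize_texts(texts):
        # Lowercased lemmas, skipping stop words and punctuation (the filtering choices here are assumptions)
        return [
            [tok.lemma_.lower() for tok in doc if not (tok.is_stop or tok.is_punct)]
            for doc in nlp.pipe(texts)
        ]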

gveni commented 5 years ago

Thank you, let me follow your suggestions and see if that helps. Thanks!


nateraw commented 5 years ago

@gveni I updated the conditionals in nlppipe.py (https://github.com/nateraw/Lda2vec-Tensorflow/blob/583ff52d716bcab883b46bb559a00eccf608bf04/lda2vec/nlppipe.py#L67) to be more in line with what we are looking for. Make sure you use the en_core_web_lg spaCy model.

gveni commented 5 years ago

Thanks Nathan for your help. Following your previous suggestions, I used spaCy's en_core_web_lg as the pretrained model with token_type set to lemma and ran it for 5000 epochs, but to no avail. I see that you have also modified line 67 of nlppipe.py, related to tokenizing the texts, as per your previous email. Let me see if that has any positive impact on the results. I will let you know.

Thanks, Gopal
