nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License
107 stars · 40 forks

Working Example #8

Closed nateraw closed 6 years ago

nateraw commented 6 years ago

I've been working on this code base for quite a while, but I have yet to see a working example. I've played with calculating the loss function differently, all sorts of hyperparameters, and different ways of preprocessing the data, but I still haven't seen this code, or the original author's, actually work.

So, if anybody wants to contribute an example that is reproducible, please let me know! Let me know if I can help explain what's going on in any of the files. Thank you.

stalhaa commented 5 years ago

@nateraw, you are right, but according to this: https://www.mastodonc.com/2016/08/03/better-topic-detection-in-text-with-lda2vec/, lda2vec predicts much better topics compared to LDA and other methods. In my experiments, however, the results of LDA are much better than those extracted from lda2vec. So how can I show that lda2vec is better than LDA? Suggestions needed.

nateraw commented 5 years ago

@stalhaa As noted in the article you linked to, Lda2Vec takes a significant amount of time to train compared to LDA. Additionally (from my experience), it is wildly variable based on the preprocessing you do. I feel like "proving" the topics are better is sort of a qualitative thing. However, there is a technique mentioned in the paper that allows you to check the learned topics against labeled topics (I'm on mobile, otherwise I'd check). My suggestion would be to look into that if you truly want to "prove" the learned topics are better than LDA alone.
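For a quantitative comparison, topic coherence is one option; the lda2vec paper reports a coherence-style evaluation, and gensim's CoherenceModel is a convenient stand-in (my suggestion, not something built into this repo). A minimal sketch, assuming `raw_documents` is your corpus and `lda_topics` / `lda2vec_topics` are lists of top-word lists per topic:

```python
# Hypothetical comparison: score each model's topics with c_v coherence.
# `raw_documents`, `lda_topics`, and `lda2vec_topics` are assumed inputs.
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

texts = [doc.lower().split() for doc in raw_documents]  # naive tokenization
dictionary = Dictionary(texts)

def coherence(topics):
    cm = CoherenceModel(topics=topics, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()

print("LDA     c_v:", coherence(lda_topics))
print("lda2vec c_v:", coherence(lda2vec_topics))
```

Higher coherence correlates (imperfectly) with human judgments of topic quality, so it is a reasonable proxy when no labeled topics are available.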

The coolest thing mentioned in the paper, which inspired me to work on this, was that you could potentially add multiple contexts to learn topics over other things. For example, you could have document embeddings and also embeddings for some other arbitrary thing (such as customer embeddings). However, both the author's tests and my tests locally have shown these don't offer very fruitful results. If this algorithm were to be changed a bit from an architecture standpoint, you might be able to get better results. I've been trying to figure out how to do this with attention, with the hope that you could pinpoint multiple topics in a given sequence with attention pointing to where those topics lie. Again, my tests (not uploaded) haven't been fruitful.

Sorry if this is a roundabout answer, I just figured I'd give you a little more info and context as to why I said what I said earlier.

dbl001 commented 5 years ago

In simple LDA, we generate prior topic weights according to a Dirichlet distribution. To train, we iteratively reassign each word 'w' to a new topic, choosing topic t with probability:

p(topic t | document d) * p(word w | topic t)

Since we have no ground-truth topics for each word, computing a loss in the supervised-learning sense is not applicable.
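For concreteness, a toy sketch (not from this repo) of that collapsed Gibbs sampling step, with count matrices n_dt (document-topic), n_tw (topic-word), and n_t (topic totals):

```python
# Toy illustration of the resampling step described above: a word occurrence
# is reassigned to topic t with probability proportional to
# p(topic t | document d) * p(word w | topic t).
import numpy as np

def resample_topic(d, w, z, n_dt, n_tw, n_t, alpha, beta, V):
    """d: doc index, w: word id, z: the token's current topic."""
    # remove the token's current assignment from the counts
    n_dt[d, z] -= 1; n_tw[z, w] -= 1; n_t[z] -= 1
    # p(t | d) * p(w | t), up to a normalizing constant
    p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + beta * V)
    z_new = np.random.choice(len(p), p=p / p.sum())
    # add the token back under its new topic
    n_dt[d, z_new] += 1; n_tw[z_new, w] += 1; n_t[z_new] += 1
    return z_new
```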

In Lda2vec-tensorflow the prior is generated by

doc_prior = DL.dirichlet_likelihood(self.mixture_doc.Doc_Embedding, alpha=self.alpha)

During training, how does Lda2vec-tensorflow reassign each word to a 'better' topic? E.g., loss_lda = self.lmbda * fraction * self.prior(docs)
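For reference, DL.dirichlet_likelihood appears to compute roughly the following (my paraphrase; see dirichlet_likelihood.py in this repo for the exact code):

```python
# Paraphrase of what DL.dirichlet_likelihood seems to compute (TF 1.x).
import tensorflow as tf

def dirichlet_likelihood(weights, alpha=None):
    n_topics = weights.get_shape()[1].value
    if alpha is None:
        alpha = 1.0 / n_topics
    # softmax turns unconstrained document weights into a topic mixture;
    # with alpha < 1, maximizing this term rewards sparse (peaky) mixtures
    log_proportions = tf.nn.log_softmax(weights)
    return tf.reduce_sum((alpha - 1.0) * log_proportions)
```

If that is right, then as far as I can tell words are never explicitly reassigned: the document weights (and hence the softmax mixture) are simply updated by backpropagation, with this prior term pushing each document toward a few dominant topics. Is that correct?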


stalhaa commented 5 years ago

@nateraw, @dbl001: is there any single limitation of LDA which can be overcome by applying lda2vec?

dbl001 commented 5 years ago

LDA generates dense topic vectors which are hard for humans to interpret. LDA2VEC generates sparse topic vectors. These vectors are similar to the word2vec embeddings, which allow vector operations like:

King - man + woman ≈ queen.

LDA2VEC vectors should support these vector operations on topics which summarize the document collection.
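For concreteness, a toy numpy sketch of that analogy arithmetic, assuming `vectors` maps words (or topic labels) to unit-normalized embedding arrays:

```python
# Hypothetical example: nearest-neighbor analogy over unit-normalized vectors.
import numpy as np

def analogy(vectors, a, b, c):
    """Find the entry closest to vectors[a] - vectors[b] + vectors[c]."""
    target = vectors[a] - vectors[b] + vectors[c]
    target /= np.linalg.norm(target)
    # cosine similarity reduces to a dot product on unit vectors
    return max((w for w in vectors if w not in (a, b, c)),
               key=lambda w: float(vectors[w] @ target))

# analogy(word_vectors, "King", "man", "woman")  # ideally returns "queen"
```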

@nateraw, please correct me if I’m wrong.

On Feb 16, 2019, at 12:21 PM, stalhaa notifications@github.com wrote:

@nateraw https://github.com/nateraw , @dbl001 https://github.com/dbl001 is there any single limitation of LDA which can be overcome by applying lda2vec ??

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nateraw/Lda2vec-Tensorflow/issues/8#issuecomment-464381039, or mute the thread https://github.com/notifications/unsubscribe-auth/AC9i2zfCSoujCkDrXLSONpPw2flkUaPCks5vOGhGgaJpZM4VIAiH.

stalhaa commented 5 years ago

@dbl001 Please don't mind me bothering you again, but would you please apply my dataset to the lda2vec code and fetch 20 topics and their vectors? And please share your email address too, so that I can ask you about it privately.

dbl001 commented 5 years ago

Sure. What’s your email?


stalhaa commented 5 years ago

@dbl001

sana.talha26@gmail.com

ArashDehghanyan commented 3 years ago


Hi dear, I need to extract keywords and features from product reviews. I saw lda2vec on the internet and need to apply it to my own dataset to extract my keywords, but I couldn't get any of the code to produce topics specific to my data. Can you help me reproduce a working example? Sincerely yours.

nateraw commented 3 years ago

@ArashDehghanyan I'm no longer able to maintain this repo, so I'm afraid I can't be of any assistance to you at this time.

My suggestion is to look into some more recent papers and their publicly available implementations; paperswithcode.com can probably help you with this.

This repo can for sure do what you're trying to do, but algorithmically it is tricky to identify the right hyperparameters for different datasets. For that reason, I think your time may be better spent on more recent models :smile: Cheers, and good luck with your project.

dbl001 commented 3 years ago

Top2Vec

https://github.com/ddangelov/Top2Vec
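A minimal usage sketch, based on the Top2Vec README (check the repo above for current API details); notably, it infers the number of topics on its own, so no cluster count has to be chosen up front:

```python
# Sketch based on the Top2Vec README; `docs` is an assumed list of strings.
from top2vec import Top2Vec

model = Top2Vec(documents=docs, speed="learn", workers=4)

print(model.get_num_topics())  # number of topics discovered automatically
topic_words, word_scores, topic_nums = model.get_topics(5)  # top words per topic
```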


ArashDehghanyan commented 3 years ago

Thank you so much. Best regards.


ArashDehghanyan commented 3 years ago

Hi dear Nathan, first of all, thank you so much. As you said, we need to identify the hyperparameters, especially the number of clusters. Would you mind introducing me to newer models that can automatically determine the number of clusters? I have already used DBSCAN but didn't obtain good results. Best regards.
