**Open** · wanglu2014 opened this issue 3 years ago
Hi @wanglu2014, it is better to use the Pyro forum for questions and discussions. We use GitHub to track issues and feature requests. For your question: in the tutorial, the rank of a word is determined by the corresponding value in `beta`. And we need to know the number of topics in advance (it is used in both the encoder and the decoder). If you know the topic names and their corresponding "vague" word distributions in advance, you can set your priors to encode that knowledge.
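For example, here is a minimal sketch of ranking words by their `beta` values, assuming `beta` is the `(num_topics, vocab_size)` tensor returned by `prodLDA.beta()` and `vocab` is a hypothetical list mapping column indices to words:

```python
import torch

def top_words(beta, vocab, k=10):
    # for each topic (a row of beta), take the k columns with the largest values
    top_values, top_indices = torch.topk(beta, k, dim=-1)
    return [[vocab[i] for i in row.tolist()] for row in top_indices]
```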
Thank you for your timely reply; I will ask in the forum next time. However, the 20 topic names only appear in the variable `news`, and I have not seen any input to the training process that includes the 20 names. If the grouping of the 20 names and words was never given to the model, how did ProdLDA match the result to the 20 names?
The number of topics is a hyperparameter specified in cell 11. You can change it to 10 if you want.
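In the tutorial that hyperparameter is just a variable assignment, roughly (a sketch, not the exact cell contents):

```python
num_topics = 20  # hyperparameter; change to 10 to fit 10 topics instead
```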
Sorry for not clarifying my question. The parameter is obviously 20; however, how is the order of the 20 names matched to the original 20 topics? If I change it to 10, do the first 10 names in the variable `news` match the result?
I see. So you are talking about supervised learning. LDA does not require pre-defined topics. If you use pre-defined topics, I guess you can just optimize the decoder (i.e. perform maximum likelihood) with the same likelihood `Multinomial(doc | total_count, probs=softmax(beta[topic]))`. (On the other hand, if your target is to perform text classification (i.e. predict the topic of each document), then you can use a Categorical likelihood with `probs=encoder(docs)`.)
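For the classification variant, here is a minimal sketch, assuming `encoder` is a placeholder for a network that maps documents to a probability vector over topics (the tutorial's encoder would need a softmax head for this) and `labels` holds the known topic index of each document:

```python
import pyro
import pyro.distributions as dist

def classification_model(docs, labels):
    pyro.module("encoder", encoder)  # register the network's parameters with Pyro
    with pyro.plate("documents", docs.shape[0]):
        probs = encoder(docs)  # (num_docs, num_topics) topic probabilities
        pyro.sample("topic", dist.Categorical(probs=probs), obs=labels)
```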
You are right. I have predefined topics, and I want to infer the rank of words for each topic. The input data is like in our example. Might other methods solve this problem?
In that case, I think you can perform maximum likelihood (optimizing the decoder's parameters, as I mentioned in the last comment). In Pyro, the pseudo-code looks like this (for details, you can just mimic the ProdLDA tutorial):
```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def model(topics, docs):
    beta = pyro.param("beta", torch.randn(num_topics, vocab_size))  # parameter to optimize (sizes are placeholders)
    total_count = int(docs.sum(-1).max())  # as in the ProdLDA tutorial
    probs = torch.softmax(beta[topics], dim=-1)  # probabilities of each word in a topic (softmax, matching the likelihood above)
    return pyro.sample("obs", dist.Multinomial(total_count, probs), obs=docs)  # likelihood

def guide(topics, docs):
    pass  # no latent variables, so the guide is empty

svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
```
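Training is then just the standard SVI loop, and the learned `beta` gives the per-topic word ranking (a sketch; the number of steps is arbitrary):

```python
for step in range(1000):
    svi.step(topics, docs)

beta = pyro.param("beta")
ranked = torch.argsort(beta, dim=-1, descending=True)  # word indices, best first, per topic
```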
According to the tutorial (the last figure in https://pyro.ai/examples/prodlda.html), the tensor returned by `prodLDA.beta()` can be matched to the topic names. However, the number of topics in ProdLDA was set manually by me, so how do we match topics to the tensor of `prodLDA.beta()`? If our input is a matrix like the one below, how do we know the rank of words for each topic?
| feature.name | A | B | C | D |
| --- | --- | --- | --- | --- |
| Attr.type | topic | topic | word | word |
| document.1 | 5 | 3 | 2 | 5 |
| document.2 | 4 | 2 | 4 | 8 |
| document.3 | 3 | 1 | 5 | 2 |