vlukiyanov / pt-avitm

PyTorch implementation of AVITM (Autoencoding Variational Inference For Topic Models)
MIT License

Using your example how do we get the top 'k' topics? #25

Closed lionely closed 5 years ago

lionely commented 5 years ago

Thank you for the code! I am trying to use it to get topics, but:

result = pipeline.transform(texts)
score = pipeline.score(result)

result is a numpy array, but what does score take? Help would be greatly appreciated! Great code once again!

vlukiyanov commented 5 years ago

The score should be the topic coherence, see https://github.com/vlukiyanov/pt-avitm/blob/master/ptavitm/sklearn_api.py. The result should be the topic distribution for your input texts; you could apply argmax to that to find the top topic for each document, for example.
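
For example, a minimal sketch (assuming result is the array returned by pipeline.transform):

import numpy as np

# result has shape (number of documents, number of topics); argmax along
# axis 1 gives the index of the highest-weighted topic for each document
top_topic = np.argmax(result, axis=1)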

lionely commented 5 years ago

Thank you for the response! But how would we get back the actual words? These operations will only return numbers, right?

vlukiyanov commented 5 years ago

These functions map texts to topics. If you want to find out which words make up a topic, you can do something similar to https://github.com/vlukiyanov/pt-avitm/blob/7df1fbd86bbbbe3660e58f839e798beb92c025d4/examples/20news/20news.py#L104; there isn't really any way to do that using the scikit-learn style API at the moment (if you do end up writing any utility functions to do that, please open a PR, I just haven't yet had the time to add them).

lionely commented 5 years ago

But if you were to return result like in your example on the README, how would you get from that back to words?

Nevermind!

vlukiyanov commented 5 years ago

I think the best way would be to add a method to the transformer object to report this for the topics. If you look at https://github.com/vlukiyanov/pt-avitm/blob/7df1fbd86bbbbe3660e58f839e798beb92c025d4/ptavitm/sklearn_api.py#L39, the autoencoder attribute is accessible, so the same method I linked to above would work; I'd like to add it as a method to the transformer. You would then be able to get the top words for a topic and the top topics for any text after fitting the model.
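
As a rough illustration, such a method might look something like the sketch below; top_topic_words is a hypothetical name, and it assumes the fitted transformer exposes the autoencoder attribute with a decoder.linear layer as in sklearn_api.py:

def top_topic_words(transformer, indexed_vocab, k=10):
    # the decoder weight has shape (vocabulary size, number of topics)
    weight = transformer.autoencoder.decoder.linear.weight.detach().cpu()
    # indices of the k highest-weighted vocabulary entries per topic
    indices = weight.topk(k, dim=0)[1].t()
    return [[indexed_vocab[i.item()] for i in topic] for topic in indices]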

lionely commented 5 years ago

So I'm not sure if the code works:

import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from ptavitm.sklearn_api import ProdLDATransformer

texts = summary_grouped_country_df[0]
count_vec = CountVectorizer(stop_words='english', max_features=2500, max_df=0.9)
plda = ProdLDATransformer(topics=10)
pipeline = make_pipeline(count_vec, plda)

#print(texts[0:])
start = time.time()
pipeline.fit(texts)
end = time.time()
elapsed = end - start
print("Took %f to train topic model" % elapsed)
result = pipeline.transform(texts)
print(count_vec.get_feature_names())
print(count_vec.inverse_transform(result[0]))

So the count_vectorizer has an inverse_transform which will take the indices from the output of the neural network and return the topics. All it seems to be doing is this: when I specify the number of topics, in this case 10, the final return is just the first 10 entries in

count_vec.get_feature_names()

Maybe I'm doing something wrong, but I imagine this is similar to what you are doing in the previous posts you linked to, only you didn't use the vectorizer.

Example of count_vec.get_feature_names():

['000', '10', '100', '1000', '11', '12', '13', '130', '14', '15', '150', '16', '16th', '17', '17th', '1866', '18th', '19', '1900s', '1903', '1905', '1919', '1920s', '1924', '1928', '1933', '1948', '1955', '1963', '1970s', '1971', '1976', '1979', '1980', '1980s', '1983', '1984', '1985', '1987', '1990s', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '19th', '20', '200', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', ...]

Example of count_vec.inverse_transform(result[0]):

 [array(['000', '10', '100', '1000', '11', '12', '13', '130', '14', '15'],
      dtype='<U15')]

Sorry for this long post!

Update: I also tried the other way you were suggesting and the result was the same. :(

vlukiyanov commented 5 years ago

I think the misunderstanding above is that the outputs of the encoder network, i.e. pipeline.transform(texts), are not related to the vocabulary but rather to the topics; the encoder encodes a document, represented as a bag of words over the vocabulary, into a topic representation.

Once the model is fitted you can find the top vocabulary weights for each topic, so assuming the start of the example is

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from ptavitm.sklearn_api import ProdLDATransformer
pipeline = make_pipeline(
    CountVectorizer(
        stop_words='english',
        max_features=2500,
        max_df=0.9
    ),
    ProdLDATransformer(topics=10)
)
pipeline.fit(texts)
result = pipeline.transform(texts)
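
At this point result should have one row per document and one column per topic, which you can sanity check (assuming the 10 topics above):

# the second dimension is the number of topics, not the vocabulary size
print(result.shape)  # expected: (number of documents, 10)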

You can then get the vocab out like this, I think

vocab = pipeline.steps[0][1].vocabulary_  # word -> index from the CountVectorizer
reverse_vocab = {vocab[word]: word for word in vocab}  # index -> word
indexed_vocab = [reverse_vocab[index] for index in range(len(reverse_vocab))]  # words in index order
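
As an aside, indexed_vocab should match what the CountVectorizer already exposes, so the following one-liner may work too (assuming a scikit-learn version where get_feature_names is available):

indexed_vocab = pipeline.steps[0][1].get_feature_names()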

And the top words for each topic, at least by the decoder weight, as

decoder_weight = pipeline.steps[1][1].autoencoder.decoder.linear.weight.detach().cpu()
# decoder_weight has shape (vocabulary size, number of topics); topk along
# dim=0 picks the 10 highest-weighted vocabulary indices for each topic
topics = [
    [reverse_vocab[item.item()] for item in topic]
    for topic in decoder_weight.topk(10, dim=0)[1].t()
]
for index, topic in enumerate(topics):
    print(index, ','.join(topic))

For example, using a small subset of the 20 newsgroups dataset I get something like this:

0 newsreader,b8f,running,thanks,se,keywords,anybody,3t,ma,netcom
1 bible,christ,god,faith,christian,believe,religion,church,christians,think
2 sale,__,games,34,1t,internet,cwru,a86,17,players
3 israeli,team,jews,women,israel,hockey,city,home,san,25
4 pitt,writes,cwru,car,engineering,uiuc,article,cc,colorado,com
5 image,0d,card,files,disk,windows,chip,key,memory,jpeg
6 b8f,a86,34u,145,ax,34,_o,3t,0d,0t
7 video,__,sale,card,hp,chip,disk,windows,se,fax
8 israeli,israel,jews,ok,cleveland,0d,newsreader,law,al,chip
9 book,netcom,taken,change,box,today,week,tried,hard,memory

Now I can look back at my outputs of the pipeline and do something like

import pandas as pd

df = pd.DataFrame({'text': texts, 'topic': result.argmax(1)})

And then I have a mapping from the input text to the top topic index.
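
To get the top k topics per document, rather than just the top one, a small extension of the same idea should work (a sketch; k=3 is just an illustrative choice):

import numpy as np

k = 3
# sort each row ascending, take the last k indices and reverse them so
# the highest-weighted topic comes first
top_k = np.argsort(result, axis=1)[:, -k:][:, ::-1]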

A note of caution: I'd not recommend using (this implementation of) AVITM for anything serious unless you have time to tune it; this implementation is mostly a curiosity to understand how it works in PyTorch!