[New Feature] Topic model to extract latent topics for each article

noseworm / convai


[New Feature] Topic model to extract latent topics for each article #1

Closed: koustuvsinha closed this issue 6 years ago

koustuvsinha commented 7 years ago

It could be useful to implement a topic classification model using fastText to extract the topics discussed in each article.

Implementation plan:

[x] Get a set of general topics from the WiBi taxonomy
[x] Get Wikipedia articles for each of the above topics, including their children (pruned to children that have at least 10 child nodes)
[x] Train fasttext
[x] Evaluate
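
As a rough illustration of the data-preparation step in the plan above: fastText's supervised mode reads one example per line, with each label marked by the `__label__` prefix (its default). The sketch below writes such a file. The function names, the sample topics, and the output path are hypothetical and not taken from this repo.

```python
import re
import tempfile
from pathlib import Path


def to_fasttext_line(topic: str, text: str) -> str:
    """Format one (topic, document) pair as a fastText supervised training line."""
    # Labels cannot contain whitespace, so collapse it to underscores.
    label = re.sub(r"\s+", "_", topic.strip())
    # fastText reads one example per line, so flatten newlines in the document.
    flat_text = " ".join(text.split())
    return f"__label__{label} {flat_text}"


def write_training_file(pairs, path):
    """Write (topic, text) pairs to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for topic, text in pairs:
            f.write(to_fasttext_line(topic, text) + "\n")


# Hypothetical sample data, for illustration only.
sample = [
    ("Music", "A symphony is an extended\nmusical composition."),
    ("Computer science", "An algorithm is a finite sequence of instructions."),
]
out_path = Path(tempfile.gettempdir()) / "topics.train.txt"
write_training_file(sample, out_path)
```

A file in this shape is what the `fasttext supervised` command takes as its `-input` argument.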

NicolasAG commented 7 years ago

Could be useful! But if we only have the topic of the article, it can only be used as a feature. We should try to think about a model that can actually talk about that topic. We already have a question-answering model (DrQA), so maybe a question generator would be good? :)

koustuvsinha commented 7 years ago

I used a set of topics collected from Reddit and StackExchange. The final set of general topics was selected by tracing the parents of each node in the WiBi taxonomy and keeping the k closest parents based on doc2vec vectors. The set of 651 extracted topics can be found here. Now, for each topic I collect its Wikipedia document and the documents of its k nearest children, which together form the training data.
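
The "k closest parents" selection could look roughly like the sketch below. It assumes cosine similarity as the closeness measure, which this thread does not specify, and uses tiny made-up vectors in place of real doc2vec embeddings. All names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_closest_parents(topic_vec, parent_vecs, k):
    """Return the names of the k parents whose vectors are most similar to topic_vec."""
    ranked = sorted(
        parent_vecs,
        key=lambda name: cosine(topic_vec, parent_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d vectors standing in for doc2vec embeddings (assumption, for illustration).
topic = [1.0, 0.0]
parents = {"Arts": [0.9, 0.1], "Science": [0.0, 1.0], "Music": [1.0, 0.0]}
closest = k_closest_parents(topic, parents, k=2)
```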

Next, using fasttext I am training the topic model following the parameters mentioned in the paper. I will document the exploration and evaluation scores here:

[x] default params, 100 epochs: loss 0.39, P@1 0.591, R@1 0.58
[x] default params, 100 epochs, pretrained word2vec: loss 0.39, P@1 0.59, R@1 0.59
~[ ] ngram 2 epoch 100~
~[ ] ngram 5 epoch 100~
~[ ] ngram 2 epoch 200~
~[ ] ngram 5 epoch 200~
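
For reference, the P@1 and R@1 figures in the list above can be reproduced from raw predictions along these lines. This is a minimal sketch of precision and recall at k as I understand fastText's `test` output to define them (correct predictions divided by predictions made, and by gold labels, respectively). The function name and sample data are hypothetical.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """Compute (P@k, R@k) over a dataset.

    gold:      list of sets of true labels, one set per document
    predicted: list of label lists ranked by score, one list per document
    """
    n_correct = 0
    n_predicted = 0
    n_gold = 0
    for gold_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in gold_labels)
        n_predicted += len(top_k)
        n_gold += len(gold_labels)
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


# Made-up example: three documents, the second has two gold labels.
gold = [{"Music"}, {"Sports", "Health"}, {"Health"}]
predicted = [["Music", "Sports"], ["Sports", "Music"], ["Sports", "Health"]]
p_at_1, r_at_1 = precision_recall_at_k(gold, predicted, k=1)
```

When every document has exactly one gold label, P@1 and R@1 coincide, which is consistent with the near-identical values reported here.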

koustuvsinha commented 7 years ago

On further evaluation of the trained model, it seems the classifier has been over-trained on some high-level topics (such as Process, Work, Body, Practice), and every document is being labelled as one of those high-level topics. This is because, during training, we take the children of every topic from the Wiki tree.
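
One quick way to confirm this kind of collapse is to look at how the predicted labels are distributed over a held-out set. The sketch below flags labels that account for more than a chosen share of predictions. The function name, the 25% threshold, and the sample predictions are all made up for illustration.

```python
from collections import Counter


def dominant_labels(predicted_labels, threshold=0.25):
    """Return {label: share} for labels predicted in more than `threshold` of cases."""
    total = len(predicted_labels)
    if total == 0:
        return {}
    counts = Counter(predicted_labels)
    return {
        label: count / total
        for label, count in counts.items()
        if count / total > threshold
    }


# Made-up top-1 predictions for ten documents.
preds = ["Process"] * 6 + ["Work"] * 3 + ["Music"]
flagged = dominant_labels(preds, threshold=0.25)
```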

Is there any open resource with a set of general topics we could use, such as "Music", "Sports", "Religion", etc.? @NicolasAG

koustuvsinha commented 7 years ago

@Breakend ?

koustuvsinha commented 7 years ago

Update: I used the Yahoo News corpus, which has a good overall mixture of topics:

Society & Culture
Science & Mathematics
Health
Education & Reference
Computers & Internet
Sports
Business & Finance
Entertainment & Music
Family & Relationships
Politics & Government

and trained fastText with the best hyperparameters mentioned in the paper. I am getting the same precision/recall as mentioned in the paper: 0.39 and 0.75. Since the topics are broad enough, I think this would work just fine for our case.

koustuvsinha commented 7 years ago

Todos:

[x] Install fasttext in docker
[x] Have a wrapper to directly call the trained model to get the topic
[ ] Have a list of candidate sentences to handle the question
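
The wrapper item above might look something like the following sketch. It assumes a model object exposing `predict(text, k)` that returns a pair of (labels, probabilities), which is the shape the fastText Python bindings use; with the real library the model would come from `fasttext.load_model(...)`. The class name and the stub model are hypothetical, and the stub exists only so the example runs without a trained model.

```python
class TopicClassifier:
    """Thin wrapper that returns clean (topic, probability) pairs for a text."""

    def __init__(self, model, label_prefix="__label__"):
        self.model = model
        self.label_prefix = label_prefix

    def topics(self, text, k=1):
        # The fastText bindings predict one line at a time, so remove newlines.
        flat = " ".join(text.split())
        labels, probs = self.model.predict(flat, k=k)
        results = []
        for label, prob in zip(labels, probs):
            # Strip the training-time label prefix to get a readable topic name.
            if label.startswith(self.label_prefix):
                label = label[len(self.label_prefix):]
            results.append((label, float(prob)))
        return results


class _StubModel:
    """Stand-in for a trained model (assumption: same predict() return shape)."""

    def predict(self, text, k=1):
        labels = ("__label__Sports", "__label__Health")
        probs = (0.8, 0.1)
        return labels[:k], probs[:k]


classifier = TopicClassifier(_StubModel())
top_topics = classifier.topics("The match\nended 2-1", k=2)
```

Keeping the model behind a small class like this means the rest of the bot only ever sees plain topic names, so the underlying classifier can be swapped without touching callers.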