noseworm / convai


[New Feature] Topic model to extract latent topics for each article #1

Closed koustuvsinha closed 6 years ago

koustuvsinha commented 7 years ago

It could be useful to implement a topic classification model using fastText to extract the topics discussed in each article.

Implementation plan:

NicolasAG commented 7 years ago

Could be useful! But if we only have the topic of the article, it can only be used as a feature. We should try to think about a model that can actually talk about that topic. We already have a question answering model (DrQA); maybe a question generator would be good? :)

koustuvsinha commented 7 years ago

I used a set of topics collected from Reddit and StackExchange. The final set of general topics was selected by tracing the parents of each node in the Wibi Taxonomy and keeping the k closest parents based on doc2vec vectors. The resulting set of 651 topics can be found here. Now, for each topic, I collect its Wikipedia document and its k nearest children documents, which together form the training data.
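The "k closest parents" step could be sketched roughly as below. This is only an illustration: the vectors are toy values rather than real doc2vec output, cosine similarity is an assumed choice of distance, and the names `cosine` and `k_closest_parents` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_closest_parents(node_vec, parent_vecs, k):
    """Return the names of the k parents most similar to the node vector."""
    ranked = sorted(
        parent_vecs,
        key=lambda name: cosine(node_vec, parent_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d vectors standing in for doc2vec embeddings (assumption).
node = [1.0, 0.0]
parents = {"Music": [0.9, 0.1], "Sports": [0.0, 1.0], "Arts": [0.7, 0.7]}
closest = k_closest_parents(node, parents, 2)
print(closest)
```

With real embeddings, `parents` would hold one vector per candidate parent in the taxonomy.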

Next, I am training the topic model with fastText, following the parameters mentioned in the paper. I will document the exploration and evaluation scores here:
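For reference, fastText's supervised mode reads one example per line, with each label prefixed by `__label__`. A minimal sketch of preparing such a file is below; the helper names and the whitespace handling are assumptions for illustration, not the code used in this repo.

```python
def to_fasttext_line(label, text):
    """Format one (label, text) pair as a fastText supervised training line."""
    # Labels are whitespace-delimited tokens, so join any inner spaces.
    safe_label = "_".join(label.split())
    # Collapse newlines and repeated spaces so each example stays on one line.
    clean_text = " ".join(text.split())
    return "__label__" + safe_label + " " + clean_text


def write_training_file(examples, path):
    """Write an iterable of (label, text) pairs to a fastText training file."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_fasttext_line(label, text) + "\n")


line = to_fasttext_line("Society & Culture", "some  text\nhere")
print(line)
```

The resulting file could then be passed to the fastText command line tool, for example `fasttext supervised -input train.txt -output model`, with whatever hyperparameters are chosen.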

koustuvsinha commented 7 years ago

On further evaluation of the trained model, it seems the classifier has been over-trained on some high-level topics (such as Process, Work, Body, Practice), and every document is being labelled as one of those high-level topics. (This happens because, during training, we take the children of every topic from the Wiki tree.)
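One quick way to spot this kind of collapse is to look at how the predicted labels are distributed. A rough sketch follows; the predictions are made-up toy data, and `label_share` is a hypothetical helper.

```python
from collections import Counter


def label_share(predicted_labels, top_n=2):
    """Return the top_n most frequent labels with their share of all predictions."""
    total = len(predicted_labels)
    if total == 0:
        return []
    counts = Counter(predicted_labels)
    return [(label, count / total) for label, count in counts.most_common(top_n)]


# Toy predictions (assumption), dominated by two high-level topics.
predictions = ["Process", "Process", "Work", "Process", "Body", "Work", "Process", "Music"]
shares = label_share(predictions)
print(shares)
```

If a handful of labels account for most predictions on a varied set of documents, the classifier is likely biased toward them.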

Is there an open resource with a set of general topics we could use, such as "Music", "Sports", "Religion", etc.? @NicolasAG

koustuvsinha commented 7 years ago

@Breakend ?

koustuvsinha commented 7 years ago

Update: I used the Yahoo News corpus, which has a good overall mixture of topics:

Society & Culture
Science & Mathematics
Health
Education & Reference
Computers & Internet
Sports
Business & Finance
Entertainment & Music
Family & Relationships
Politics & Government

and trained fastText with the best hyperparameters mentioned in the paper. I am getting the same precision and recall as reported in the paper, 0.39 and 0.75 respectively. Since the topics are broad enough, I think this will work just fine for our case.
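For context, fastText's `test` command reports precision and recall at k. Assuming the figures above are of that kind (the thread does not say which k was used), the computation can be sketched as follows, with toy predictions and a hypothetical `precision_recall_at_k` helper.

```python
def precision_recall_at_k(predictions, gold, k):
    """Compute precision@k and recall@k over a set of documents.

    predictions: one ranked list of labels per document.
    gold: one set of true labels per document.
    """
    hits = 0
    predicted = 0
    relevant = 0
    for ranked_labels, true_labels in zip(predictions, gold):
        top_k = ranked_labels[:k]
        hits += sum(1 for label in top_k if label in true_labels)
        predicted += len(top_k)
        relevant += len(true_labels)
    precision = hits / predicted if predicted else 0.0
    recall = hits / relevant if relevant else 0.0
    return precision, recall


# Toy data (assumption): four documents, one true label each, top-2 predictions.
preds = [
    ["Sports", "Health"],
    ["Health", "Sports"],
    ["Education & Reference", "Health"],
    ["Sports", "Health"],
]
gold = [{"Sports"}, {"Sports"}, {"Health"}, {"Politics & Government"}]
p, r = precision_recall_at_k(preds, gold, 2)
print(p, r)
```

With one true label per document, precision at k can be at most 1/k, which is why precision may sit well below recall when k is greater than 1.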

koustuvsinha commented 7 years ago

Todos: