quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

Implement TextRank using a dfm #40

Open brousseauj opened 5 years ago

brousseauj commented 5 years ago

It would be great to take a list of documents and quickly summarize the relevant ones. Sometimes functions like topfeatures() don't give enough context.

koheiw commented 5 years ago

Does "relevant" mean similarity? If so, it is coming soon.

brousseauj commented 5 years ago

Yeah, I guess. Something that looks at the top features and then pulls up the documents containing the most of them, which would be considered the most "relevant".
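To make that idea concrete, here is a minimal sketch of ranking documents by how many of the corpus-wide top features they contain. It is written in plain Python purely for illustration and is not quanteda code: the names `top_features` and `rank_by_top_features` are hypothetical, tokenisation is a naive regular expression, and no stopwords are removed (a real pipeline would likely drop them first).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_features(documents, k=10):
    """Return the k most frequent tokens across all documents."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return [term for term, _ in counts.most_common(k)]


def rank_by_top_features(documents, k=10):
    """Score each document by its number of top-feature occurrences.

    Returns (document index, score) pairs, highest score first.
    """
    top = set(top_features(documents, k))
    scores = [sum(1 for tok in tokenize(doc) if tok in top) for doc in documents]
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


docs = ["tax tax tax budget", "budget tax", "football match"]
ranking = rank_by_top_features(docs, k=2)
```

Counting raw occurrences favours long documents; dividing each score by document length would be one way to offset that, if it matters for the use case.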

kbenoit commented 5 years ago

TextRank applies PageRank to a graph of sentences in order to score them, so that the sentences with the highest scores can be considered good summaries of a document or a collection of documents. The methodology is described here: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf.
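As a rough illustration of the scoring step, the sketch below builds a sentence graph weighted by word overlap and runs a weighted PageRank over it by power iteration. It is plain Python, not the textrank package or quanteda, and it makes several assumptions: the function names are hypothetical, tokenisation is a naive regular expression, overlap is computed on sets of unique words, and the damping factor of 0.85 and fixed iteration count are conventional choices rather than anything prescribed in this thread.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of unique word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def overlap_similarity(a, b):
    """Shared words divided by the sum of the log sentence lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:  # both sentences contain a single word
        return 0.0
    return len(a & b) / denom


def textrank(sentences, similarity=overlap_similarity, damping=0.85, iterations=50):
    """Return one PageRank-style score per sentence."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [tokenize(s) for s in sentences]
    # Symmetric edge weights between sentences; no self-loops.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


sentences = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "the dog sat on the mat with the cat",
    "stock markets fell sharply today",
]
scores = textrank(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
```

In this toy input the third sentence shares the most words with the others, so it receives the highest score, while the unrelated fourth sentence receives only the baseline `(1 - damping) / n`. A summary would then keep the few highest-scoring sentences.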

@jwijffels has written an R implementation. Some approaches score sentences based on GloVe (embedding) scores but that package uses standard distance metrics from frequencies. Would be interesting to write a quanteda function that wraps this package.

jwijffels commented 5 years ago

@brousseauj The R implementation I've written in https://cran.r-project.org/web/packages/textrank/index.html looks for word overlap. Nothing stops you from calculating another similarity metric (e.g. tf-idf, embedding similarity, or any other metric) and feeding that similarity function into textrank_sentences. In fact, that's what I do in quite a few projects: I now tend to use the package ruimtehol https://cran.r-project.org/web/packages/ruimtehol/index.html to calculate sentence embeddings and feed them to textrank to summarise sentences.
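To illustrate what swapping in another similarity metric could look like, here is a minimal sketch that computes pairwise tf-idf cosine similarities between sentences. It is plain Python and independent of the textrank package; the function names are hypothetical, the idf is the simple `log(n / df)` variant, and how such a matrix would be passed to a given TextRank implementation depends on that implementation's interface.

```python
import math
import re
from collections import Counter


def tfidf_vectors(sentences):
    """Return one sparse tf-idf vector (a dict of term -> weight) per sentence."""
    tokenised = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    n = len(tokenised)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for toks in tokenised for term in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(sentences):
    """Pairwise tf-idf cosine similarities, with zeros on the diagonal."""
    vecs = tfidf_vectors(sentences)
    n = len(vecs)
    return [
        [cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


sentences = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "stock markets fell sharply",
]
sim = similarity_matrix(sentences)
```

The resulting matrix plays the same role as the word-overlap weights: entry `sim[i][j]` is the edge weight between sentences `i` and `j`. Embedding-based similarity would follow the same pattern, with dense sentence vectors in place of the tf-idf dicts.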

brousseauj commented 5 years ago

That's really interesting. I'll have to check out that embedding package! How does it compare to other embedding packages like doc2vec or word2vec? My knowledge of embeddings is quite limited, so I apologize for the basic questions!

jwijffels commented 5 years ago

Ruimtehol is an R wrapper around StarSpace, which lets you embed all the things: articles, sentences, words, bigrams, labels, tags, persons, websites, entities and entity relations. Anything, really.

jwijffels commented 3 years ago

FYI: you can also use the R package doc2vec https://github.com/bnosac/doc2vec or the R package word2vec https://github.com/bnosac/word2vec to get embeddings for measuring text similarities and feed those into textrank. I think I'll push doc2vec to CRAN in the coming weeks (feel free to test and provide feedback); word2vec has been on CRAN for a long time.

kbenoit commented 3 years ago

Thanks @jwijffels, good suggestions. We're modularising quanteda to make it easier to maintain, easier to extend, and easier to integrate. I look forward to putting in some serious work on textmodels soon.