piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Suggesting Replacement of TextRank. #1085

Closed metalaman closed 7 years ago

metalaman commented 7 years ago

I, along with @dust0x and @pranay360, am writing a blog post on extractive vs. abstractive text summarization for RaRe Technologies. While comparing different extractive models, we found that Sumy's Lex_Rank not only performs better than Gensim's Text Rank, but is also faster.

Summarization was performed on the Opiniosis dataset (51 articles with 5 gold summaries each), and the models were compared using ROUGE-1 score. The ROUGE-2 score for Lex_Rank (0.054) was better than that of Gensim's Text Rank (0.034).

Sumy's Lex_Rank also seems to run faster:

Sumy's Lex_Rank: 1 loop, best of 3: 41.2 s per loop
Gensim's Text Rank: 1 loop, best of 3: 57.7 s per loop

A link to the draft of the blog post is here. Examples of the generated summaries are in the blog post.

We suggest replacing the BM25 ranking function with Lex_Rank's idf-modified-cosine distance.
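
For readers unfamiliar with the metric: ROUGE-N scores like the ones quoted above measure n-gram overlap between a generated summary and a reference summary. The thread does not say which ROUGE variant or tool was used, so the following is only a minimal sketch of recall-oriented ROUGE-N against a single reference, assuming pre-tokenized input. The function names `ngrams` and `rouge_n_recall` are hypothetical and not part of Gensim or Sumy.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate.

    Assumption: recall-only, single reference, clipped counts.
    """
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's credit at the number of times the candidate contains it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


candidate = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, 1))
print(rouge_n_recall(candidate, reference, 2))
```

A dataset with several gold summaries per article would need some way of combining per-reference scores; that step is left out here.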

aayux commented 7 years ago

cc @piskvorky @tmylk

piskvorky commented 7 years ago

Sounds great. What is involved in this "upgrade", technically speaking?

Is it a clean, backward-compatible change, or are there further changes needed to the API?

metalaman commented 7 years ago

Using Lex_Rank's idf-modified-cosine as weights of the edges in the graph instead of using BM25 looks like the only change that would be required. More on Lex_Rank's idf-modified-cosine here.
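
To make the proposed edge weight concrete, here is a rough sketch of idf-modified-cosine similarity between two tokenized sentences, following the formula in the LexRank paper. It is not taken from Gensim or Sumy: the helper names (`compute_idf`, `idf_modified_cosine`, `similarity_matrix`) are hypothetical, and the IDF definition `log(N / df)` is an assumption that real implementations may smooth differently.

```python
import math
from collections import Counter


def compute_idf(sentences):
    """Map each word to log(N / df), df = number of sentences containing it.

    Assumption: unsmoothed IDF, so a word present in every sentence gets 0.
    """
    n = len(sentences)
    df = Counter(word for sentence in sentences for word in set(sentence))
    return {word: math.log(n / count) for word, count in df.items()}


def idf_modified_cosine(x, y, idf):
    """Cosine similarity of two sentences using tf * idf term weights."""
    tf_x, tf_y = Counter(x), Counter(y)
    common = set(tf_x) & set(tf_y)
    numerator = sum(tf_x[w] * tf_y[w] * idf[w] ** 2 for w in common)
    norm_x = math.sqrt(sum((tf_x[w] * idf[w]) ** 2 for w in tf_x))
    norm_y = math.sqrt(sum((tf_y[w] * idf[w]) ** 2 for w in tf_y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return numerator / (norm_x * norm_y)


def similarity_matrix(sentences):
    """Pairwise edge weights for a sentence graph (hypothetical helper)."""
    idf = compute_idf(sentences)
    return [[idf_modified_cosine(a, b, idf) for b in sentences] for a in sentences]


sentences = [["cat", "sat", "mat"], ["cat", "ate", "fish"], ["dog", "ran"]]
matrix = similarity_matrix(sentences)
print(matrix[0][1])
```

In a graph-based summarizer, a matrix like this would stand in for the BM25 scores as the weights on edges between sentence nodes; the ranking step that runs over the graph is a separate matter.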

aayux commented 7 years ago

Perhaps we could have LexRank as an option in addition to TextRank like Sumy does.

tmylk commented 7 years ago

@dust0x The goal is to remove outdated algorithms from Gensim. So if Lex_Rank is better, we would remove TextRank and have a wrapper for Lex_Rank instead, ideally preserving the same API so that the user notices no difference.

metalaman commented 7 years ago

We found the following papers, in which TextRank performs better:

http://ltrc.iiit.ac.in/icon/2013/proceedings/File49-paper108.PDF
http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2009/Garg09-CAG.pdf
http://web.fi.uba.ar/~fbarrios/tprofesional/articulo-en.pdf

TextRank performs a little better than LexRank.

tmylk commented 7 years ago

Closing, as the existing TextRank has 6% better performance and a much easier implementation than LexRank.