Closed by tmylk 6 years ago
However, I feel that it will need a few modifications to its API to fit in nicely with the rest of Gensim, and I would like some advice from the core Gensim developers. My questions are
Once I have answers to these questions, it shouldn't take too long to modify my code accordingly.
Nice meeting you again yesterday. I will put this algo on our student page.
I'll be happy to advise any student who takes this project on.
If there is interest for this and no one else wishes to take it up, I would like to give it a shot. :)
@bhargavvader sounds good, thanks!
@tmylk can you add some context to this ticket? What is "Montemurro and Zanette algorithm"?
Here's a link to a paper describing the algorithm.
@tmylk ticket context still missing, update.
@piskvorky Could you please suggest a way to add context? The context is clear to me, with relevant links. There is even a volunteer contributor.
Sure -- something along the lines of "Here's a problem / motivation; here's what we could do to solve it".
The first part is missing -- from the link it's not apparent to me what "Montemurro and Zanette algorithm" does, and the linked implementation doesn't explain it either (that I can see).
If this is implemented in gensim, what will it actually do? Who is it for?
The algorithm identifies words that are significant to the structure of the document - these often correspond to the major themes. It does so independently of a corpus.
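For anyone who wants something concrete to poke at, here is a rough, self-contained sketch of the general idea as described above: split the text into blocks, measure how unevenly each word is spread over the blocks (entropy), and compare that with the same word in randomly shuffled text. This is only an illustration, not the code from the linked implementation. The function names, the `blocksize` and `n_shuffles` parameters, and the Monte Carlo shuffle baseline are all assumptions made for this sketch; the paper may use a different way to obtain the shuffled-text baseline.

```python
import math
import random
from collections import Counter, defaultdict


def block_entropy(tokens, blocksize):
    """Entropy (in bits) of each word's distribution over fixed-size blocks."""
    per_word = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        per_word[tok][i // blocksize] += 1
    entropies = {}
    for word, blocks in per_word.items():
        total = sum(blocks.values())
        entropies[word] = -sum(
            (c / total) * math.log2(c / total) for c in blocks.values()
        )
    return entropies


def keyword_scores(tokens, blocksize=4, n_shuffles=50, seed=0):
    """Score words by how much more clustered they are than in shuffled text.

    Hypothetical helper: the baseline is estimated by shuffling the tokens
    n_shuffles times, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    counts = Counter(tokens)
    actual = block_entropy(tokens, blocksize)

    # Average entropy of each word over several random shufflings.
    baseline = defaultdict(float)
    shuffled = list(tokens)
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        for word, h in block_entropy(shuffled, blocksize).items():
            baseline[word] += h / n_shuffles

    # Weight the entropy difference by the word's relative frequency.
    n = len(tokens)
    return {w: counts[w] / n * (baseline[w] - actual[w]) for w in counts}


# Toy document: "whale" is concentrated in one block, the rest is spread out.
tokens = ["whale"] * 4 + ["the", "sea"] * 6
scores = keyword_scores(tokens)
top_word = max(scores, key=scores.get)
print(top_word, scores)
```

Note that nothing here needs a reference corpus: the only inputs are the tokens of the single document, which matches the "independently of a corpus" point.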
Aha, thanks @petebleackley. So this is a candidate to replace the summarization.keywords package, if I understand correctly @tmylk.
It would be interesting to compare them side-by-side, see which algo works better (and deprecate the other one -- we don't want to maintain dead weight in gensim).
Or if the algorithms have non-overlapping strengths/weaknesses, document what they are. When should users use one or the other? Is there a standard benchmark? (@tmylk Qs for the incubator project)
I've implemented this in https://github.com/RaRe-Technologies/gensim/pull/1738. However, there is a merge conflict in summarization/__init__.py that needs to be resolved.
Dr Peter J. Bleackley has kindly suggested his implementation