piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
Find keywords using entropy with Montemurro and Zanette algorithm #665

Closed tmylk closed 6 years ago

tmylk commented 8 years ago

The algorithm identifies words that are significant to the structure of the document - these often correspond to the major themes. It does so independently of a corpus.

Dr Peter J. Bleackley has kindly offered his implementation.

PeteBleackley commented 8 years ago

However, I feel that it will need a few modifications to its API to fit in nicely with the rest of Gensim, and I would like some advice from the core Gensim developers. My questions are:

  1. Where would be the best place to fit this algorithm into the Gensim project structure?
  2. In what format should the algorithm ingest data? The current implementation is designed for XML, mainly for historic reasons.
  3. In what format should the algorithm return its results?

Once I have answers to these questions, it shouldn't take too long to modify my code accordingly.

tmylk commented 8 years ago

Nice meeting you again yesterday. I will put this algo on our student page.

PeteBleackley commented 8 years ago

I'll be happy to advise any student who takes this project on.

bhargavvader commented 8 years ago

If there is interest for this and no one else wishes to take it up, I would like to give it a shot. :)

piskvorky commented 7 years ago

@bhargavvader sounds good, thanks!

@tmylk can you add some context to this ticket? What is "Montemurro and Zanette algorithm"?

PeteBleackley commented 7 years ago

Here's a link to a paper describing the algorithm.

https://arxiv.org/abs/0907.1558

piskvorky commented 7 years ago

@tmylk ticket context still missing, update.

tmylk commented 7 years ago

@piskvorky Could you please suggest a way to add context? The context is clear to me, with relevant links. There is even a volunteer contributor.

piskvorky commented 7 years ago

Sure -- something along the lines of "Here's a problem / motivation; here's what we could do to solve it".

The first part is missing -- from the link it's not apparent to me what "Montemurro and Zanette algorithm" does, and the linked implementation doesn't explain it either (that I can see).

If this is implemented in gensim, what will it actually do? Who is it for?

PeteBleackley commented 7 years ago

The algorithm identifies words that are significant to the structure of the document - these often correspond to the major themes. It does so independently of a corpus.
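To make that concrete, here is a rough sketch of the general idea, not the implementation linked in this ticket. It assumes the text is already tokenized, splits it into equal-sized blocks, and scores each word by how much more unevenly it is spread across blocks than it would be if the text were shuffled. The function names, the default parameters, and the Monte Carlo estimate of the shuffled baseline are all illustrative assumptions and simplifications.

```python
import math
import random
from collections import Counter


def block_entropy(counts):
    """Shannon entropy (in bits) of one word's distribution over blocks."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def word_entropies(tokens, n_blocks):
    """Map each word to the entropy of its occurrences across equal-sized blocks."""
    block_size = max(1, len(tokens) // n_blocks)
    per_word = {}
    for i, tok in enumerate(tokens):
        # Any leftover tokens at the end are folded into the last block.
        block = min(i // block_size, n_blocks - 1)
        per_word.setdefault(tok, [0] * n_blocks)[block] += 1
    return {w: block_entropy(c) for w, c in per_word.items()}


def entropy_keywords(tokens, n_blocks=8, n_shuffles=20, seed=0):
    """Rank words by frequency * (mean shuffled entropy - observed entropy).

    Hypothetical helper: the shuffled baseline is estimated by sampling,
    which is a simplification chosen for this sketch.
    """
    rng = random.Random(seed)
    observed = word_entropies(tokens, n_blocks)

    # Average each word's entropy over several random shuffles of the text.
    baseline = Counter()
    shuffled = list(tokens)
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        for word, h in word_entropies(shuffled, n_blocks).items():
            baseline[word] += h / n_shuffles

    freq = Counter(tokens)
    n = len(tokens)
    scores = {w: (freq[w] / n) * (baseline[w] - observed[w]) for w in freq}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Words that cluster in a few parts of the document (typically topical words) get high scores, while words spread evenly throughout (typically function words) score near or below zero. Nothing outside the document itself is needed, which is the sense in which the method is independent of a corpus.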

piskvorky commented 7 years ago

Aha, thanks @petebleackley. So this is a candidate to replace the summarization.keywords package, if I understand correctly @tmylk .

It would be interesting to compare them side-by-side, see which algo works better (and deprecate the other one -- we don't want to maintain dead weight in gensim).

Or if the algorithms have non-overlapping strengths/weaknesses, document what they are. When should users use one or the other? Is there a standard benchmark? (@tmylk Qs for the incubator project)

PeteBleackley commented 6 years ago

I've implemented this in https://github.com/RaRe-Technologies/gensim/pull/1738. However, there is a merge conflict in summarization/__init__.py that needs to be resolved.