piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Discussion: discard "gensim.summarization"? #2592

Closed: gojomo closed this 4 years ago

gojomo commented 5 years ago

In the course of considering the list question at https://groups.google.com/d/msg/gensim/v24RI3-oUq0/NYlPpif1AQAJ, I took a slightly-deeper look at gensim.summarization than before.

From that look, my opinion is that its presence is more likely to waste people's time than help them. It's fairly rudimentary functionality, but spread across many files, with its own non-configurable regex-based word- and sentence-tokenization, and a lot of hard-to-follow steps. None of the doc/tutorial examples show impressive results.

I even find it hard to imagine anyone getting satisfactory results from this approach, so I expect most people's interaction with this code is: (1) "I need summarization – and cool, gensim has a summarization feature!" (2) View its docs/tutorial and try on some real data. (3) "This is nowhere near what I need, nor is it customizable/fixable enough to be tweaked into service." (4) They look for something else entirely.

I'd suggest marking the whole module 'deprecated' with an eye towards eventual removal. And, if summarization is an important thing to truly support, soliciting someone to work-up a better algorithm or implementation, one that can actually demo some useful results in a tutorial/demo, and that also mixes well with other corpus-format/tokenization practices in gensim. (It might even be TextRank-based – but with configurable tokenization & sentence-similarity/graph-building steps.)

piskvorky commented 5 years ago

+1 on that. IIRC, the algo is actually OK / standard, but the technical execution (engineering, design) was poor.

One of the (several) modules in Gensim I'd be scared to use myself, and consequently never did.

Discussions go on the mailing list though, why did you open it here?

gojomo commented 5 years ago

Opened this here because this seemed to me more like a committer-level discussion regarding quality/standards/policies. Also, it'd ideally yield tangible issue-like followup steps, if there were agreement, for which the issue could then record the motivating reasoning & decisions. That's a bit like the prior GH-issue to discuss when/whether Python2-support should be dropped, or the GH-issue asking whether issues themselves should auto-close after deadlines. It's essentially a "feature request" in reverse: a "de-feature request". But happy to discuss there instead or also, as appropriate.

I've generally not been too impressed with "extractive summarization" – it seems to only be useful when the original text was already well authored, in a hierarchical & expository "reference" style. There, extractive summarization has a fair chance of finding the inherently-summarizing sentences/passages the author already included. (Elsewhere, it stumbles hard – as on some of the winding-plot-narratives that some of the tutorial code for this feature has inexplicably chosen to highlight.)

So to the extent TextRank or some other extractive method survives, it'd be helpful to more specifically set expectations. For example, get the name of the algorithm (textrank) into the module or function name, or the type of summarization (extractive), or the essential limitation on its kind of output (sentences_subset).

And, docs/tutorials could highlight some kinds of texts on which it works well, and others where it doesn't. (One potential evaluation, for a summarizer that's not order-dependent in its choice of sentences: shuffle all the sentences of a Wikipedia article together, run the algorithm, and consider algorithms better the more sentences they choose from the article's actual 'summary' section, section 0. A sketch of that scoring follows.)
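
To make that concrete, here's a minimal sketch of the shuffle-based scoring. The summarizer under test, `summarize_fn`, is a hypothetical placeholder; only the scoring logic is the point:

```python
import random

def lead_recovery_score(lead_sentences, body_sentences, summarize_fn, k=10, seed=0):
    """Score an order-insensitive extractive summarizer by how many of its
    k chosen sentences come from the article's actual lead/summary section."""
    pool = list(lead_sentences) + list(body_sentences)
    random.Random(seed).shuffle(pool)   # destroy positional cues
    chosen = summarize_fn(pool, k)      # hypothetical: returns a subset of `pool`
    lead_set = set(lead_sentences)
    return sum(1 for s in chosen if s in lead_set) / max(len(chosen), 1)
```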

From what I've read of TextRank, its method of calculating sentence-to-sentence similarity (and thus the edges of its sentence-to-sentence graph) could be pluggable, and methods based on average-of-word-vectors, doc-vectors, or WMD similarity might work quite well compared to the current code (which, if I've read it right, just checks nearly-exact word overlap).
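
Something in this shape, say. This is an illustrative sketch only, not a proposed Gensim API; `similarity` could wrap word overlap, average-of-word-vectors, Doc2Vec, or WMD scoring:

```python
import networkx as nx

def textrank_sentences(sentences, similarity, top_n=5):
    """Rank sentences by PageRank over a similarity-weighted sentence graph."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = similarity(sentences[i], sentences[j])  # pluggable step
            if w > 0:
                graph.add_edge(i, j, weight=w)
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]  # restore document order
```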

mpenkov commented 5 years ago

+1 for deprecation and eventual removal.

Perhaps this is something we should do in the next major release?

fredzannarbor commented 3 years ago

There are still a lot of places on the web that recommend using gensim.summarization, so this was not super helpful.

gojomo commented 3 years ago

There are still a lot of places on the web that recommend using gensim.summarization, so this was not super helpful.

@fredzannarbor It'd be helpful if you let those places know they now need to make some other better recommendation!

ismailhammounou commented 3 years ago

Do you have any recommendation for bm25? There is a tutorial that I want to replicate in my use case, and it still uses BM25.

gojomo commented 3 years ago

Do you have any recommendation for bm25? There is a tutorial that I want to replicate in my use case, and it still uses BM25.

If a tutorial/approach worked well with the older Gensim version, you can always choose to install & use that older version, for example in an isolated, project-specific virtual environment. Only if you also need closely-integrated later-version features or fixes would there be any complications.

(And, if you really like some of the removed code, & are sure it meets your needs, you can always copy the source code into your own project, adapting names/prerequisites lightly as necessary. Just remember that the choice to remove things has usually been driven by an assessment that the code had limitations that made it hard to officially support, often including no one active in the project with the knowledge/interest to answer questions or investigate issues.)

bwindsor22 commented 3 years ago

:( @gojomo so if I need, e.g. text split by sentences, I need a dependency on something like NLTK?

Would having more maintainers help in a decision like this?

gojomo commented 3 years ago

Yes, if you need a text split by sentences, using a project that has well-maintained code for doing that is wise.

That's what Gensim itself would want to do, if any of its current algorithms needed to split text into sentences. (In general, they don't.)

The prior code for this in gensim.summarization.textcleaner.get_sentences() wasn't very good, given other better options just a pip install away.

But also, it was about 2 lines of crude regex-based string splitting. If that's all you need, it's easy to copy. See:

https://github.com/RaRe-Technologies/gensim/blob/release-3.8.3/gensim/summarization/textcleaner.py#L37

https://github.com/RaRe-Technologies/gensim/blob/release-3.8.3/gensim/summarization/textcleaner.py#L147-L173
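
Paraphrased from that linked 3.8.3 source (a sketch, not a verbatim copy), the whole approach boils down to roughly this:

```python
import re

# One regex: treat . ! ? followed by whitespace, or a line end, as a boundary.
RE_SENTENCE = re.compile(r'(\S.+?[.!?])(?=\s+|$)|(\S.+?)(?=[\n]|$)', re.UNICODE)

def get_sentences(text):
    """Yield sentence substrings of `text`, crudely split on .!? or newlines."""
    for match in RE_SENTENCE.finditer(text):
        yield match.group()

# e.g. list(get_sentences("Dr. Smith arrived. He left!")) happily splits after
# "Dr." -- exactly the kind of non-configurable behavior criticized above.
```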

Witiko commented 3 years ago

Although I agree with the removal of the gensim.summarization module, Okapi BM25 is the standard baseline for question answering and information retrieval, which outperforms TF-IDF and Log-Entropy even with parameter tuning.

Is there any suitable replacement for gensim.summarization in the context of information retrieval at the moment? I am aware of the rank-bm25 library, which is fast and easy to set up, but also incompatible with Gensim's Dictionary and with techniques for query expansion, such as SoftCosineSimilarity. If not, would there be any objections against creating a gensim.models.bm25 module, which would provide a model with an interface similar to gensim.models.tfidf? It's sorely missed.
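
For concreteness, standard Okapi BM25 amounts to very little code. A minimal standalone sketch (illustrative only, not the interface a gensim.models.bm25 module would need to expose), with the usual k1 and b tuning parameters:

```python
import math
from collections import Counter

def bm25_scores(query_terms, corpus, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document, for tokenized query & corpus."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    df = Counter(term for doc in corpus for term in set(doc))  # document freq.
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores
```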

piskvorky commented 3 years ago

+1 on including BM25 in Gensim. We'll just need to vet the code better.

But I don't expect it will be a problem with your code.

Witiko commented 2 years ago

I implemented BM25 and opened PR #3304 on April 2. Quantitative results on information retrieval show a marked improvement over TF-IDF, and compatibility with existing implementations such as rank-bm25. I would appreciate your comments and reviews.

fredzannarbor commented 2 years ago

How can one now accomplish summarization with gensim?

gojomo commented 2 years ago

How can one now accomplish summarization with gensim?

There's no summarization functionality in current versions. You could try a 3.x version, & if the results work well for you, keep using that old version, or copy its source-code into your project.

If you want state-of-the-art summarization – including potentially abstractive (paraphrasing) summarization, not just the crude selection of some subset of guessed-important sentences that the previous Gensim extractive summarization provided – and have sufficient resources, you could look at newer, deeper large language models, like BERT/etc.

dogayagcizeybek commented 1 year ago

Artificial intelligence has been evolving rapidly, and we can enhance the functionality of a simple open-source library both algorithmically and with a database-based approach. Since it hasn't been explicitly stated that summarization must be algorithm-based, I would like to request bringing back this idea. What is your perspective on starting a pull request for this thought? We could add a warning during the development phase to ensure that it doesn't consume people's time until satisfactory results are achieved.

gojomo commented 1 year ago

Since it hasn't been explicitly stated that summarization must be algorithm-based, I would like to request bringing back this idea. What is your perspective on starting a pull request for this thought?

Do you mean a PR to restore exactly the old code?

I think that'd be silly - it was bad code, poorly maintained, without any public examples of it providing good results, and as far as I could tell it wasted the time of most people who tried it. A mere documentation or code-comment or even printed-to-console warning that the code is likely to disappoint doesn't, in my experience, provide enough discouragement. They're still tempted by the label, or by misleading old examples online – & thus waste their time, & ours.

But still, if people really want it, maybe they have one of the rare tasks where this technique has good results. (I've seen people report this, but never seen any working demo of this code, on even contrived/cherry-picked data, showing useful results.)

In that case, they can fetch the code out of the old versions. It's easy to get, it's not that long, it's open-source.

As mentioned in the initial 2019 discussion, if someone wanted to make a more-generalizable and more-maintainable implementation of the 'TextRank' algorithm on which this gensim-summarization was based, there might be a case for that.

With pluggable word/sentence tokenization, & pluggable/configurable sentence-centrality-ranking options, this kind of early extractive text summarization algorithm might still be useful against some well-written texts, or interesting didactically about the limits of summarization capabilities before deep neural networks.

But here in 2023+, even an excellent & flexible implementation of TextRank-style, sentence-excerpts summarization will be far worse than what's cheap & easy with modern LLMs.

fredzannarbor commented 1 year ago

A couple of thoughts about this.

  1. “Cheap and easy” is not free. Useful to have free summarization built into the package.
  2. Extractive summarization is an important alternative because you can rely on the words in the summary being the same as the original source. For some applications that’s essential.
  3. Modern LLMs still struggle with context window size. It’s crucial to have at least one tool that can summarize very long documents as a whole, ideally not constrained by memory size.

gojomo commented 1 year ago

You didn't clarify whether your proposal is to bring back the old code, but your allusion to 'free' suggests that may be what you mean.

  1. “Cheap and easy” is not free. Useful to have free summarization built into the package.

There was never any truly 'free' summarization in the past, nor is any possible in the future. The prior code was low quality. Users wasted time & effort, which is not free, trying to get it to work. Maintainers faced questions from frustrated users, which imposed costs even when the answer was, "no help is available".

(And, with compact & open-source LLMs, those options are potentially as close to 'free' as anything else.)

  2. Extractive summarization is an important alternative because you can rely on the words in the summary being the same as the original source. For some applications that’s essential.

I am unfamiliar with applications where using the exact same words is essential. Can you provide some links to representative applications where that's better than high-quality abstractive summarization?

As I've mentioned, I've never seen any texts on which the old code delivered good results. (Our own demo notebook showed only poor/nonsense results.) If you know of cases where this has been shown to work well, can you provide links?

To the extent someone really wanted to retrieve the "most representative" verbatim sentences from a longer work, as a sort of IR task, I suspect that applying other algorithms better-supported in Gensim – LDA, average-of-word-vectors, WMD, Doc2Vec, etc – would select better excerpts than the prior crude gensim.summarization code.

If such selection-of-verbatim-excerpts is a real need driving your request, I suggest trying some of those other algorithms.
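
For instance, a minimal sketch of that idea using average-of-word-vectors centrality: rank sentences by cosine similarity to the document's centroid vector, then quote the top ones verbatim. The vectors file is a placeholder; any KeyedVectors would do:

```python
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # placeholder

def sent_vec(tokens):
    """Average of in-vocabulary word vectors, or zeros if none are known."""
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def top_excerpts(tokenized_sentences, top_n=20):
    """Return indices of the sentences closest to the document centroid."""
    mat = np.array([sent_vec(s) for s in tokenized_sentences])
    centroid = mat.mean(axis=0)
    norms = np.linalg.norm(mat, axis=1) * np.linalg.norm(centroid)
    sims = mat @ centroid / np.where(norms == 0, 1, norms)  # cosine similarity
    return sorted(np.argsort(sims)[::-1][:top_n])  # keep document order
```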

But also, if you have any published or private evaluations showing the old gensim.summarization code doing better than extant alternatives, that would be useful to see. It doesn't seem likely, from my read & tests. (On what texts have you applied the code & reviewed its results?)

  3. Modern LLMs still struggle with context window size. It’s crucial to have at least one tool that can summarize very long documents as a whole, ideally not constrained by memory size.

A tool that could effectively summarize arbitrarily long documents would be useful!

I've seen no evidence the old code could serve as that tool.

Among its other substandard aspects: it required entire documents in memory, and its analysis then ballooned memory use massively. Even after the fixes in #2298, it was reported to fail with a MemoryError on a 16GB-RAM machine when trying to summarize a text under 4MB in size (Tolstoy's full 'War and Peace').

If you think you've found an extractive-summarization technique that could outcompete an LLM due to an LLM's window-size limitations, I'd want to see some credible evaluations demonstrating that, including that it outperforms the simplest plausible LLM workaround: summarize acceptably-sized chunks, concatenate those summaries, repeat. It doesn't seem likely to me that any extractive approach would be competitive, but I'd enjoy being surprised if that can be shown!

fredzannarbor commented 1 year ago

Seems from the tone and amount of feedback that you really don’t want to do this, which is fine. I don’t use Gensim anymore - I stopped when you dropped the summarization tool.

  1. “Free to me” is a very important consideration for users. Up to you as a developer whether you want to bear the cost.
  2. Law - citing cases. History - citing documents. There are many situations where paraphrasing is unacceptable; this should be obvious.
  3. Concatenation and recursion are more cumbersome than a single function or command line call, which is what I am looking for. As I noted, extraction from large documents is sometimes preferable to abstraction.

gojomo commented 1 year ago

The old code is still there, free to use if it works well for your needs. (You can install older versions of Gensim as needed, or copy & paste the relevant source code into your projects.)

And I'm still interested for actual viewable examples where it worked well – I've still never seen one.

I sympathize if your historic use might be too private/proprietary to share details, but in the absence of any public examples of this particular code working well, it's hard to justify any cost of maintenance/user-frustration.

By my understanding, quoting (to support a specific point) is very different than summarization. And trusting the old code's excerpts to reflect the original faithfully would be unwise - its technique couldn't be sure if a sentence were a quote of arguments the main document was refuting, or holding up to ridicule.

And my other point remains: the other (stronger, better documented, test-case-covered, better-coded, easier-to-demonstrate) similarity-algorithms can likely find representative excerpts, to quote verbatim if that is necessary, even better than the very fragile/crude/underpowered/inefficient/idiosyncratic gensim.summarization did. Anyone who needs such functionality should try them in that role.

Simple concatenation & recursion can easily be bundled in a single function call in user code. The claim "LLMs can't do this - unless you put their operations into a simple loop of a few lines of code" isn't really the same as "LLMs can't do this".
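
For example, the whole wrapper is on the order of this sketch, where `summarize_chunk` stands in for whatever LLM call is in use:

```python
def summarize_long(text, summarize_chunk, max_chars=8000):
    """Recursively summarize text longer than one model context window."""
    if len(text) <= max_chars:
        return summarize_chunk(text)
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    combined = "\n".join(summarize_chunk(c) for c in chunks)
    if len(combined) >= len(text):  # guard against non-shrinking summaries
        return combined
    return summarize_long(combined, summarize_chunk, max_chars)
```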

fredzannarbor commented 1 year ago

I agree with your planned way forward. There are better alternatives than gensim.oldsummarization. I will only observe that in my spot testing, gensim’s old summarizer did pretty well at pulling out 20 significant sentences from book-length manuscripts. I was happy with the results, but that is scarcely scientific.

gojomo commented 1 year ago

That's helpful to know, even as anecdotal spot testing.

Can you say any more about these texts' sizes in words or sentences, and their domain/style? (E.g., were they fiction/non-fiction, academic/popular/governmental, etc.?)

I ask because I'm still curious where oldsummarization was providing value – none of our documentation/demo/tutorial examples showed good results, and it may be possible to match/exceed its value with a few dozen lines of other code using better-supported remaining algorithms (& more-standard tokenization functions/libraries).

So the sort of "single function or command line call" functionality you'd like might still be possible, if there were a few more hints about what reference set of texts, & baseline performance, were worth optimizing around.

fredzannarbor commented 1 year ago

I was usually editing nonfiction books on history written for enthusiast audiences, so there was a good amount of proper nouns, foreign-language text, and technical terms.
