miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.
https://miso-belica.github.io/sumy/
Apache License 2.0

Is it possible to get how many texts summarized by the summarizer? #188

Open darwinharianto opened 1 year ago

darwinharianto commented 1 year ago

Suppose I have this kind of text

      Check this out.
      Everything checked out.
      Not so much is checked.
      I am not sure what is happening.
      The dog is burnt.

Running the LSA algorithm with 3 sentences count gives me

    Everything checked out. 
    Not so much is checked. 
    The dog is burnt.

Is it possible to get the count of original sentences each summary sentence represents? I assume it would look like this:

    Everything checked out. -> [Check this out., Everything checked out.] -> 2
    Not so much is checked. -> [Not so much is checked., I am not sure what is happening.] -> 2
    The dog is burnt. -> [The dog is burnt.] -> 1

Can I get this info from the SVD matrix?

Edit: fixed the wrong count for the dog sentence
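To illustrate the idea with a minimal numpy sketch (this is not sumy's API; the term-sentence matrix and the chosen indices are invented for the five example sentences): each original sentence can be assigned to the most similar summary sentence in the SVD topic space, and the group sizes give the counts asked about above.

```python
# A minimal sketch (not sumy's API): assign every original sentence to the
# most similar summary sentence in the LSA topic space, then count groups.
import numpy as np

# Toy binary term-sentence matrix for the 5 example sentences
# (rows = terms: check, this, everything, not, much, happening, dog, burnt).
A = np.array([
    [1, 1, 1, 0, 0],  # check
    [1, 0, 0, 0, 0],  # this
    [0, 1, 0, 0, 0],  # everything
    [0, 0, 1, 1, 0],  # not
    [0, 0, 1, 0, 0],  # much
    [0, 0, 0, 1, 0],  # happening
    [0, 0, 0, 0, 1],  # dog
    [0, 0, 0, 0, 1],  # burnt
], dtype=float)

u, sigma, vt = np.linalg.svd(A, full_matrices=False)
# Each column of diag(sigma) @ vt is one sentence's vector in topic space.
sentence_vecs = (np.diag(sigma) @ vt).T

selected = [1, 2, 4]  # indices the summarizer picked (2nd, 3rd, 5th sentence)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

counts = {i: 0 for i in selected}
for vec in sentence_vecs:
    best = max(selected, key=lambda i: cos(vec, sentence_vecs[i]))
    counts[best] += 1

print(counts)  # {1: 2, 2: 2, 4: 1}
```

With this toy matrix the first sentence groups with the second, the fourth with the third, and the fifth stands alone, which matches the counts in the question.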

miso-belica commented 1 year ago

Hello, sorry but I can't see a pattern there. How do you determine which sentences you want to return for a given summarized sentence? Once it is the sentence before, the second time the one after. Also, what are the numbers? I thought it was the count of sentences in the context, but it is always 2, even for one sentence.

darwinharianto commented 1 year ago

How do you determine which sentences you want to return for a given summarized sentence? Once it is the sentence before, the second time the one after.

The order follows the input order:

Check this out. -> 1st sentence
Everything checked out. -> 2nd sentence
Not so much is checked. -> 3rd sentence
I am not sure what is happening. -> 4th sentence
The dog is burnt. -> 5th sentence

The results would be

    Everything checked out. -> close to the 1st and 2nd sentences [Check this out., Everything checked out.] -> this sentence represents 2 sentences
    Not so much is checked. -> close to the 3rd and 4th sentences [Not so much is checked., I am not sure what is happening.] -> this sentence represents 2 sentences
    The dog is burnt. -> not close to any [The dog is burnt.] -> this sentence represents only itself

Also, what are the numbers? I thought it was the count of sentences in the context, but it is always 2, even for one sentence.

Ah sorry, I wrote the wrong count.

I am under the impression that the LSA algorithm only shows the most distinct sentences and hides those that are already represented by other sentences. Is this correct?

miso-belica commented 1 year ago

Thank you for more info.

I am under the impression that the LSA algorithm only shows the most distinct sentences and hides those that are already represented by other sentences. Is this correct?

Yes, you could say it like that, I think. LSA works with the concept of (very abstract) topics and tries to pick representative sentences for them.

I believe when you say "close to 1st and 2nd sentence" you don't mean close in sentence position but in vector space, right? You would like to know, for every sentence in the resulting summary, the list of removed sentences it represents in the original text and how many of them there are. I am afraid there is no easy way to get this info from the LSA summarizer. Summarizers are a black box and one can only tweak them slightly sometimes. This would require creating a completely new summarizer that also picks sentences as you described, but I don't really know how I would approach it.

If this is just one time thing maybe it is easier to use ChatGPT for this 😃

darwinharianto commented 1 year ago

I believe when you say "close to 1st and 2nd sentence" you don't mean close in sentence position but in vector space, right?

Ah, yes, from the vector space

You would like to know, for every sentence in the resulting summary, the list of removed sentences it represents in the original text and how many of them there are. I am afraid there is no easy way to get this info from the LSA summarizer. Summarizers are a black box and one can only tweak them slightly sometimes. This would require creating a completely new summarizer that also picks sentences as you described, but I don't really know how I would approach it.

Yes, I wasn't trying to tweak the LSA itself. I was thinking that by looking at the powered_sigma or v_matrix, one could derive such a relation from it (I believe this is the similarity matrix?).

Something that looks like this.

(attached screenshot: Screenshot 2023-01-30 at 11 26 12)

Maybe, given the similarity matrix, one could get the number of sentences represented by them using hierarchical clustering.
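As a rough sketch of that idea (the similarity values below are invented for illustration, and a similarity threshold is used here as a simplified stand-in for a full hierarchical clustering): attach each left-out sentence to the most similar summary sentence, letting unrelated sentences stay on their own.

```python
# Sketch of the grouping idea: given a sentence-sentence similarity matrix,
# attach each left-out sentence to the most similar summary sentence,
# using a threshold so unrelated sentences stay on their own.
import numpy as np

# Hypothetical symmetric similarity matrix for the 5 example sentences.
S = np.array([
    [1.0, 0.8, 0.4, 0.1, 0.0],
    [0.8, 1.0, 0.5, 0.1, 0.0],
    [0.4, 0.5, 1.0, 0.7, 0.0],
    [0.1, 0.1, 0.7, 1.0, 0.1],
    [0.0, 0.0, 0.0, 0.1, 1.0],
])
summary = [1, 2, 4]   # sentences picked by the summarizer (cluster leaders)
threshold = 0.3       # minimum similarity to join a leader's group

groups = {i: [i] for i in summary}
for j in range(len(S)):
    if j in summary:
        continue
    leader = max(summary, key=lambda i: S[i, j])
    if S[leader, j] >= threshold:
        groups[leader].append(j)

represented = {leader: len(members) for leader, members in groups.items()}
print(represented)  # {1: 2, 2: 2, 4: 1}
```

A proper hierarchical clustering (e.g. single-linkage over the sentence vectors) would replace the fixed threshold with a cut of the dendrogram, but the counting step stays the same.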

miso-belica commented 1 year ago

Oh, yes. If you are willing to try some clustering algorithm and make your own modifications to LSA, it is definitely doable. You know all the vectors, so you can cluster them together. You even know the initial cluster leaders (the summarized sentences). It sounds like you know what you are doing, so it should be fine :)

darwinharianto commented 1 year ago

You know all the vectors so you can cluster them together.

Yes, about this, I am not sure how to read that part. How can I get the vector matrices? I believe it is somewhere over here? Which variable should I look at?

miso-belica commented 1 year ago

It is a bit more complicated. LSA gives you 2 matrices and a vector. I use only one of the matrices, but their combination always has some meaning. You can check more in the documentation and the link to the original article by Steinberger and Jezek.

Here is the relevant result: https://github.com/miso-belica/sumy/blob/7fd49700082b217ab254bbc6ae4ca72404985f24/sumy/summarizers/lsa.py#L45

And here is the computation of the sentence ranks for the topics: https://github.com/miso-belica/sumy/blob/7fd49700082b217ab254bbc6ae4ca72404985f24/sumy/summarizers/lsa.py#L119-L120
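For reference, a condensed numpy sketch of the kind of computation behind those lines (the toy matrix and the number of kept topics are illustrative, not sumy's exact code; the `powered_sigma` name is borrowed from the linked source): each sentence is ranked by the length of its vector in the reduced topic space, weighted by the singular values, in the style of Steinberger and Jezek.

```python
# Sketch of Steinberger & Jezek style sentence ranking for LSA:
# rank_j = sqrt(sum over the top-k topics of sigma_k^2 * v_kj^2).
import numpy as np

A = np.array([        # toy term-sentence matrix (terms x sentences)
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

u, sigma, vt = np.linalg.svd(A, full_matrices=False)
k = 2                                            # keep only the top-k topics
powered_sigma = sigma[:k] ** 2                   # name borrowed from sumy's lsa.py
ranks = np.sqrt(powered_sigma @ (vt[:k] ** 2))   # one rank per sentence

order = np.argsort(-ranks)                       # strongest to weakest sentence
print(ranks.round(3), order)
```

Because the `v` entries are squared, the sign ambiguity of the SVD does not affect the ranks.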

Also keep in mind that I implemented the library years ago, and I'm giving you advice from my poor memory and what I see in the code now. Unfortunately, I can't dedicate more time to studying the LSA details to advise you further.