lukakostic closed this issue 6 years ago
Hi! By default we use newlines to divide the text into sentences. Can you try removing the line breaks from the input text?
What do you mean by breaks? Like replacing dots with '\n', or replacing '\b' with '\n'?
Removing "\n" from the text altogether.
But then won't things like this:
A
B
turn to 'AB' in some user-made text?
Would it be better to turn '\n' to '.' and remove any duplicate dots (... to .)?
Doing the above (replacing newlines with '.' and allowing only one consecutive dot) breaks it. Now summarize(text) and summarize with ratio=0.2 both return nothing (nothing is printed), and summarize with words=10 prints "Document summarization is another." The keywords are still the same.
It seems to add dots in places it shouldn't, because the input has newlines in some unnecessary places. But if I remove the newlines, separate words can get merged into one when they shouldn't be...
Just removing newlines ("\n" to "") results in the same output as turning "\n" into "."
Except that the keywords now include "technologyis search", which is obviously an artifact of removing newlines:
An example of the use of summarization technology is search engines such as Google
It gives a good result when I manually correct the text to remove unneeded newlines, but the summary with words=10 is now empty. Could it be that it can't produce a summary that short?
Our summaries consist of the most relevant sentences in a given text. The task of splitting a text into sentences is not solved, so we make a best effort using this regex.
That regex treats different lines (i.e., pieces of text separated by \n) as different sentences. In another project we evaluated changing this behavior, but in the end decided to keep it as it is, since it is an easier task for the user to remove newlines if the text is well formatted. This should be better documented, so I created a ticket for that.
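Removing newlines does not have to glue words together if each line break is replaced with a space instead of being deleted. A minimal sketch, assuming the newlines in the input are only hard wraps and not real sentence boundaries; `unwrap_lines` is a hypothetical helper, not part of the library:

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines so they are not read as separate sentences."""
    # Hypothetical helper: collapse every run of whitespace (including "\n")
    # into a single space, so "A\nB" becomes "A B" rather than "AB".
    return re.sub(r"\s+", " ", text).strip()


wrapped = "An example of the use of summarization technology\nis search engines such as Google."
print(unwrap_lines(wrapped))
```

The cleaned string can then be passed to summarize(...) in place of the original text.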
The other behavior does seem like a bug. It could be that the summarizer misbehaves when the words parameter is too small. Can you create a separate issue for that with an example? I will close this one.
Thank you!
With input of
text = """Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. As the problem of information overload has grown, and as the quantity of data has increased, so has interest in automatic summarization. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. An example of the use of summarization technology is search engines such as Google. Document summarization is another."""
print(summarize(text))
I get
Automatic summarization is the process of reducing a text document with a Document summarization is another.
Which is much derpier and not the same as the result you got (in the example):
Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.