jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License
2.88k stars, 239 forks

Create a method to summarize text data #28

Open selimelawwa opened 4 years ago

selimelawwa commented 4 years ago

Overview: We need a method that takes a pd.Series of text data and summarizes it, identifying topics, important entities and figures.

Approach: Research deep learning approaches to text summarization and decide on the most suitable one. Use spaCy or a similar library to identify the main entities.

Help is required and all ideas are welcome!

selimelawwa commented 4 years ago

@jbesomi After initial research, there are two types of text summarization:

  1. Extractive Summarization: identify the important sentences or phrases in the original text and extract only those. This will be easier to implement. See the related links: link1 link2 link3

  2. Abstractive Summarization: This is a very interesting approach. Here, we generate new sentences from the original text, in contrast to the extractive approach, which only reuses sentences that are already present. The sentences generated through abstractive summarization might not appear in the original text. This approach uses deep learning and is more concerned with the semantic meaning of words. link1 link2
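To make the extractive idea concrete, here is a minimal sketch in plain Python. It is only an illustration under simple assumptions (naive regex sentence splitting, word-frequency scoring, no stop-word removal), not a proposed Texthero implementation; the function name `extractive_summary` is hypothetical.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= n_sentences:
        return " ".join(sentences)

    def tokenize(s):
        return re.findall(r"[a-z']+", s.lower())

    # Word frequencies over the whole text.
    freq = Counter(tokenize(text))

    # Score a sentence by the average frequency of its words.
    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)


summary = extractive_summary("Cats sleep a lot. Cats eat fish. Dogs bark.")
```

Every sentence in the output is copied verbatim from the input, which is what distinguishes this from the abstractive approach.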

jbesomi commented 4 years ago

Hi Selim!

Note that for identifying entities, we already have texthero.nlp.named_entities.

Great, interesting insights! The only concern is that it might be computationally expensive, especially when the initial DataFrame is large.

aelnoshokaty commented 4 years ago

Very insightful, Selim. Deep learning models could be a challenge for large datasets, though, so we may need to trade some model complexity for applicability to big data.

jwabant commented 4 years ago

@selimelawwa Are you thinking of summarizing a complete Series, or rather of creating, for example, a new column with a summary of each item? I think the second option would match more use cases, but I could be wrong.
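The two options can be sketched as follows. This is a rough illustration only: `lead_summary` is a hypothetical placeholder summarizer (it keeps the first sentence), and a plain list stands in for the pd.Series.

```python
import re


def lead_summary(text, n_sentences=1):
    """Placeholder summarizer: keep the first n sentences (lead baseline)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n_sentences])


def summarize_all(texts, summarizer=lead_summary):
    """Option 1: a single summary for the whole collection."""
    return summarizer(" ".join(texts))


def summarize_each(texts, summarizer=lead_summary):
    """Option 2: one summary per item, same length as the input."""
    return [summarizer(t) for t in texts]


docs = [
    "Texthero cleans text. It also plots it.",
    "Summaries save time. They hide detail.",
]
overall = summarize_all(docs)    # one string
per_item = summarize_each(docs)  # one string per document
# With pandas, the per-item variant would be: s.apply(lead_summary)
```

The per-item variant returns a result aligned with the input, so it could be assigned directly to a new DataFrame column.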

For the extractive approach, the Gensim summarizer, which uses BM25 as a variation for the similarity function, is interesting and easy to use: https://radimrehurek.com/gensim/summarization/summariser.html

For the abstractive approach, the most efficient method depends on the use case. Pointer-generator networks might be a good compromise for Texthero even if they are no longer state of the art: https://arxiv.org/pdf/1704.04368.pdf (see https://paperswithcode.com/paper/get-to-the-point-summarization-with-pointer for implementations). I'm also adding some more recent research in case it is useful: https://paperswithcode.com/paper/text-summarization-with-pretrained-encoders, https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html
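For readers unfamiliar with BM25, here is a small self-contained sketch of the scoring function applied to tokenized sentences. It is not Gensim's code; the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


sentences = [["cats", "eat", "fish"], ["dogs", "bark"], ["cats", "sleep"]]
scores = bm25_scores(["cats", "fish"], sentences)
```

In a graph-based extractive summarizer, scores like these can serve as edge weights between sentences before a ranking step picks the most central ones.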