huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Extractive Text Summarization #4332

Closed: timsuchanek closed this issue 4 years ago

timsuchanek commented 4 years ago

🚀 Feature request

While abstractive text summarization with T5 and BART already achieves impressive results, it would be great to add support for state-of-the-art extractive text summarization, such as the recent MatchSum, which outperforms PreSum by a significant margin.

Motivation

The BART-based summarization is already pretty awesome. However, I recently got this summary from BART for a Bill Gates article:

"We are seeing more and more people get sick and dying, and this is a good thing, but it also means that we have less time to prepare for the future."

It seems to me that extractive methods are still "less risky" while still achieving great results.

So adding an easy way to access one of the extractive methods, for example the new MatchSum algorithm, whose pre-trained models for CNN/DM have now been released, would be really awesome!
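
For context, extractive methods select sentences verbatim from the source instead of generating new text, which is why they can't hallucinate the way the BART output above does. A toy sketch of the idea (a simple TF-IDF centroid baseline, not MatchSum itself; sentence splitting is assumed to happen upstream):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extractive_summary(sentences, num_sentences=3):
    tfidf = TfidfVectorizer().fit_transform(sentences)    # one row per sentence
    centroid = np.asarray(tfidf.mean(axis=0))             # document "centroid"
    scores = cosine_similarity(tfidf, centroid).ravel()   # relevance of each sentence
    keep = sorted(np.argsort(scores)[-num_sentences:])    # top sentences, original order
    return " ".join(sentences[i] for i in keep)

sents = [
    "Gates argues pandemic preparedness needs sustained funding.",
    "The weather in Seattle was unremarkable.",
    "Faster vaccine trials would shorten future outbreaks.",
]
print(extractive_summary(sents, num_sentences=2))
```

Whatever comes out is guaranteed to be sentences that actually appear in the input, which is the "less risky" property above. MatchSum scores whole candidate summaries rather than individual sentences, but the extractive guarantee is the same.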

Laksh1997 commented 4 years ago

Are you sampling tokens? If so, turn sampling off and maybe also turn up the beam size; that will give more extractive outputs.
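
Something like this (a minimal sketch; the generation kwargs are forwarded to model.generate(), and article.txt is a placeholder for your input):

```python
from transformers import pipeline

# Defaults to a BART checkpoint fine-tuned on CNN/DailyMail.
summarizer = pipeline("summarization")

text = open("article.txt").read()  # placeholder for the document to summarize

# Pure beam search, no sampling: deterministic and usually more extractive.
result = summarizer(text, do_sample=False, num_beams=8, max_length=300)
print(result[0]["summary_text"])
```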

timsuchanek commented 4 years ago

Thanks @Laksh1997 for the idea; unfortunately, the results don't get better. This is the article I'm summarizing: https://www.gatesnotes.com/Health/Pandemic-Innovation

Result with `min_length=500`, `max_length=1000`, and otherwise default settings of the summarization pipeline:

This is the first part of a two-part series on the impact of global warming. The second part of the series will look at ways to reduce the effects of climate change. The third part will focus on ways to prevent the spread of the disease. The fourth part will be on how we can make sure we don't see a repeat of what happened in the 1980s and 1990s. It will also look at how to make sure that we don’t see an increase in the number of people who need to be treated for the disease every time it rears its head. The final part of this series looks at ways we can reduce the impact on the economy of the global warming crisis by reducing the amount of money spent on health care. It is also a look at some of the ways in which we can prevent the disease from getting worse, such as by making sure we have better access to the right equipment and training for doctors and nurses. The last part will look back at how we were able to stop the disease’s spread in the first place, and how we’ve been able to do so since then. It’ll be interesting to see how we respond to the current crisis, which has caused a lot of people to lose their jobs and homes, as well as the loss of health care and the cost of living that has gone up by a third since the beginning of the year. We’re in the midst of a global pandemic, but we have a long way to go before we see the full extent of the damage caused by climate change, which is likely to be much worse in the coming years. We also need to look at what we can do to prevent it from happening in the future, including ways to make it easier for people to get the care they need. We need to make the most of the time we have left before it gets worse, and we need to do it in a way that makes it easier to get to the bottom of the problem. We can do this by focusing on what we are doing now, rather than focusing on the causes of the illness, which can be hard to come by in a small number of cases. We should also be looking for ways to keep the disease at a low level so that it doesn't spread as far and as fast as possible. We are seeing more and more people get sick and dying, and this is a good thing, but it also means that we have less time to prepare for the future.

Result with `num_beams=8` and `do_sample=False`:

This is the first part of a two-part series on the impact of climate change on the U.S. and the world. The second part of the series will look at ways to reduce the effects of global warming. The third part will focus on how we can reduce the number of people affected by climate change. The fourth and final part will be a look at some of the ways we can make sure we don't suffer the same fate as those who have been affected by the climate change pandemics of the past few years. It will be the first of a series of articles on the topic, and will be followed by a series on climate change in the next few months. For more information, go to: http://www.cnn.com/2013/01/29/climate-change/index.html#storylink=cpy, and for more information on the Global Warming Program, visit: http://www.climatechange.org/2013-01-29/global-warming-program/. For more on the World Health Organization (WHO), go to www.welcome.org/. For information on how to get involved in the fight against climate change, visit the WHO’s website. For information about how to help people in need of financial assistance, visit www.worldhealth.org. For confidential support, call the Samaritans on 08457 90 90 90 or visit a local Samaritans branch, see www.samaritans.org for details. For support on suicide matters call the National Suicide Prevention Lifeline on 1-800-273-TALK (8255). For support in the UK, visit The Samaritans’ local branch, or click here. For help in the United States, see the National Institutes of Health (NHS), which has a range of programs that can help people cope with the changing nature of the threat to their health, such as the threat of pneumococcal meningitis, sepsis, stroke, and other ailments. For all the information you need, visit http:www.nhs.uk/news/publications/cnn/2014/07/09/world-health-paediatric-pneumonia-and-sickness-in-the-middle-of-a-drought.html. For the full series, see:http:/ / www.nhc.gov/newspeak/stories/2014-09-09/the-world-succeeding-against-climate-changes.html?title=World-warming-disease-infiltrating-crisis-initiative.

I have no idea where it gets the idea of climate change from :D

Laksh1997 commented 4 years ago

@timsuchanek Note that transformer models can only consider a context of up to N subtokens (in the case of BART, I think N = 1024).

So, if the input context (the long document) is longer than this, it will be truncated to 1024 subtokens.

This means that if you ask the decoder to generate more than what it can consider in context, it will at best copy the context and at worst start to make things up.
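
A quick way to see how much of a document survives is to count its subtokens before calling the model (a sketch; 1024 is BART's encoder position limit, and article.txt stands in for the long document):

```python
from transformers import BartTokenizer

MAX_CTX = 1024  # BART encoder's positional-embedding limit

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
text = open("article.txt").read()  # placeholder for the long document

ids = tokenizer.encode(text)  # subword token ids, not words
print(f"document is {len(ids)} subtokens")
if len(ids) > MAX_CTX:
    print(f"everything after subtoken {MAX_CTX} will be silently dropped")
```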

I'm not sure if `min_length` and `max_length` refer to subword tokens or whole words in the Hugging Face implementation.
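
One way to check empirically is to re-tokenize the generated summary and see which count lands inside the bounds (a sketch; the parameter values here are arbitrary):

```python
from transformers import pipeline, BartTokenizer

summarizer = pipeline("summarization")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

text = open("article.txt").read()  # placeholder input
out = summarizer(text, min_length=100, max_length=200, do_sample=False)
summary = out[0]["summary_text"]

# If the subtoken count (not the word count) lands inside [100, 200],
# the bounds are being applied to subword tokens.
print(len(summary.split()), "words,", len(tokenizer.encode(summary)), "subtokens")
```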

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.