huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Summarization pipeline max_length parameter seems to just cut the summary rather than generating a complete sentence within the max length #3579

Closed Weilin37 closed 4 years ago

Weilin37 commented 4 years ago

šŸ› Bug

Information

Model I am using (Bert, XLNet ...): default model from pipeline("summarization")

Language I am using the model on (English, Chinese ...): English

I am using the pipeline for summarization in the most up-to-date version of Transformers. I am inputting a long piece of text and calling the summarizer as: summarizer(PIECE_OF_TEXT, max_length=50).

I was expecting the summarizer to generate a summary within 50 words, but the output appears to be cut off: it ends with a comma and does not end in a grammatically sensible way (see example below).

The piece of text to be summarized: Renal-cell carcinoma is characterized by susceptibility to both immunotherapeutic and antiangiogenic treatment approaches and resistance to cytotoxic chemotherapy.1 Agents such as sunitinib that target the vascular endothelial growth factor (VEGF) pathway are standard first-line therapy for advanced disease.2-7 Despite the approval of several targeted therapies by entities such as the Food and Drug Administration, the European Medicines Agency, and the Pharmaceuticals and Medical Devices Agency, the survival rate among patients with metastatic renal-cell carcinoma has plateaued.

Both the VEGF receptor tyrosine kinase inhibitor axitinib and the anti–programmed death 1 (PD-1) monoclonal antibody pembrolizumab have shown antitumor activity in patients with previously untreated advanced clear-cell renal-cell carcinoma.6,10 In a phase 1b trial involving patients with previously untreated metastatic renal-cell carcinoma, 73% (95% confidence interval [CI], 59 to 84) of the patients who received pembrolizumab plus axitinib had a response; 65% of patients had at least one treatment-related adverse event.11 We conducted the KEYNOTE-426 trial to determine whether pembrolizumab plus axitinib would result in better outcomes than sunitinib in patients with previously untreated advanced renal-cell carcinoma.

And the summary: Renal-cell carcinoma is characterized by susceptibility to both immunotherapeutic and antiangiogenic treatment approaches. Agents such as sunitinib that target the vascular endothelial growth factor (VEGF) pathway are standard first, axitinib and the anti–programmed death 1 (PD-1) monoclonal antibody pembrolizumab have shown antitumor activity in patients with previously untreated advanced clear-cell renal-cell carcin,
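
For reference, the call that produced this output looks roughly like the sketch below (PIECE_OF_TEXT stands for the excerpt above; the pipeline's default model does the summarizing):

from transformers import pipeline

# Default summarization pipeline; max_length caps the generated summary at 50 tokens
summarizer = pipeline("summarization")
summary = summarizer(PIECE_OF_TEXT, max_length=50)
print(summary[0]["summary_text"])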

aychang95 commented 4 years ago

Try using the T5 summarizer instead like below:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# example_text holds the passage to be summarized (the excerpt above); T5 expects a "summarize: " prefix
inputs = tokenizer.batch_encode_plus(["summarize: " + example_text], max_length=1024, return_tensors="pt", pad_to_max_length=True)  # Batch size 1
outputs = model.generate(inputs['input_ids'], num_beams=4, max_length=50, early_stopping=True)

print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in outputs])

The above excerpt gave me a summary of:

'the survival rate among patients with metastatic renal-cell carcinoma has plateaued . agents such as sunitinib that target the vascular endothelial growth factor pathway are standard first-line therapy for advanced disease'

If you still want to use Bart:

My assumption is that this is not a bug. I may be wrong, but it seems the Bart summarizer has a bias towards the first couple of sentences of the original text. It's still abstractive, as can be seen from the subtle differences in the summary you're getting. If you specify min_length as a higher value, like 100, you start to see pointers to sentences beyond the first couple. A sketch of that experiment follows.
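
A rough sketch of that min_length experiment (treating the pipeline keyword arguments as pass-throughs to generate is an assumption here; example_text is a placeholder for the excerpt above):

from transformers import pipeline

# bart-large-cnn with a floor of 100 tokens on the summary length
summarizer = pipeline("summarization", model="bart-large-cnn")
summary = summarizer(example_text, min_length=100)
print(summary[0]["summary_text"])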

Trying a min_length of 100 using bart-large-cnn gave me the summary below:

'Renal-cell carcinoma is characterized by susceptibility to both immunotherapeutic and antiangiogenic treatment approaches and resistance to cytotoxic chemotherapy. Agents such as sunitinib that target the vascular endothelial growth factor (VEGF) pathway are standard first-line therapy for advanced disease. We conducted the KEYNOTE-426 trial to determine whether pembrolizumab plus axit inib would result in better outcomes than sunit in patients with previously untreated advanced renal- cell carcinoma.'

You can see that the last sentence is not drawn from the opening sentences of the excerpt.

patrickvonplaten commented 4 years ago

As @aychang95 suggested, you have to play around with the generate method arguments to see what works best for your example. Especially take a look at num_beams, max_length, min_length, early_stopping and length_penalty.

I just noticed that I forgot to add a good default setting to the Bart summarization pipeline. Just uploaded it - see here: https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json

The summarization pipeline should work better now :-)

Weilin37 commented 4 years ago

> As @aychang95 suggested, you have to play around with the generate method arguments to see what works best for your example. Especially take a look at num_beams, max_length, min_length, early_stopping and length_penalty.
>
> I just noticed that I forgot to add a good default setting to the Bart summarization pipeline. Just uploaded it - see here: https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json
>
> The summarization pipeline should work better now :-)

Thank you! How do I go about updating the model? My code is below, but I receive an error:

from transformers import pipeline, AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModel.from_pretrained("facebook/bart-large-cnn")
summarizer = pipeline("summarization",  model = model, tokenizer = tokenizer)

OSError: Model name 'facebook/bart-large-cnn' was not found in tokenizers model name list (bart-large, bart-large-mnli, bart-large-cnn, bart-large-xsum). We assumed 'facebook/bart-large-cnn' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

patrickvonplaten commented 4 years ago

from transformers import pipeline, AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("bart-large-cnn")
model = AutoModelWithLMHead.from_pretrained("bart-large-cnn")
summarizer = pipeline("summarization",  model = model, tokenizer = tokenizer)

works :-).

Note that "bart-large-cnn" is the default model for the summarization pipeline. The code above is equivalent to:

from transformers import pipeline
summarizer = pipeline("summarization")

Weilin37 commented 4 years ago

I was also able to discover another reason why the summarization was cut off: the max_length I set conflicted with the default min_length. It looks like max_length takes priority, so the summary was truncated. I think it would be useful if this were managed automatically, or at least if a warning were displayed.
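
One way to avoid that conflict is to pin both bounds explicitly instead of relying on the default min_length (a minimal sketch, assuming the summarizer pipeline object from earlier; the values are illustrative):

# Pass both bounds so the pipeline's default min_length cannot clash
# with the max_length you asked for
summary = summarizer(PIECE_OF_TEXT, min_length=10, max_length=50)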

girijesh97 commented 4 years ago

Hi @patrickvonplaten I just found that summarization takes at most 1024 tokens into consideration when generating a summary with its default parameters. I would like to know if I can increase the input size so that more of the text is considered when generating a summary. I got the following message.

Your max_length is set to 1300, but you input_length is only 1024. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)

patrickvonplaten commented 4 years ago

As far as I know, the maximum input length for Bart is 1024 tokens and for T5 it's 512. So, depending on your model, I don't think you can increase max_length beyond the model's own maximum input length; longer inputs have to be truncated before generating, as sketched below.
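
A truncation sketch in the same style as the batch_encode_plus call earlier in the thread (long_text is a placeholder, and a Bart tokenizer/model loaded as shown earlier is assumed; whether batch_encode_plus truncates silently or warns depends on the library version):

# Clip the input to Bart's 1024-token limit before generating;
# tokens beyond the limit are dropped
inputs = tokenizer.batch_encode_plus([long_text], max_length=1024, return_tensors="pt", pad_to_max_length=True)
outputs = model.generate(inputs['input_ids'], num_beams=4, max_length=150, early_stopping=True)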

girijesh97 commented 4 years ago

@patrickvonplaten I got your point. I have another question: what is the maximum number of tokens (or words) we can provide to Bart for summary generation? Also, what should I do to generate a summary from a large text that contains approximately 100k words?

patrickvonplaten commented 4 years ago

A text that contains 100k words is probably more of a novel than a "text" :D. For these kinds of texts you would need to chunk the input before feeding it to Bart (see the sketch below); your memory would explode anyway at such sizes. In a couple of days we will add Reformer, which can handle super long input text. We will soon also have an encoder-decoder model for Reformer, which you could then use for summarization.
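
A hypothetical chunk-and-summarize sketch (the word-based split and chunk size are illustrative assumptions; splitting on tokens with the tokenizer would track the 1024-token limit more precisely, and the joined partial summaries could themselves be re-summarized):

from transformers import pipeline

def summarize_long_text(text, chunk_words=700):
    # Split the document into word chunks small enough to fit under
    # Bart's 1024-token input limit, summarize each chunk, then join
    # the partial summaries
    summarizer = pipeline("summarization")
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    partials = [summarizer(chunk, max_length=100)[0]["summary_text"] for chunk in chunks]
    return " ".join(partials)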