huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Pegasus Xsum Returning Tokens Not In Source Text #8685

Closed 1337-Pete closed 3 years ago

1337-Pete commented 3 years ago

I'm currently using the sshleifer/distill-pegasus-xsum-16-8 model to perform abstractive text summarization, and I've found this particular model to be the most useful for my desired application. However, when summarizing the inputted source text, the output contains tokens that appear nowhere in the source text. I suspect Pegasus is returning tokens from the dataset it was trained on. That said, is fine-tuning needed? Should hyperparameter tweaking solve this?

I wonder if PEGASUS + GAN could help teach the model to abstract from tokens in the input text?

Here's an example:

Source Text: German shares suffered their weakest day since early June on Wednesday as the government agreed on an emergency lockdown to combat surging COVID-19 cases, with other European markets following suit on fears of more curbs around the continent. The German DAX sank as much as 5% before cutting some losses to close down 4.2% at its lowest in five months. The precise measures were still subject to negotiation, with sources saying the government had agreed to shut bars and restaurants from Nov. 2. The pan-European STOXX 600 index fell 3% in its sharpest one-day drop in five weeks. France's main index dropped 3.4% ahead of a televised address by President Emmanuel Macron at 8:00 pm when he is expected to issue stay-at-home orders.

# XSUM 16-8
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

torch_device = "cuda" if torch.cuda.is_available() else "cpu"
src_text = "German shares suffered their weakest day since early June on Wednesday ..."  # full source text from above

model_name = "sshleifer/distill-pegasus-xsum-16-8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_pegasus_distill_xsum_16_8 = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(torch_device)
batch = tokenizer.prepare_seq2seq_batch([src_text], truncation=True, padding='longest').to(torch_device)
translated = model_pegasus_distill_xsum_16_8.generate(**batch, num_beams=9, num_return_sequences=3, temperature=1, length_penalty=5, max_length=256, min_length=0)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)

Output Text: Shares in Europe have fallen sharply after the German government agreed to shut down bars and restaurants in a bid to curb the spread of carbon monoxide (CO) in the country's capital, Berlin. The pan-European STOXX 600 index fell 3% in its sharpest one-day drop in five weeks, while the FTSE 100 index closed down 3.7% in its sharpest one-day fall in five weeks.

From the output text, one can see that carbon monoxide (CO), Berlin, and the FTSE 100 are mentioned nowhere in the input text.
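A quick way to quantify this is a rough sketch (assuming src_text and tgt_text from the snippet above) that lists the summary words never seen in the source:

# Rough check: which summary words never appear in the source text?
# Assumes src_text (str) and tgt_text (list of str) from the snippet above.
import re

src_words = set(re.findall(r"\w+", src_text.lower()))
for summary in tgt_text:
    novel = [w for w in re.findall(r"\w+", summary.lower()) if w not in src_words]
    print(novel)  # e.g. words like 'berlin', 'monoxide' or 'ftse' show up as novel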

LysandreJik commented 3 years ago

Not an expert in summarization, but abstractive text summarization does not extract sequences/tokens from the initial text to produce a summary; that would be extractive text summarization. Abstractive summarization instead rephrases the source, which seems to be what is happening here.

On a second note, I believe the Pegasus checkpoints were trained on very long sequences, so I'm not entirely sure how they would deal with shorter sequences like the one you used here.

On a third note, we try to keep the GitHub issues reserved for bug reports/feature requests; you would have more luck asking this over on the forum.

@patrickvonplaten or @patil-suraj can chime in if I'm wrong.

patrickvonplaten commented 3 years ago

The hyperparameters seem very extreme to me... also temperature=1 does not do anything and length_penalty=5 is very high - also note that a length_penalty > 1 actually incentivizes longer sequences. @sshleifer 's model already has good hyper-parameters set as default values that you can see here: https://huggingface.co/sshleifer/distill-pegasus-xsum-16-8/blob/main/config.json
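For reference, here is a minimal sketch (assuming only the checkpoint name from above) to print those defaults directly from the config:

# Inspect the generation defaults stored in the checkpoint's config.json
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sshleifer/distill-pegasus-xsum-16-8")
print(config.num_beams, config.length_penalty, config.max_length, config.min_length)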

If you just use those, e.g.:

translated = model_pegasus_distill_xsum_16_8.generate(**batch)

you get this summary:

European shares fell sharply on Wednesday as investors remained cautious ahead of a speech by France's president later in the day.

You can try it yourself here: https://huggingface.co/sshleifer/distill-pegasus-xsum-16-8?text=German+shares+suffered+their+weakest+day+since+early+June+on+Wednesday+as+the+government+agreed+on+an+emergency+lockdown+to+combat+surging+COVID-19+cases%2C+with+other+European+markets+following+suit+on+fears+of+more+curbs+around+the+continent.+The+German+DAX+sank+as+much+as+5%25+before+cutting+some+losses+to+close+down+4.2%25+at+its+lowest+in+five+months.+The+precise+measures+were+still+subject+to+negotiation%2C+with+sources+saying+the+government+had+agreed+to+shut+bars+and+restaurants+from+Nov.+2.+The+pan-European+STOXX+600+index+fell+3%25+in+its+sharpest+one-day+drop+in+five+weeks.+France%27s+main+index+dropped+3.4%25+ahead+of+a+televised+address+by+President+Emmanuel+Macron+at+8%3A00+pm+when+he+is+expected+to+issue+stay-at-home+orders.



My conclusion would be that it's just the hyperparameters that are badly chosen - not sure if @sshleifer has something to add...

sshleifer commented 3 years ago
stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.