google-research / pegasus

Summarization code example? #13

Open youssefavx opened 4 years ago

youssefavx commented 4 years ago

Hi, I'm very curious about this model. I'd love to know how to generate summaries from it. A code snippet in a Python script would be very helpful.

I'd like to input text and get an output summary out.

Also, is it possible to specify the length of the summary?

purgenetik commented 4 years ago

++ It would be nice to add to the README how to build a summary from a given text using a pretrained model.

JingqingZ commented 4 years ago

Hi, if you would like to run the PEGASUS model on the 12 existing datasets, on which PEGASUS has already been fine-tuned, please follow the README and run the model. If you would like to run the model on your customised textual data, you need to configure the data as a new dataset and then fine-tune from the pre-trained model checkpoints. These steps are already in the README.

So far, there is no simple way to go from a single text input to a single summary output, but we're trying to develop this feature. Note that if you have a new dataset, fine-tuning is always necessary, unless you would like to test zero-shot summarization (as we demonstrated in the PEGASUS paper, Section 6.3).

The maximum length of the summary can be specified in decoding. You may control the length of the summary in beam search by using length normalization.
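For reference, length normalization in beam search rescales each hypothesis's summed log-probability by a function of its length, so longer candidates are not unfairly penalized. Below is a minimal sketch of the commonly used GNMT-style penalty; the exact formula used in this codebase may differ, so treat it as an illustration only.

def length_normalized_score(sum_log_prob, length, alpha=0.6):
    # GNMT-style length penalty: a larger alpha encourages longer summaries.
    penalty = ((5.0 + length) / 6.0) ** alpha
    return sum_log_prob / penalty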

JingqingZ commented 4 years ago

Hope the solution in https://github.com/google-research/pegasus/issues/21 can help you create the TFRecords for your dataset and run PEGASUS.
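As a rough illustration (feature names and file layout should be double-checked against the README and issue #21), a supervised summarization dataset can be written to TFRecords with plain-text "inputs"/"targets" string features like this:

import tensorflow as tf

def write_tfrecords(examples, path):
    # examples: list of (input_text, target_summary) string pairs.
    with tf.io.TFRecordWriter(path) as writer:
        for inp, tgt in examples:
            feature = {
                "inputs": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[inp.encode("utf-8")])),
                "targets": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[tgt.encode("utf-8")])),
            }
            writer.write(tf.train.Example(
                features=tf.train.Features(feature=feature)).SerializeToString())

write_tfrecords([("full article text ...", "reference summary ...")],
                "my_dataset-train.tfrecord")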

nkathireshan commented 4 years ago

Hi, please guide me on how to use Colab to run this code, or how to do a local setup. I see it is mentioned to use a GPU instance, and I don't have GPU credits to test it. If it works well on my dataset, then I am fine to buy credits and try it on a GPU. Thank you very much.

JingqingZ commented 4 years ago

The local setup has been explained in the README. A GPU or TPU is not compulsory but highly, highly recommended; running on CPU can be very, very slow. Regarding Colab, please refer to https://github.com/google-research/pegasus/issues/16.

peterjliu commented 4 years ago

You can try GPU/TPU on https://colab.research.google.com for free.

Saurabh1602 commented 4 years ago

Any idea how I could go about generating summaries for a new dataset with zero-shot summarisation? That would be hugely useful if it is possible at this point using the current checkpoints and the code.

peterjliu commented 4 years ago

@Saurabh1602 that's an interesting question and is an active line of research that was out-of-scope for this project. It'd be a cool project to figure out how to do it with the existing checkpoints.

JingqingZ commented 4 years ago

A practical solution would be to create a TFDS dataset or TFRecords for the new dataset, register a new param (as described in the README), and then run the code on the pre-trained checkpoints. Hope this may help: https://github.com/google-research/pegasus/issues/21#issuecomment-643333005.
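Roughly speaking, registering a new param set mirrors the existing entries in pegasus/params/public_params.py. The sketch below is illustrative only: the dataset name, file patterns and hyperparameters are placeholders, and the exact structure should be copied from an existing entry such as the cnn_dailymail one.

@registry.register("my_dataset_transformer")
def my_dataset_transformer(param_overrides):
  return transformer_params(
      {
          "train_pattern": "tfrecord:my_dataset-train.tfrecord",
          "dev_pattern": "tfrecord:my_dataset-dev.tfrecord",
          "test_pattern": "tfrecord:my_dataset-test.tfrecord",
          "max_input_len": 512,
          "max_output_len": 128,
          "train_steps": 10000,
          "learning_rate": 0.0001,
          "batch_size": 8,
      }, param_overrides)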

TheRockXu commented 4 years ago

I got it to make novel predictions by instantiating the estimator object and creating an input_fn.

import itertools
import os
import time

from absl import logging
from pegasus.data import infeed
from pegasus.params import all_params  # pylint: disable=unused-import
from pegasus.params import estimator_utils
from pegasus.params import registry
import tensorflow as tf
from pegasus.eval import text_eval
from pegasus.ops import public_parsing_ops

tf.enable_eager_execution()

# Decoding configuration: single machine, no TPU, beam search of size 5.
master = ""
model_dir = "./ckpt/pegasus_ckpt/cnn_dailymail"
use_tpu = False
iterations_per_loop = 1000
num_shards = 1
param_overrides = "vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6"

# Load the fine-tuned cnn_dailymail checkpoint and its registered params.
eval_dir = os.path.dirname(model_dir)
checkpoint_path = tf.train.latest_checkpoint(model_dir)
params = registry.get_params('cnn_dailymail_transformer')(param_overrides)
pattern = params.dev_pattern
input_fn = infeed.get_input_fn(params.parser, pattern,
                               tf.estimator.ModeKeys.PREDICT)
parser, shapes = params.parser(mode=tf.estimator.ModeKeys.PREDICT)

estimator = estimator_utils.create_estimator(master,
                                             model_dir,
                                             use_tpu,
                                             iterations_per_loop,
                                             num_shards, params)

# SentencePiece vocab used to decode predicted token ids back to text.
_SPM_VOCAB = 'ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model'
encoder = public_parsing_ops.create_text_encoder("sentencepiece",
                                                 _SPM_VOCAB)

input_text = "Eighteen sailors were injured after an explosion and fire on board a ship at the US Naval Base in San Diego, US Navy officials said. The sailors on the USS Bonhomme Richard had 'minor injuries' from the fire and were taken to a hospital, Lt. Cmdr. Patricia Kreuzberger told CNN."
target = "18 sailors injured after an explosion and fire on a naval ship in San Diego"

def input_function(params):
    # Build a tiny in-memory dataset with the same features the parser expects.
    dataset = tf.data.Dataset.from_tensor_slices(
        {"inputs": [input_text, input_text],
         "targets": [target, target]}).map(parser)
    dataset = dataset.unbatch()
    dataset = dataset.padded_batch(
        params["batch_size"],
        padded_shapes=shapes,
        drop_remainder=True)
    return dataset

predictions = estimator.predict(
    input_fn=input_function, checkpoint_path=checkpoint_path)

for i in predictions:
    print(text_eval.ids2str(encoder, i['outputs'], None))
    break

# Output - "The USS Bonhomme Richard had 'minor injuries' from the fire and were taken to a hospital ."

sumeet-iitg commented 4 years ago

I used the code mentioned in https://github.com/google-research/pegasus/issues/13#issuecomment-657167414 but unfortunately I get the following garbage output: leaked leaked leaked leaked leaked [... repeated ...] crop crop crop crop crop [... repeated ...]

TheRockXu commented 4 years ago

@sumeet-iitg First you need to train the cnn_dailymail model by running:

python3 pegasus/bin/train.py --params=cnn_dailymail_transformer \
  --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
  --train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
  --model_dir=ckpt/pegasus_ckpt/cnn_dailymail

TheRockXu commented 4 years ago

Hi, y'all.

I just created a repo that contains a trained PEGASUS servable model and a script with which you can run summarization end to end, like this:

python test_example.py --article example_article --model_dir model/

Suppose your article is this one

You will get this output: PREDICTION >> The hacking group known as NC29 is largely believed to operate as part of Russia's security services .<n>The three countries allege that it is carrying out a persistent and ongoing cyber campaign to steal intellectual property about a possible coronavirus vaccine .

sumeet-iitg commented 4 years ago

@TheRockXu, thank you for sharing your trained model. Overall, I feel that the summarization is still more extractive than abstractive. Summarizing longer texts hides this fact, but for shorter texts the problem is clear. Here are results from the above fine-tuned cnn_dailymail model on a small input text:

INPUT >> 'A person is standing in front of a lake. The person appears to be John. The person is smiling.'
PREDICTION >> A person is standing in front of a lake . The person appears to be John .
Whereas an abstractive summary should probably look something like: "A person that appears to be John is standing in front of a lake." OR "A person that appears to be John is smiling."

Perhaps @JingqingZ can also provide insights on overcoming such problems.

peterjliu commented 4 years ago

@sumeet-iitg two issues:

TheRockXu commented 4 years ago

Yeah, I also uploaded a model on gigaword; you can get it from here.

It was trained overnight on an Nvidia Quadro P6000. It seems to do abstractive summarization pretty well.

MarvinT commented 4 years ago

@TheRockXu, do you have a script for converting the checkpoints into the .pb model files? Did you use your own fine-tuned models or the checkpoints provided in the Google Cloud link?

TheRockXu commented 4 years ago

@MarvinT Yes, I do. I just uploaded an IPython script here to convert the checkpoints into the .pb model. I didn't use any datasets other than what was given.

chetanambi commented 4 years ago

@TheRockXu Thanks for your Pegasus-demo code. I used it for generating abstractive summaries with gigaword and the results are really nice! However, I can see that the prediction size is limited to 32 characters. Is it possible to increase the prediction size beyond 32 characters? I tried changing the parameters in test_example.py but the prediction size is still limited to 32 characters. Please let me know your thoughts.

TheRockXu commented 4 years ago

@chetanambi I think it is due to the training data. If you want longer abstractive summaries, you can probably add other training data to it.

chetanambi commented 4 years ago

I was going through the authors' paper, and as per the paper the max output limit is 32 for the gigaword dataset. Thanks again for your wonderful work on creating the pegasus-demo code. That was really very helpful.

chetanambi commented 4 years ago

Yeah, I also uploaded a model on gigaword; you can get it from here.

It was trained overnight on an Nvidia Quadro P6000. It seems to do abstractive summarization pretty well.

@TheRockXu Could you please let me know the steps you followed to fine-tune on gigaword? I would like to try summarization results on the reddit dataset.

TheRockXu commented 4 years ago

@chetanambi All I did was decrease the batch size by half. I let it train for two days.

nliu86 commented 4 years ago

@TheRockXu I'm curious why you fine-tuned gigaword. Why didn't you just use the gigaword model fine-tuned by the author?

chetanambi commented 4 years ago

@TheRockXu Yes, I have the same question as @nliu86: why didn't you just use the gigaword model fine-tuned by the author?

TheRockXu commented 4 years ago

@nliu86 Oh, just to avoid memory error.

beni1864 commented 4 years ago

@TheRockXu I am trying to construct an instance of this model to use for drug-related clinical research articles (guessing PubMed is the best choice?). I have little experience with Python and I would appreciate it if you could help me get started, if you have the time. Thanks!

TheRockXu commented 4 years ago

@beni1864 email me at aitroopers@gmail.com

thangarani commented 4 years ago

Hi, if you would like to run the PEGASUS model on the 12 existing datasets [...] You may control the length of the summary in beam search by using length normalization.

@JingqingZ Can you please explain in detail how to change the length of the summary?

JingqingZ commented 4 years ago

Set a different max_output_len or choose a different beam_alpha to encourage longer/shorter summaries.

These params are defined in https://github.com/google-research/pegasus/blob/master/pegasus/params/public_params.py
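For example, when decoding with evaluate.py the overrides can be passed on the command line; the flag layout follows the README, and the specific values here are only illustrative:

python3 pegasus/bin/evaluate.py --params=cnn_dailymail_transformer \
  --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.8,max_output_len=256 \
  --model_dir=ckpt/pegasus_ckpt/cnn_dailymail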

thangarani commented 4 years ago

Yeah, I also uploaded a model on gigaword; you can get it from here.

It was trained overnight on an Nvidia Quadro P6000. It seems to do abstractive summarization pretty well.

Yes, I implemented your pegasus demo on gigaword. It's working great. And the prediction: three countries allege that Cozy Bear is trying to steal vaccine research.

I need to compare it with BERT. One way to compare is to have the same number of words in the prediction. Is it possible to increase the length of the prediction for the gigaword model in pegasus-demo?

TheRockXu commented 4 years ago

@thangarani I doubt it. You'd have to pick a different training dataset to train a new model, like reddit.

JingqingZ commented 4 years ago

Pegasus is supported by HuggingFace now https://huggingface.co/models?filter=pegasus.

beni1864 commented 4 years ago

Hi JingqingZ,

The following error occurs when I press "compute" on each model in the huggingface repository:

⚠️ Can't load config for 'google/pegasus-wikihow'. Make sure that: - 'google/pegasus-wikihow' is a correct model identifier listed on 'https://huggingface.co/models' - or 'google/pegasus-wikihow' is the correct path to a directory containing a config.json file

JingqingZ commented 4 years ago

The following error occurs when I press "compute" on each model in the huggingface repository: ⚠️ Can't load config for 'google/pegasus-wikihow' [...]

Hi, please report the error to the Hugging Face team if the problem persists.

Guru4741 commented 4 years ago

Can the PEGASUS model be used to get a summary from a URL instead of text input? What processing needs to be done on the URL content to feed it to the PEGASUS model? Can anybody help, please?

Patrick-Old commented 4 years ago

Hey, for those who are looking for the simplest way to run PEGASUS for summarization, I highly recommend checking out Hugging Face (as @JingqingZ recommended above). Here is the link. The code is as simple as this, assuming you are trying to run the xsum version of PEGASUS (I recommend checking out some of the others as well):


import torch
# These classes come from the Hugging Face transformers library (pip install transformers).
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest').to(torch_device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
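If you want to nudge the summary length with the Hugging Face version, the standard generate() arguments (e.g. max_length, num_beams, length_penalty) should apply; a minimal, untuned sketch:

translated = model.generate(**batch, max_length=64, num_beams=8, length_penalty=0.8)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
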
bui-thanh-lam commented 3 years ago

I was trying to fine-tune on Colab with batch_size = 2, max_input_length = 512 and it almost ran out of VRAM. Also, it took a very long time (20,000 examples, 2h/epoch, bs = 2). Why does it take so much time and memory? How do you set the params? Thanks all!

sulata2 commented 3 years ago

I got it to make novel predictions by instantiating the estimator object and creating an input_fn. [full code snippet quoted above]

Hi, I tried using your demo code, but I am getting the below error. Could you please help me with this?

!python3 pegasus/bin/train.py --params=aeslc_transformer \
  --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
  --train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
  --model_dir=ckpt/pegasus_ckpt/aeslc

Traceback (most recent call last):
  File "pegasus/bin/train.py", line 17, in <module>
    from pegasus.data import infeed
  File "/usr/local/lib/python3.7/dist-packages/pegasus/__init__.py", line 1, in <module>
    from pegasus.parser import *
  File "/usr/local/lib/python3.7/dist-packages/pegasus/parser.py", line 10, in <module>
    from pegasus.rules import _build_rule, ParseError, Lazy
  File "/usr/local/lib/python3.7/dist-packages/pegasus/rules.py", line 62
    print 'pegasus: {}\x1b[2;38;5;241menter {} -> {}\x1b[m'.format(depth, repr(char()), _name)
          ^
SyntaxError: invalid syntax

christophschuhmann commented 3 years ago

I tried to get it to run on Colab, but it gets a weird error: https://colab.research.google.com/drive/1sMVvIhZExRYJqFBsPhO28InJ68502dJj?usp=sharing

Can anyone here fix it?

pumuckelo commented 3 years ago

This article is also great; it uses Hugging Face and you can try out different models: https://towardsdatascience.com/abstractive-summarization-using-pytorch-f5063e67510

prinky12 commented 2 years ago

@sumeet-iitg
It's a bit late, but I'm leaving a comment in the hope that it will help someone else. You need to change "sentencepiece" to "sentencepiece_newline", e.g.: encoder = public_parsing_ops.create_text_encoder("sentencepiece_newline", _SPM_VOCAB)

jayasridharmireddi commented 9 months ago

@sumeet-iitg First you need to train the cnn_dailymail model by running python3 pegasus/bin/train.py --params=cnn_dailymail_transformer \ --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \ --train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \ --model_dir=ckpt/pegasus_ckpt/cnn_dailymail

Hello, when I do this I am getting a checksum error. Can you please look into this:

raise NonMatchingChecksumError(resource.url, tmp_path)
tensorflow_datasets.core.download.download_manager.NonMatchingChecksumError: Artifact https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ, downloaded to /home/tbvl/tensorflow_datasets/downloads/ucexport_download_id_0BwmD_VLjROrfTHk4NFg2SndKG8BdJPpt2iRo6Dpzz23CByJuAePEilB-pxbcBCHaWDs.tmp.4214ed267e0b4cca80a05b4fd69eaa5c/download, has wrong checksum.
I0224 22:47:57.809536 140247509751552 download_manager.py:273] Skipping extraction for /home/tbvl/tensorflow_datasets/downloads/raw.gith.com_abis_cnn-dail_mast_url_list_apc7knzpshiwmzikwgjbSqZYlq2yGpDviLVIGsnkNgCk.txt (method=NO_EXTRACT).