google-research / pegasus

Apache License 2.0

How to generate abstractive summary #63

Closed chetanambi closed 4 years ago

chetanambi commented 4 years ago

Hi - I was able to generate an extractive summary by referring to this link. But I am stuck on how to generate an abstractive summary. I would like to try this on the sample text in the Colab notebook below and then will work on adding a separate dataset.

https://colab.research.google.com/drive/1vRfhz_arrgnmbnLSr3c7YTGJfBH6TvGE?usp=sharing

Please do suggest how to generate an abstractive summary.

JingqingZ commented 4 years ago

Hi, my observation is that the model can generate more abstractive summaries if (1) the downstream dataset is abstractive (like xsum) and (2) the model is fine-tuned sufficiently on the downstream dataset.

chetanambi commented 4 years ago

Hi, I would like to know the steps involved to fine-tune on xsum and then predict on a test sample. I looked into all open & closed issues but could not find any sample code for generating an abstractive summary using Pegasus.

JingqingZ commented 4 years ago

The instructions to run fine-tuning are provided in the README. If you're interested in xsum, you may replace aeslc with xsum in the command.
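For example, starting from the README's aeslc fine-tuning command, the swapped command would look something like this (the `xsum_transformer` params name is an assumption following the repo's `<dataset>_transformer` naming, and the checkpoint paths are the README's defaults):

```
python3 pegasus/bin/train.py --params=xsum_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/xsum
```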

chetanambi commented 4 years ago

I tried to run by changing aeslc to xsum but am running into the below error in Colab. It seems some manual effort is needed here. Could you please check?

ERROR:tensorflow:Error recorded from training_loop: Manual directory /root/tensorflow_datasets/downloads/manual does not exist or is empty. Create it and download/extract dataset artifacts in there. Additional instructions: Detailed download instructions (which require running a custom script) are
here:
https://github.com/EdinburghNLP/XSum/blob/master/XSum-Dataset/README.md#running-the-download-and-extraction-scripts
Afterwards, please put xsum-extracts-from-downloads.tar.gz file in the manual_dir.
E0716 12:52:09.644410 140052921485184 error_handling.py:75] Error recorded from training_loop: Manual directory /root/tensorflow_datasets/downloads/manual does not exist or is empty. Create it and download/extract dataset artifacts in there. Additional instructions: Detailed download instructions (which require running a custom script) are
here:
https://github.com/EdinburghNLP/XSum/blob/master/XSum-Dataset/README.md#running-the-download-and-extraction-scripts
Afterwards, please put xsum-extracts-from-downloads.tar.gz file in the manual_dir.
INFO:tensorflow:training_loop marked as finished
I0716 12:52:09.644734 140052921485184 error_handling.py:101] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0716 12:52:09.644894 140052921485184 error_handling.py:135] Reraising captured error
Traceback (most recent call last):
  File "pegasus/bin/train.py", line 94, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "pegasus/bin/train.py", line 89, in main
    max_steps=train_steps)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
    input_fn, ModeKeys.TRAIN))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1025, in _get_features_and_labels_from_input_fn
    self._call_input_fn(input_fn, mode))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2987, in _call_input_fn
    return input_fn(**kwargs)
  File "/content/pegasus/pegasus/data/infeed.py", line 41, in input_fn
    dataset = all_datasets.get_dataset(input_pattern, training)
  File "/content/pegasus/pegasus/data/all_datasets.py", line 52, in get_dataset
    dataset, _ = builder.build(input_pattern, shuffle_files)
  File "/content/pegasus/pegasus/data/datasets.py", line 200, in build
    dataset, num_examples = self.load(build_name, split, shuffle_files)
  File "/content/pegasus/pegasus/data/datasets.py", line 158, in load
    data_dir=self.data_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/api_utils.py", line 69, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/registered.py", line 371, in load
    dbuilder.download_and_prepare(**download_and_prepare_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/api_utils.py", line 69, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 376, in download_and_prepare
    download_config=download_config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 1019, in _download_and_prepare
    max_examples_per_split=download_config.max_examples_per_split,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 939, in _download_and_prepare
    dl_manager, **split_generators_kwargs):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/summarization/xsum.py", line 101, in _split_generators
    os.path.join(dl_manager.manual_dir, folder_name + ".tar.gz")),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/download/download_manager.py", line 619, in manual_dir
    self._manual_dir, self._manual_dir_instructions))
AssertionError: Manual directory /root/tensorflow_datasets/downloads/manual does not exist or is empty. Create it and download/extract dataset artifacts in there. Additional instructions: Detailed download instructions (which require running a custom script) are
here:
https://github.com/EdinburghNLP/XSum/blob/master/XSum-Dataset/README.md#running-the-download-and-extraction-scripts
Afterwards, please put xsum-extracts-from-downloads.tar.gz file in the manual_dir.
JingqingZ commented 4 years ago

Yes, xsum requires manual download. https://www.tensorflow.org/datasets/catalog/xsum
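Concretely, TFDS expects the archive in its manual directory. A minimal sketch, assuming the default TFDS paths (the expected archive filename is the one named in the error message above):

```shell
# Create the TFDS manual directory (default location).
MANUAL_DIR="${HOME}/tensorflow_datasets/downloads/manual"
mkdir -p "$MANUAL_DIR"
# After running the EdinburghNLP/XSum download/extraction scripts, copy the
# resulting archive in (commented out here because the file must exist first):
# cp xsum-extracts-from-downloads.tar.gz "$MANUAL_DIR/"
ls -la "$MANUAL_DIR"
```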

chetanambi commented 4 years ago

I have manually uploaded the xsum file to the folder /root/tensorflow_datasets/downloads/manual but am now running into the below error. Is there anything else that needs to be taken care of? Note that I am following all the steps mentioned in the README.

I have downloaded xsum from here: http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

W0716 16:39:40.755009 139784552892288 xsum.py:156] 204045 out of 204045 examples are missing.
Shuffling and writing examples to /root/tensorflow_datasets/xsum/1.1.0.incompleteZN2CXH/xsum-train.tfrecord
ERROR:tensorflow:Error recorded from training_loop: No examples were yielded.
E0716 16:39:40.756713 139784552892288 error_handling.py:75] Error recorded from training_loop: No examples were yielded.
INFO:tensorflow:training_loop marked as finished
I0716 16:39:40.756896 139784552892288 error_handling.py:101] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0716 16:39:40.757050 139784552892288 error_handling.py:135] Reraising captured error
Traceback (most recent call last):
  File "pegasus/bin/train.py", line 94, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "pegasus/bin/train.py", line 89, in main
    max_steps=train_steps)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
    input_fn, ModeKeys.TRAIN))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1025, in _get_features_and_labels_from_input_fn
    self._call_input_fn(input_fn, mode))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2987, in _call_input_fn
    return input_fn(**kwargs)
  File "/content/pegasus/pegasus/data/infeed.py", line 41, in input_fn
    dataset = all_datasets.get_dataset(input_pattern, training)
  File "/content/pegasus/pegasus/data/all_datasets.py", line 52, in get_dataset
    dataset, _ = builder.build(input_pattern, shuffle_files)
  File "/content/pegasus/pegasus/data/datasets.py", line 200, in build
    dataset, num_examples = self.load(build_name, split, shuffle_files)
  File "/content/pegasus/pegasus/data/datasets.py", line 158, in load
    data_dir=self.data_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/api_utils.py", line 69, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/registered.py", line 371, in load
    dbuilder.download_and_prepare(**download_and_prepare_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/api_utils.py", line 69, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 376, in download_and_prepare
    download_config=download_config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 1019, in _download_and_prepare
    max_examples_per_split=download_config.max_examples_per_split,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 951, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 1037, in _prepare_split
    shard_lengths, total_size = writer.finalize()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/tfrecords_writer.py", line 213, in finalize
    self._shuffler.bucket_lengths, self._path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/tfrecords_writer.py", line 97, in _get_shard_specs
    shard_boundaries = _get_shard_boundaries(num_examples, num_shards)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/tfrecords_writer.py", line 120, in _get_shard_boundaries
    raise AssertionError("No examples were yielded.")
AssertionError: No examples were yielded.
JingqingZ commented 4 years ago

204045 out of 204045 examples are missing.

Hi, it seems xsum hasn't been successfully stored/downloaded in your environment.
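One quick sanity check, assuming default TFDS paths: note that the earlier TFDS error asks specifically for xsum-extracts-from-downloads.tar.gz (produced by the XSum extraction scripts), which may not be the same file as the raw EMNLP18 archive.

```shell
# Verify the archive TFDS asked for actually exists and is a readable tarball.
MANUAL_DIR="${HOME}/tensorflow_datasets/downloads/manual"
ARCHIVE="${MANUAL_DIR}/xsum-extracts-from-downloads.tar.gz"
if [ -s "$ARCHIVE" ]; then
  # List the first few members to confirm the tarball is not corrupted.
  tar -tzf "$ARCHIVE" | head -n 5
else
  echo "Missing or empty: $ARCHIVE"
fi
```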

chetanambi commented 4 years ago

Hi, xsum seems to be running into a lot of issues, so I am trying another dataset, reddit_tifu. This seems to be working fine when I execute the below 2 commands.

This training command runs without any issue:

!python3 pegasus/bin/train.py --params=reddit_tifu_long_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/reddit_tifu

This command seems to run forever. I can see that it is generating 3 files (inputs, targets & predictions) in the model directory:

!python3 pegasus/bin/evaluate.py --params=reddit_tifu_long_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/reddit_tifu
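While decoding runs, one way to check progress is to look at whichever of the three dump files was updated most recently. A small sketch (the "predictions" filename substring is an assumption based on the three files mentioned above, not a documented naming scheme):

```python
import glob
import os
from typing import Optional

def newest_file(model_dir: str, substring: str) -> Optional[str]:
    """Return the most recently modified file in model_dir whose name contains substring."""
    matches = [p for p in glob.glob(os.path.join(model_dir, "*"))
               if substring in os.path.basename(p)]
    return max(matches, key=os.path.getmtime) if matches else None

def tail(path: str, n: int = 5) -> None:
    """Print the last n lines of a text file."""
    with open(path) as f:
        for line in f.readlines()[-n:]:
            print(line.rstrip())
```

For example, `p = newest_file("ckpt/pegasus_ckpt/reddit_tifu", "predictions")` followed by `tail(p)` would show the latest decoded samples.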

I would like to know what changes are required in order to generate an abstractive summary for custom/sample data. Are any changes needed in public_params.py? Please do suggest.

JingqingZ commented 4 years ago

The testing (decoding) can take hours. You will see the update inside those three files when new samples are processed.

The model fine-tuned on reddit (or xsum) is more abstractive than others. The abstractiveness of generated summaries is largely determined by the abstractiveness of the dataset that the model is fine-tuned on.
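A rough way to quantify abstractiveness is the fraction of summary n-grams that never appear in the source: higher means more novel wording. A small illustrative sketch (whitespace tokenization; not the exact metric used in the paper):

```python
def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    """Fraction of summary n-grams absent from the source (0.0 = fully extractive)."""
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    src, summ = ngrams(source), ngrams(summary)
    if not summ:
        return 0.0
    return len(summ - src) / len(summ)
```

Comparing a reference summary and a copied-out prediction against the same source with this ratio makes the extractive/abstractive difference concrete.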

chetanambi commented 4 years ago

@JingqingZ Really appreciate your quick responses on this thread.

Yes, like you mentioned, it's taking a long time to decode. I followed exactly the same steps mentioned in the README for generating an abstractive summary. Below is one of the samples from the evaluation phase. But this seems to generate an extractive summary instead of an abstractive one. So I would like to know what changes are required in order to generate an abstractive summary. Are any changes needed in public_params.py? Please do suggest.

INPUTS: this fu happened back in the school year of 96-97, when i was in 8th grade and just weeks shy of graduating. the setting was our yearly field day in a local park. i was captain of my team, but couldn't really care less so i ditched some of the events and went to the lavatory to have a quick cigarette with my friend, kate. cig was smoked so we headed back to to activities. hours passed and we were now back at our school for a pep rally in the gym area. kate and i were called to the principals office. at the time i had no idea i was in trouble until i saw sister rose's face. she was pissed. apparently one of the parents witnessed me and kate go into a stall to smoke. we were busted. no amount of denying would let us off the hook. as punishment we would serve an in-school suspension which would require us to clean the convent (catholic school). we were to wear our school gym clothes for this day which were essentially sweats. this is important later. the day arrived for our punishment and we were not thrilled. we were each given a list of things to do and we were split up. lame. i can't remember who saw it first, but one of us discovered a cell phone in all of its zack morris glory and decided it would be a good idea to steal it. the phone was huge and difficult to conceal in our gym clothing, but we had to have it. in our 14 year old brains we though "who would ever know?". we decided to each take a few of the items: phone, charging base, cords, etc. and then we would meet up later and play with our new found gadget. my heart was racing. i was so nervous and excited. later that day, we met up with our share of the goods. the phone had a 4 digit lock on it. dammit. but we were determined. a bunch of nuns couldn't hold us back! we guessed for what felt like hours. we never did unlock that damn phone. kate mentioned to me that she believes people can track the phone to our location. so what did we do next? the most logical thing. 
we throw the phone in the woods and the accessories down "the ditch" which was a sewage ditch overgrown with weeds. here is the "ditch" at the bottom of the image.

TARGETS: got caught smoking in 8th grade, found a cell phone while serving our punishment. the phone down the ditch to hide the evidence.

PREDICTIONS: as punishment we would serve an in-school suspension which would require us to clean the convent (catholic school). i can't remember who saw it first, but one of us discovered a cell phone in all of its zack morris glory and decided it would be a good idea to steal it. the school wanted the phone back or they would be given the dreaded call from school.

JingqingZ commented 4 years ago

Please refer to the answers I made above. You can either (1) run a model which is already fine-tuned on xsum or reddit (the list of hyper-parameters is provided in Appendix C of our paper), or (2) fine-tune on your own dataset.

chetanambi commented 4 years ago

I am closing this issue as I was able to generate an abstractive summary by following the steps mentioned in issue #13.

chetanambi commented 4 years ago

@JingqingZ Just one last question w.r.t. fine-tuning. I believe the models stored in the directories pegasus_ckpt/arxiv, pegasus_ckpt/xsum, pegasus_ckpt/gigaword, etc. are already fine-tuned. So why do we need to fine-tune again in order to generate abstractive summaries?

JingqingZ commented 4 years ago

Yes, those models are fine-tuned on the corresponding datasets. In case you're interested in fine-tuning another model on your own dataset, which may be abstractive and quite different from xsum, gigaword or arxiv, that is also doable.

chetanambi commented 4 years ago

Thanks for the clarification. I am closing this.