abtextsum

Abstractive text summarization based on deep learning and semantic content generalization

This source code has been used in the experimental procedure of the following paper:

Panagiotis Kouris, Georgios Alexandridis, Andreas Stafylopatis. 2019. Abstractive text summarization based on deep learning and semantic content generalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5082-5092.

This paper is available in the Proceedings of the 57th Annual Meeting of the ACL (2019) or directly at https://www.aclweb.org/anthology/P19-1501.


For citation, the following BibTeX entry can be used:

@inproceedings{kouris2019abstractive,
  title     = {Abstractive text summarization based on deep learning and semantic content generalization},
  author    = {Kouris, Panagiotis and Alexandridis, Georgios and Stafylopatis, Andreas},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  month     = jul,
  year      = {2019},
  address   = {Florence, Italy},
  publisher = {Association for Computational Linguistics},
  url       = {https://www.aclweb.org/anthology/P19-1501},
  pages     = {5082--5092},
}



Code Description

The code described below follows the methodology and the assumptions that are described in detail in the aforementioned paper. The experimental procedure, as described in the paper, requires the Gigaword dataset, as described by Rush et al. (2015) (see the references in the paper), as the initial dataset for training, validation and testing. The DUC 2004 dataset is additionally used for testing, as also described in the paper.
According to the paper, the initial dataset is further preprocessed and generalized according to one of the proposed text generalization strategies (e.g. NEG100 or LG200d5); for example, named entities may be replaced by their semantic type and infrequent words by more general WordNet concepts. The generalized dataset is then used for training, where the deep learning model learns to predict a generalized summary.
In the testing phase, a generalized article (e.g. an article of the test set) is given as input to the deep learning model, which predicts the respective generalized summary. Then, in the post-processing phase, the generalized concepts of the generalized summary are replaced by the specific concepts of the original (preprocessed) article, producing the final summary.

The workflow of this framework is as follows:

  1. Preprocessing of the dataset
    The preprocessing of the dataset is performed by the DataPreprocessing class (file preprocessing.py). The method _cleandataset() is used for preprocessing the Gigaword dataset, while the method _clean_duc_dataset_from_original_tocleaned() is used for the DUC dataset. A minimal usage sketch follows.
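    The sketch below assumes that the class can be instantiated without arguments and that the cleaning methods read their file paths from paths.py; these are assumptions, so check preprocessing.py for the actual interface.

    ```python
    # Hypothetical usage sketch -- the constructor and method signatures are
    # assumptions; check preprocessing.py for the actual interface.
    from preprocessing import DataPreprocessing

    dp = DataPreprocessing()                          # paths assumed to come from paths.py
    dp._cleandataset()                                # preprocess the Gigaword dataset
    dp._clean_duc_dataset_from_original_tocleaned()   # preprocess the DUC 2004 dataset
    ```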

  2. Text generalization
    Both text generalization tasks, NEG and LG, are performed by the DataPreprocessing class (file preprocessing.py).
    Firstly, part-of-speech tagging is required, which is performed by the _pos_tagging_of_dataset_and_vocabulary_of_words_posfrequent() method for the Gigaword dataset and the _pos_tagging_of_duc_dataset_and_vocab_posfrequent() method for the DUC dataset. Then the NEG and LG strategies can be applied as follows:

    1. NEG Strategy
      The annotation of named entities is performed by the _ner_of_dataset_and_vocabulary_of_nerwords() method for the Gigaword dataset and the _ner_of_duc_dataset_and_vocab_ofne() method for the DUC dataset. Then, with the parameters set accordingly, the methods _conver_dataset_with_ner_from_stanford_andwordnet() (Gigaword) and _conver_duc_dataset_with_ner_from_stanford_andwordnet() (DUC) generalize these datasets according to the NEG strategy. A toy illustration of the idea follows.
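      The sketch below only illustrates the NEG idea of replacing named entities with their semantic type; it uses NLTK as a stand-in (the repository relies on the Stanford NER tagger), so the tag set and tokenization differ from those produced by the actual code.

      ```python
      # Toy illustration of the NEG idea: replace named entities with their type.
      # NLTK is used as a stand-in for the Stanford NER tagger (requires the punkt,
      # averaged_perceptron_tagger, maxent_ne_chunker and words data packages).
      import nltk

      def neg_generalize(sentence):
          tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
          tokens = []
          for node in tree:
              if isinstance(node, nltk.Tree):          # a recognized named entity
                  tokens.append(node.label().lower())  # e.g. "person", "gpe", "organization"
              else:
                  tokens.append(node[0])
          return " ".join(tokens)

      print(neg_generalize("Barack Obama visited Paris on Monday"))
      # e.g. -> "person visited gpe on Monday"
      ```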

    2. LG Strategy
      The _word_freq_hypernympaths() method produces a file that contains a vocabulary with the frequency and the hypernym path of each word. This file is then used by the _vocab_based_onhypernyms() method to produce a file that contains a vocabulary with the words that are candidates for generalization. Finally, for the Gigaword dataset, the _convert_dataset_togeneral() method produces the files with summary-article pairs which constitute the generalized dataset, while for the DUC dataset the _convert_duc_dataset_based_on_level_ofgeneralizetion() method is used. The hyperparameters of these methods should be set accordingly. A toy illustration of the idea follows.
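      As a rough illustration of the LG idea only (not of the repository's implementation), the sketch below walks the WordNet hypernym path of a word and replaces the word with a more general ancestor; the actual frequency threshold and generalization level (e.g. the "d5" in lg100d5) are hyperparameters of the methods above.

      ```python
      # Toy illustration of the LG idea: generalize a word to an ancestor on its
      # WordNet hypernym path (requires the NLTK 'wordnet' corpus).
      from nltk.corpus import wordnet as wn

      def hypernym_path(word):
          synset = wn.synsets(word, pos=wn.NOUN)[0]              # first noun sense only
          return [s.name() for s in synset.hypernym_paths()[0]]  # root -> ... -> word

      def generalize(word, level):
          path = hypernym_path(word)
          return path[min(level, len(path) - 1)].split(".")[0]

      print(hypernym_path("dog"))  # ['entity.n.01', 'physical_entity.n.01', ..., 'dog.n.01']
      print(generalize("dog", 5))  # e.g. 'organism'
      ```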

  3. Building the dataset for training, validation and testing
    The BuildDataset class (file build_dataset.py) creates the files which are given as input to the deep learning model for training, validation or testing.
    To build the dataset, the appropriate file paths should be set in the __init__() of the BuildDataset class, and then the following commands can be executed, where the argument -model specifies the employed generalization strategy (e.g. lg100d5, neg100, etc.):

    1. Building the training dataset: python build_dataset.py -mode train -model lg100d5g
    2. Building the validation dataset: python build_dataset.py -mode validation -model lg100d5g
    3. Building the testing dataset: python build_dataset.py -mode test -model lg100d5g
  4. Training
    The process of training is performed by the Train class (file _trainv2.py), with the hyperparameters set accordingly. The files produced in the previous step (building the dataset) are used as input in this phase. Training is started with the command python train.py -model neg100, where the argument -model specifies the employed generalization strategy (e.g. lg100d5, neg100, etc.).

  5. Post-processing of generalized summaries
    In the testing phase, post-processing of the generalized summaries produced by the deep learning model is required in order to replace the generalized concepts of each generalized summary with the specific ones from the corresponding original article. This task is performed by the PostProcessing class, with the parameters in its __init__() method set accordingly. More specifically, the mode should be set to "lg" or "neg" according to the employed text generalization strategy. Also, the parameters of the _negpostprocessing() and _lgpostprocessing() methods for the file paths, the text similarity function and the context window should be set accordingly. A toy sketch of the idea follows.
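    The sketch below is only a toy rendering of this idea, not the repository's implementation: for a generalized token in the predicted summary, it picks the candidate word of the original article whose surrounding context is most similar, here measured by cosine similarity of averaged word embeddings over a fixed context window; the actual similarity function, window size and candidate selection are configured in the methods above.

    ```python
    # Toy sketch of the post-processing idea: replace a generalized token in the
    # predicted summary with the article word whose context is most similar.
    # The embeddings, similarity function and window size are placeholders for
    # the ones configured in _negpostprocessing() / _lgpostprocessing().
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def context_vector(tokens, i, window, emb):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors = [emb[t] for t in context if t in emb]
        return np.mean(vectors, axis=0) if vectors else np.zeros(len(next(iter(emb.values()))))

    def replace_generalized(summary, article, general_token, candidates, emb, window=2):
        i = summary.index(general_token)              # position of the generalized concept
        target = context_vector(summary, i, window, emb)
        best = max(candidates, key=lambda c: cosine(
            context_vector(article, article.index(c), window, emb), target))
        return summary[:i] + [best] + summary[i + 1:]
    ```

    Here summary and article are token lists, and candidates would be, for instance, the words of the article that were mapped to the same generalized concept.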

  6. Testing
    The Testing class (file testing.py) performs the testing of this framework. For the Gigaword dataset, a subset of its test set (e.g. 4000 instances) should be used to evaluate the framework, while for the DUC dataset the whole set of instances is used. The Testing class requires the official ROUGE package for measuring the performance of the proposed framework (a generic example of invoking it follows the list of modes below).
    To perform testing, the appropriate file paths should be set in the __init__() of the Testing class and one of the following modes can be run:

    1. Testing for Gigaword: python testing.py -mode gigaword
    2. Testing for DUC: python testing.py -mode duc
    3. Testing for DUC capped to 75 bytes: python testing.py -mode duc75b
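    One common way to drive the official ROUGE 1.5.5 script from Python is through the pyrouge wrapper; the sketch below is a generic example with hypothetical directory names, since the Testing class may invoke ROUGE differently.

    ```python
    # Generic sketch of calling the official ROUGE package via the pyrouge wrapper.
    # Directory names are hypothetical; the Testing class may invoke ROUGE differently.
    from pyrouge import Rouge155

    r = Rouge155()
    r.system_dir = "output/system_summaries"      # generated (post-processed) summaries
    r.model_dir = "output/reference_summaries"    # reference summaries
    r.system_filename_pattern = r"summary.(\d+).txt"
    r.model_filename_pattern = "summary.#ID#.txt"

    output = r.convert_and_evaluate()
    print(output)                      # full ROUGE report
    print(r.output_to_dict(output))    # the same scores as a dictionary
    ```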

Setting parameters and paths
The values of the hyperparameters should be specified in the file parameters.py, while the paths of the corresponding files should be set in the file paths.py.
Additionally, a file with word embeddings (e.g. word2vec) is required; its file path and the dimension of the vectors (e.g. 300) should be specified in the files paths.py and parameters.py, respectively. An example of loading such a file is sketched below.
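The sketch below is only an illustration: gensim and the embedding file name are assumptions, so use whatever embedding file is configured in paths.py.

```python
# Sketch: loading pretrained word2vec vectors with gensim (an assumption) and
# checking that their dimension matches the value set in parameters.py.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(vectors.vector_size)   # should equal the embedding dimension in parameters.py (e.g. 300)
```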

The project was developed in Python 3.5, and the required Python packages are listed in the file requirements.txt.

The code described above includes the functionality that was used in the experimental procedure of the corresponding paper. However, the proposed framework is not limited to the current implementation: it is based on a well-defined theoretical model, so its performance may be enhanced by extending or improving this implementation (e.g. using a better taxonomy of concepts, a different machine learning model, or an alternative similarity method for the post-processing task).