kedz / nnsum

An extractive neural network text summarization library for the EMNLP 2018 paper "Content Selection in Deep Learning Models of Summarization" (https://arxiv.org/abs/1810.12343).

Installation

  1. Install PyTorch using pip or conda.
  2. Run:
    git clone https://github.com/kedz/nnsum.git
    cd nnsum
    python setup.py install
  3. Get the data:
    git clone https://github.com/kedz/summarization-datasets.git
    cd summarization-datasets
    python setup.py install

    See README.md in summarization-datasets for details on how to get each dataset from the paper.

Training a Model

All models from the paper can be trained with the same training script, script_bin/train_model.py. The general usage pattern is:

python script_bin/train_model.py \
  --trainer TRAINER_ARGS --emb EMBEDDING_ARGS \
  --enc ENCODER_ARGS --ext EXTRACTOR_ARGS

Every model has a set of word embeddings, a sentence encoder, and a sentence extractor. Each argument section lets you choose an architecture and its options for that component. For the most part, the defaults match the paper's primary evaluation settings. For example, to train the CNN encoder with the Seq2Seq extractor on GPU 0, run the following:

python script_bin/train_model.py \
    --trainer --train-inputs PATH/TO/INPUTS/TRAIN/DIR \
              --train-labels PATH/TO/LABELS/TRAIN/DIR \
              --valid-inputs PATH/TO/INPUTS/VALID/DIR \
              --valid-labels PATH/TO/LABELS/VALID/DIR \
              --valid-refs PATH/TO/HUMAN/REFERENCE/VALID/DIR \
              --weighted \
              --gpu 0 \
              --model PATH/TO/SAVE/MODEL \
              --results PATH/TO/SAVE/VALIDATION/SCORES \
              --seed 12345678 \
    --emb --pretrained-embeddings PATH/TO/200/DIM/GLOVE \
    --enc cnn \
    --ext s2s --bidirectional 
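
As a rough illustration of the sectioned command line above, the sketch below groups tokens under the most recent section flag. It is not the library's actual parser: split_sections and SECTIONS are hypothetical names, and the only thing taken from this README is the set of four section flags.

```python
# Hypothetical sketch: NOT nnsum's real argument parser.
# The four section flags are the ones shown in the usage pattern above.
SECTIONS = ("--trainer", "--emb", "--enc", "--ext")


def split_sections(argv):
    """Group argv tokens under the most recent section flag."""
    groups = {}
    current = None
    for token in argv:
        if token in SECTIONS:
            # A section flag starts a new, initially empty group.
            current = token
            groups[current] = []
        elif current is None:
            raise ValueError(f"argument {token!r} appears before any section flag")
        else:
            groups[current].append(token)
    return groups


argv = [
    "--trainer", "--gpu", "0", "--seed", "12345678",
    "--emb", "--pretrained-embeddings", "PATH/TO/200/DIM/GLOVE",
    "--enc", "cnn",
    "--ext", "s2s", "--bidirectional",
]
groups = split_sections(argv)
print(groups["--enc"])
print(groups["--ext"])
```

Each group could then be handed to a per-component parser, which is one way a single script can expose different options for each encoder or extractor choice.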

Trainer Arguments

These arguments set the data to train on, batch sizes, training epochs, etc.

Embedding Arguments

These arguments set the word embeddings size, the path to pretrained embeddings, whether to fix the embeddings during learning, etc.
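
For orientation, here is a minimal sketch of reading pretrained vectors, assuming the file uses the standard GloVe text layout (one word per line followed by space-separated floats). Whether nnsum expects exactly this layout is an assumption, and read_glove_text is a hypothetical helper, not part of the library.

```python
import io


def read_glove_text(handle, expected_dim=None):
    """Parse GloVe-style text lines into a {word: [float, ...]} dict."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        values = [float(v) for v in parts[1:]]
        if expected_dim is not None and len(values) != expected_dim:
            raise ValueError(f"{word!r} has {len(values)} dims, expected {expected_dim}")
        vectors[word] = values
    return vectors


# Tiny in-memory stand-in for a real embeddings file (assumed format).
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = read_glove_text(sample, expected_dim=3)
print(sorted(vectors))
```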

Encoder Arguments

The encoder arguments select one of three sentence encoder architectures (avg, rnn, or cnn) and set its parameters, e.g. --enc avg or --enc cnn CNN_ARGUMENTS. The sentence encoder takes a sentence, i.e. an arbitrarily long sequence of word embeddings, and encodes it as a fixed-length embedding. Below we describe the options for each architecture.

Averaging Encoder

A sentence embedding is simply the average of the word embeddings.
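
The averaging step can be sketched in a few lines of plain Python. This is an illustration only: the library itself is built on PyTorch and works on padded batches, which this sketch does not handle, and average_encoder is a hypothetical name.

```python
def average_encoder(word_embeddings):
    """Return the element-wise mean of a non-empty list of equal-length vectors."""
    if not word_embeddings:
        raise ValueError("a sentence needs at least one word embedding")
    dim = len(word_embeddings[0])
    totals = [0.0] * dim
    for vector in word_embeddings:
        if len(vector) != dim:
            raise ValueError("all word embeddings must have the same size")
        for i, value in enumerate(vector):
            totals[i] += value
    count = len(word_embeddings)
    return [total / count for total in totals]


# Two words with 3-dimensional embeddings give one 3-dimensional sentence embedding.
sentence = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
sentence_embedding = average_encoder(sentence)
print(sentence_embedding)
```

The output size depends only on the embedding size, not on the number of words, which is what makes the result a fixed-length sentence embedding.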

Extractor Arguments

The extractor arguments select one of four sentence extractor architectures (rnn, s2s, cl, or sr) and set its parameters, e.g. --ext cl or --ext s2s S2S_ARGUMENTS. The sentence extractor takes an arbitrarily long sequence of sentence embeddings and predicts whether each sentence should be included in the summary. Below we describe the options for each architecture.
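
To make the extractor's output concrete, the sketch below turns per-sentence scores into a summary. The selection policy is an assumption, not a description of nnsum's code: sentences are taken in score order until a word budget is reached, restored to document order, and truncated to the budget. select_sentences is a hypothetical name.

```python
def select_sentences(sentences, scores, word_budget):
    """Pick top-scoring sentences until the word budget is reached (assumed policy)."""
    # Rank sentence indices from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen = []
    words_used = 0
    for i in ranked:
        if words_used >= word_budget:
            break
        chosen.append(i)
        words_used += len(sentences[i].split())
    # Restore document order, then truncate to the budget.
    summary_words = " ".join(sentences[i] for i in sorted(chosen)).split()
    return " ".join(summary_words[:word_budget])


sentences = [
    "the cat sat on the mat",
    "stocks fell sharply today",
    "rain is expected tomorrow",
]
scores = [0.2, 0.9, 0.7]  # made-up extractor outputs, for illustration only
summary = select_sentences(sentences, scores, word_budget=6)
print(summary)
```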

RNN Extractor

Sentence embeddings are run through an RNN, and the RNN outputs are fed into a multi-layer perceptron (MLP) to predict whether each sentence is extracted.
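
The data flow can be sketched as a forward pass in plain Python. Everything specific here is an assumption made for brevity: a unidirectional tanh RNN cell, a one-hidden-layer ReLU MLP, a sigmoid output, and hand-picked toy weights. The library's cell type, directionality, and layer sizes may differ, and rnn_mlp_extractor is a hypothetical name.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def rnn_mlp_extractor(sentence_embeddings, params):
    """Return one extraction probability per sentence embedding."""
    hidden = [0.0] * len(params["rnn_bias"])
    probabilities = []
    for embedding in sentence_embeddings:
        # RNN step: h_t = tanh(W_in x_t + W_rec h_{t-1} + b)
        pre = add(add(matvec(params["w_in"], embedding),
                      matvec(params["w_rec"], hidden)),
                  params["rnn_bias"])
        hidden = [math.tanh(v) for v in pre]
        # MLP: one ReLU hidden layer, then a sigmoid over a single logit.
        layer = [max(0.0, v)
                 for v in add(matvec(params["mlp_w1"], hidden), params["mlp_b1"])]
        logit = sum(w * x for w, x in zip(params["mlp_w2"], layer)) + params["mlp_b2"]
        probabilities.append(1.0 / (1.0 + math.exp(-logit)))
    return probabilities


# Toy parameters (made up): 2-dim sentence embeddings, 2 RNN units, 2 MLP units.
params = {
    "w_in": [[0.5, -0.3], [0.1, 0.8]],
    "w_rec": [[0.2, 0.0], [0.0, 0.2]],
    "rnn_bias": [0.0, 0.1],
    "mlp_w1": [[1.0, -1.0], [0.5, 0.5]],
    "mlp_b1": [0.0, 0.0],
    "mlp_w2": [0.7, -0.4],
    "mlp_b2": 0.05,
}
document = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
probabilities = rnn_mlp_extractor(document, params)
print(probabilities)
```

Because the hidden state is carried from one sentence to the next, each prediction depends on the sentences that came before it, not only on the sentence's own embedding.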

SummaRunner Extractor

This is an implementation of the sentence extractive summarizer from: https://arxiv.org/abs/1611.04230

Evaluating a Model

To get a model's ROUGE scores on the train, validation, or test set, use script_bin/eval_model.py.

E.g.:

python script_bin/eval_model.py \
  --inputs PATH/TO/INPUTS/DIR \
  --refs PATH/TO/REFERENCE/DIR \
  --model PATH/TO/MODEL \
  --results PATH/TO/WRITE/RESULTS \
  --summary-length 100 
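
For readers unfamiliar with the metric, here is a simplified sketch of ROUGE-1 recall against a single reference: the fraction of reference unigrams that also appear in the summary, with counts clipped. It is an illustration only. The evaluation script's real scoring (stemming, multiple references, other ROUGE variants) is not shown here and may differ, and rouge1_recall is a hypothetical name.

```python
from collections import Counter


def rouge1_recall(summary, reference):
    """Clipped unigram overlap divided by the number of reference unigrams."""
    summary_counts = Counter(summary.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Each reference word can be matched at most as often as it occurs in the summary.
    overlap = sum(min(count, summary_counts[word])
                  for word, count in reference_counts.items())
    total = sum(reference_counts.values())
    return overlap / total if total else 0.0


score = rouge1_recall("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

Here five of the six reference words are matched ("lay" is missing from the summary), so the recall is 5/6.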

Eval script parameters are described below: