neilctwu / YouyakuMan

Extractive summarizer using BertSum as summarization model

YouyakuMan

Introduction

This is a one-touch extractive summarization machine.

It uses BertSum as the summarization model and extracts the top N most important sentences.


Prerequisites

General requirement

pip install torch
pip install transformers
pip install googletrans

Japanese specific requirement


Pretrained Model

English: Here

Japanese: Here

Download the model and place it under the directory checkpoint/en (English) or checkpoint/jp (Japanese).


Example

$ python youyakuman.py -txt_file YOUR_FILE -lang LANG -n 3 --super_long

Note

Since BERT accepts at most 512 tokens as input, this summarizer crops articles longer than 512 tokens.

If the --super_long option is used, the summarizer automatically splits the article into multiple inputs of at most 512 tokens each and summarizes each input separately. The number of extracted sentences may differ slightly from the requested number when --super_long is used.
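
To illustrate the splitting described above, here is a minimal sketch of one way to group sentences into inputs that respect the 512 limit. It is not the code used by youyakuman.py: the function name pack_sentences, the greedy strategy, and the assumption that per-sentence token counts are already known are all hypothetical, chosen only for illustration.

```python
def pack_sentences(sentence_lengths, max_len=512):
    """Greedily group consecutive sentences into inputs of at most max_len tokens.

    sentence_lengths: token count of each sentence, in article order.
    Returns a list of groups, each a list of sentence indices.
    A single sentence longer than max_len ends up alone in its own group
    (hypothetical behavior; it would still need cropping before use).
    """
    groups, current, used = [], [], 0
    for idx, length in enumerate(sentence_lengths):
        # Start a new group when adding this sentence would exceed the limit.
        if current and used + length > max_len:
            groups.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        groups.append(current)
    return groups
```

Because each group would be summarized on its own, the total number of extracted sentences depends on how many groups the article produces, which is one plausible reason the count can differ from the value passed to -n.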


Train Example

$ python youyakumanJPN_train.py -data_folder [training_txt_path] -save_path [model_saving_path] -train_from [pretrained_model_file]
"""
-data_folder : path to the training data folder, structured as shown below:
                training_txt_path
                ├─ article1.pickle
                ├─ article2.pickle
                ..    
"""

Train Data Preparation

Each training file should be a dictionary saved with pickle; specifically, a dictionary with the following contents for one article.

{'body': 'TEXT_BODY', 'summary': 'SUMMARY_1<sep>SUMMARY_2<sep>SUMMARY3'}
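
As a rough sketch of how one such file could be produced, the snippet below builds the dictionary and writes it with Python's standard pickle module. The helper name save_article and its arguments are hypothetical and not part of this repository; only the 'body' and 'summary' keys and the <sep> separator come from the format above.

```python
import pickle
from pathlib import Path


def save_article(body, summaries, out_path):
    """Write one article as a pickled dict in the format described above.

    body: full article text.
    summaries: list of summary sentences, joined with '<sep>'.
    out_path: destination file, e.g. inside the training data folder.
    """
    record = {"body": body, "summary": "<sep>".join(summaries)}
    out_path = Path(out_path)
    # Create the training data folder if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("wb") as f:
        pickle.dump(record, f)
    return record
```

For instance, calling save_article('TEXT_BODY', ['SUMMARY_1', 'SUMMARY_2', 'SUMMARY3'], 'training_txt_path/article1.pickle') would write a file containing the example dictionary shown above.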

Version Log:

2020-08-03 Updated to the transformers package, removed redundancy, revised the model saving format used during training

2020-02-10 Training part added

2019-11-14 Add multiple language support

2019-10-29 Add auto-parse function for long articles as input