sdoctor7 / bill-summarization

Text Summarization of Congressional Bills: Fall 2017 Capstone project, Columbia University Data Science Institute and Bloomberg


Download Data

Data are collected from two sources:

  1. Run the data collection tools from https://github.com/unitedstates/congress

    • clone the repo
    • run the command `./run fdsys --collections=BILLS --congress=114 --store=mods,xml,text --bulkdata=False`
  2. Download *.zip from https://www.propublica.org/datastore/dataset/congressional-data-bulk-legislation-bills
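
When more than one Congress is needed, the `fdsys` step can be scripted. The sketch below only assembles the command line shown above once per Congress number. `build_fdsys_command`, `download_all`, and `CONGRESSES` are hypothetical names, and the choice of Congresses 113 to 115 is an assumption based on the `113_114_115` data directory used later in this README.

```python
import subprocess

# Assumption: Congresses 113-115, inferred from the `113_114_115` directory name.
CONGRESSES = [113, 114, 115]


def build_fdsys_command(congress):
    """Return the argument list for one `./run fdsys` invocation."""
    return [
        "./run",
        "fdsys",
        "--collections=BILLS",
        f"--congress={congress}",
        "--store=mods,xml,text",
        "--bulkdata=False",
    ]


def download_all(repo_dir, congresses=CONGRESSES, dry_run=True):
    """Collect the download command per Congress; run them unless dry_run."""
    commands = [build_fdsys_command(c) for c in congresses]
    if not dry_run:
        for cmd in commands:
            # repo_dir must be the cloned unitedstates/congress checkout.
            subprocess.run(cmd, cwd=repo_dir, check=True)
    return commands
```

Calling `download_all("path/to/congress", dry_run=False)` would run the three downloads in sequence and stop at the first failure.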

Process and split data

This step cleans the bill and summary text and keeps only the data we want.

It creates two folders: 1) file_tokenized, which we do not use; 2) finished_files, which stores the chunked *.bin files for the pointer-generator.

  - USAGE: `python make_datafiles.py <data_dir> <train_list_dir> <validate_list_dir> <test_list_dir>`
  - e.g. `python make_datafiles.py './out/113_114_115' './out/train_113_114_115.txt' './out/validate_113_114_115.txt' './out/test_113_114_115.txt'`
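
For orientation, the chunking step can be pictured as follows. This is a minimal sketch, not the code in `make_datafiles.py`. It assumes the `*.bin` files use the layout common to pointer-generator data preparation scripts, where each serialized example is preceded by an 8-byte length. The function names, the chunk size of 1000, and the `train_000.bin` numbering pattern are all assumptions; only the `train_`, `val_`, and `test_` prefixes appear in this README.

```python
import os
import struct

CHUNK_SIZE = 1000  # assumed number of examples per chunk file


def read_records(bin_path):
    """Yield the raw bytes of each length-prefixed record in a .bin file."""
    with open(bin_path, "rb") as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # clean end of file
            (rec_len,) = struct.unpack("q", len_bytes)
            yield reader.read(rec_len)


def write_chunks(bin_path, chunks_dir, set_name, chunk_size=CHUNK_SIZE):
    """Split one .bin file into <set_name>_000.bin, <set_name>_001.bin, ..."""
    os.makedirs(chunks_dir, exist_ok=True)
    chunk_paths = []
    writer = None
    for i, record in enumerate(read_records(bin_path)):
        if i % chunk_size == 0:
            # Start a new chunk file every chunk_size records.
            if writer is not None:
                writer.close()
            name = "%s_%03d.bin" % (set_name, len(chunk_paths))
            path = os.path.join(chunks_dir, name)
            chunk_paths.append(path)
            writer = open(path, "wb")
        # Re-emit the record with the same 8-byte length prefix.
        writer.write(struct.pack("q", len(record)))
        writer.write(record)
    if writer is not None:
        writer.close()
    return chunk_paths
```

Under these assumptions, the resulting files match the `chunked/train_*` style glob patterns passed to `--data_path`.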

Extractive Summarizer (sumy)

Abstractive Summarizer (Pointer Generator)

Details can be found at

Train
`python run_summarization.py --mode=train --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/train_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115 --batch_size=32`
Evaluation (run concurrently with training)
`python run_summarization.py --mode=eval --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/val_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115`
Decoding

1) validation data (decode all files)

`python run_summarization.py --mode=decode --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/val_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115 --single_pass=1`

2) validation data (produce one attn_vis_data.json file for the attention visualizer)

`python run_summarization.py --mode=decode --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/val_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115`

3) test data (decode all files)

`python run_summarization.py --mode=decode --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/test_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115 --single_pass=1`

4) test data (produce one attn_vis_data.json file for the attention visualizer)

`python run_summarization.py --mode=decode --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/test_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115`
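
The training, evaluation, and decoding commands above differ only in the mode, the data split, and one or two flags, so they can be generated from one place. The sketch below assembles the same argument lists. `build_command`, `BILL_ROOT`, `PG_ROOT`, and `EXP_NAME` are hypothetical names, and the two root paths are simply the ones that appear in the commands above, so they will need changing on another machine.

```python
import shlex

# Paths copied from the commands above; change them for another machine.
BILL_ROOT = "/home/lucy/Workspace/bill-summarization"
PG_ROOT = "/home/lucy/Workspace/pointer-generator"
EXP_NAME = "bill-113-114-115"


def build_command(mode, split, single_pass=False, batch_size=None):
    """Return the argument list for one run_summarization.py invocation.

    mode:  "train", "eval", or "decode"
    split: "train", "val", or "test" (prefix of the chunked files)
    """
    finished = BILL_ROOT + "/finished_files"
    cmd = [
        "python",
        "run_summarization.py",
        "--mode=" + mode,
        # The glob is passed through literally, as in the quoted commands above.
        "--data_path=%s/chunked/%s_*" % (finished, split),
        "--vocab_path=%s/vocab" % finished,
        "--log_root=%s/log" % PG_ROOT,
        "--exp_name=" + EXP_NAME,
    ]
    if batch_size is not None:
        cmd.append("--batch_size=%d" % batch_size)
    if single_pass:
        cmd.append("--single_pass=1")
    return cmd


if __name__ == "__main__":
    # shlex.join quotes the glob so the shell does not expand it when pasted.
    print(shlex.join(build_command("train", "train", batch_size=32)))
    print(shlex.join(build_command("eval", "val")))
    print(shlex.join(build_command("decode", "test", single_pass=True)))
```

Each returned list can be passed to `subprocess.run(cmd, cwd=...)` from the pointer-generator checkout, which avoids retyping the long paths for every mode.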
Visualize one output

Find Budget-related Bills

Feature

Analysis

User Interface