# Text Summarization of Congressional Bills

Fall 2017 Capstone project, Columbia University Data Science Institute and Bloomberg
Data are collected from two sources:

1. Run the download task from https://github.com/unitedstates/congress:

   ```
   ./run fdsys --collections=BILLS --congress=114 --store=mods,xml,text --bulkdata=False
   ```

2. Download the `*.zip` archives from https://www.propublica.org/datastore/dataset/congressional-data-bulk-legislation-bills
This part cleans the bill and summary text and filters the data we want.

### `filter_and_prepare_data.ipynb`
1) Load, Deduplicate, Filter and Split data
For each bill we parse `document.xml`; if it does not exist, we just pick the first file. Results are written to `./out/<congress_number>`.

File naming convention: `'BILL' + '_' + row['ID'] + '.out'` and `'SUMMARY' + '_' + row['ID'] + '.out'` (e.g. `BILL_113_HR1_IH.out`).
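The naming convention can be expressed as a small helper (hypothetical code; `row['ID']` is the ID column of the prepared records, as in the text above):

```python
def output_names(row):
    """Return the (bill, summary) output file names for one record."""
    return ("BILL" + "_" + row["ID"] + ".out",
            "SUMMARY" + "_" + row["ID"] + ".out")
```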
| Congress | Count |
| ------ | ------ |
| 115 | 4114 |
| 114 | 10045 |
| 113 | 8903 |
2) Split train, val, test
| Train | Validation | Test | Total |
| ------ | ------ | ------ | ------ |
| 18449 | 2306 | 2307 | 23062 |
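The counts in the table correspond to a roughly 80/10/10 split. A sketch of such a split; the `split_ids` helper and the fixed seed are assumptions, not the notebook's actual code:

```python
import random

def split_ids(ids, seed=42):
    """Shuffle and split IDs into ~80/10/10 train/val/test partitions."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])  # test gets the remainder
```

Applied to the 23062 deduplicated bills, this yields the 18449/2306/2307 partition shown above.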
### `make_datafiles.py`

It processes the data and creates two folders: 1) `file_tokenized`, which we do not use; and 2) `finished_files`, which stores chunked `*.bin` files for the pointer-generator.
- USAGE: `python make_datafiles.py <data_dir> <train_list_dir> <validate_list_dir> <test_list_dir>`
- e.g. `python make_datafiles.py './out/113_114_115' './out/train_113_114_115.txt' './out/validate_113_114_115.txt' './out/test_113_114_115.txt'`
Details can be found in the pointer-generator repository.
Train:

```
python run_summarization.py --mode=train --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/train_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115 --batch_size=32
```
Eval (run concurrently on the validation set):

```
python run_summarization.py --mode=eval --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/val_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115
```
Decode:

1) validation data (run all files)

```
python run_summarization.py --mode=decode --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/val_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115 --single_pass=1
```

2) validation data (produce one attn_vis_data.json file for the attention visualizer)

```
python run_summarization.py --mode=decode --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/val_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115
```

3) test data (run all files)

```
python run_summarization.py --mode=decode --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/test_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115 --single_pass=1
```

4) test data (produce one attn_vis_data.json file for the attention visualizer)

```
python run_summarization.py --mode=decode --data_path="/home/lucy/Workspace/bill-summarization/finished_files/chunked/test_*" --vocab_path="/home/lucy/Workspace/bill-summarization/finished_files/vocab" --log_root="/home/lucy/Workspace/pointer-generator/log" --exp_name=bill-113-114-115
```
To view the attention visualizer output, serve the directory containing `attn_vis_data.json`:

```
python -m SimpleHTTPServer
```

(Python 2 only; with Python 3 use `python -m http.server` instead.)

### `filter_summarize.ipynb`