Distance parser is a supervised constituency parser based on syntactic distance. This repo is a working implementation of the distance parser that reproduces the results reported in the paper Straight to the Tree: Constituency Parsing with Neural Syntactic Distance, published at ACL 2018. We provide models with the proper configurations for the PTB and CTB datasets, as well as their preprocessing scripts.
PyTorch. We use PyTorch 0.4.0 with Python 3.6.
Stanford POS tagger. We use the full Stanford Tagger, version 3.9.1, build 2018-02-27.
NLTK. We use NLTK 3.2.5.
EVALB. We have integrated a compiled EVALB into our repo. This compiled version is forked from the current latest version of EVALB, which can be accessed through this link.
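To confirm your environment matches the versions above, a quick check (a minimal sketch, not part of the repo) is:

```python
# Print the versions of the interpreter and libraries this repo expects.
import sys
import torch
import nltk

print(sys.version.split()[0])   # expect 3.6.x
print(torch.__version__)        # expect 0.4.0
print(nltk.__version__)         # expect 3.2.5
```

If you want to call the bundled EVALB yourself, it follows the standard EVALB command line (evalb -p <params> <gold> <predicted>). The sketch below makes assumptions about this repo's layout; the binary and parameter-file paths may differ:

```python
# Hypothetical wrapper around the compiled EVALB binary bundled in the repo;
# the paths below are assumptions, not documented by the repo.
import subprocess

def run_evalb(gold_file, pred_file,
              evalb_bin="EVALB/evalb", param_file="EVALB/COLLINS.prm"):
    out = subprocess.run([evalb_bin, "-p", param_file, gold_file, pred_file],
                         stdout=subprocess.PIPE, universal_newlines=True)
    return out.stdout

print(run_evalb("gold_trees.txt", "predicted_trees.txt"))
```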
We use the same preprocessed PTB files as the self-attentive parser repo. The GloVe embeddings are optional; you only need them if you want to run the ablation experiments.
To preprocess PTB, please follow the steps below:
Download the 3 PTB data files from https://github.com/nikitakit/self-attentive-parser/tree/master/data, and put them in the data/ptb folder.
Run the following command to prepare the PTB data:
python datacreate_ptb.py ../data/ptb /path/to/glove.840B.300d.txt
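The GloVe file is plain text with one token followed by 300 space-separated floats per line. Below is a hedged sketch of reading it; how datacreate_ptb.py actually consumes the file may differ:

```python
# Reads glove.840B.300d.txt into a dict; a sketch, not the repo's loader.
import numpy as np

def load_glove(path, dim=300):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # A few 840B tokens contain spaces, so split the vector off the right.
            word = " ".join(parts[:-dim])
            vectors[word] = np.asarray(parts[-dim:], dtype=np.float32)
    return vectors
```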
We use the standard train/valid/test split specified in Liu and Zhang (2017) for our CTB experiments.
To preprocess the CTB, please follow the steps below:
Download and unzip the Chinese Treebank dataset from https://wakespace.lib.wfu.edu/handle/10339/39379
If you don't have any NLTK corpus data yet, download one corpus to initialize your nltk_data folder, for example:
python -c "import nltk; nltk.download('ptb')"
Run the following command to link the dataset to NLTK and generate the train/valid/test split in the repo:
python ctb.py --ctb /path/to/your/ctb8.0/data --output data/ctb_liusplit
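To spot-check the generated split, you can parse the bracketed trees with NLTK. This assumes the split files contain one bracketed tree per line, and the file name below is a guess:

```python
# Assumes one bracketed tree per line; "train.txt" is a hypothetical file name.
from nltk import Tree

with open("data/ctb_liusplit/train.txt", encoding="utf-8") as f:
    first_tree = Tree.fromstring(f.readline())
print(first_tree.pos())  # (word, POS-tag) pairs of the first sentence
```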
Integrate the Stanford Tagger for data preprocessing. Download the Stanford tagger from https://nlp.stanford.edu/software/stanford-postagger-full-2018-02-27.zip and unzip it.
Run the following command to generate the preprocessed files:
python datacreate_ctb.py ../data/ctb_liusplit /path/to/stanford/tagger/
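For reference, NLTK ships a wrapper around the Stanford tagger; a sketch of calling it on Chinese text follows. The model file name is an assumption about the full 2018-02-27 distribution, and datacreate_ctb.py may invoke the tagger differently:

```python
# Hedged example of NLTK's Stanford tagger wrapper; paths are placeholders.
from nltk.tag import StanfordPOSTagger

tagger = StanfordPOSTagger(
    model_filename="/path/to/stanford/tagger/models/chinese-distsim.tagger",
    path_to_jar="/path/to/stanford/tagger/stanford-postagger.jar",
)
print(tagger.tag(["我", "喜欢", "这", "本", "书"]))
```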
To reproduce the PTB results in Table 1 of the paper, run:
cd src
python dp.py --cuda --datapath ../data/ptb --savepath ../ptbresults --epc 200 --lr 0.001 --bthsz 20 --hidsz 1200 --embedsz 400 --window_size 2 --dpout 0.3 --dpoute 0.1 --dpoutr 0.2 --weight_decay 1e-6
To reproduce the CTB results in Table 2 of the paper, run:
cd src
python dp.py --cuda --datapath ../data/ctb_liusplit --savepath ../ctbresults --epc 200 --lr 0.001 --bthsz 20 --hidsz 1200 --embedsz 400 --window_size 2 --dpout 0.4 --dpoute 0.1 --dpoutr 0.1 --weight_decay 1e-6
We provide pre-trained models for convenience. The following steps download the two pre-trained models into your repo:
mkdir results/
cd results/
wget http://lisaweb.iro.umontreal.ca/transfert/lisa/users/linzhou/distance_parser_pretrained_model/ctb.th
wget http://lisaweb.iro.umontreal.ca/transfert/lisa/users/linzhou/distance_parser_pretrained_model/ptb.th
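The .th files should be ordinary PyTorch checkpoints. If you want to inspect one before running the demo, a minimal sketch is below; whether the checkpoint holds a full model or a state dict depends on how dp.py saved it, so check the type first:

```python
# Load a pre-trained checkpoint on CPU and inspect what it contains.
import torch

checkpoint = torch.load("results/ptb.th", map_location="cpu")
print(type(checkpoint))
```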
To re-evaluate the pre-trained models, run:
cd src/
python demo.py --cuda --datapath ../data/ptb/ --filename ptb # this command reproduces the 92.0 F1 score for PTB
python demo.py --cuda --datapath ../data/ctb_liusplit/ --filename ctb # this command reproduces the 86.5 F1 score for CTB
Note that the pre-trained model file has to be in the results folder in order for the demo.py script to load it automatically.