nttcslab-nlp / Top-Down-RST-Parser

This repository is the implementation of "Top-down RST Parsing Utilizing Granularity Levels in Documents" published at AAAI 2020.

Top-Down RST Parser

Requirements

Python 3.6 or newer
libraries:

Usage

We use trees.py in our code. Please put it in src/dataset/.

Preprocess

Before running the script, set the path to the dataset preprocessed by Heilman's code in script/preprocess.sh.

bash script/preprocess.sh

Training

Train the model 5 times for each of D2E, D2P, D2S, P2S, P2E and S2E (D, P, S and E stand for document, paragraph, sentence and EDU). To select a GPU device, set the environment variable CUDA_VISIBLE_DEVICES.

bash script/training.sh
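
If you prefer to launch training from Python, for example to pin a run to one GPU, a minimal sketch could look like the following. The helpers build_env and run_training are hypothetical and not part of this repository; the sketch only assumes that script/training.sh is invoked from the repository root.

```python
import os
import subprocess


def build_env(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    gpu_ids is a list of device indices, e.g. [0] or [1, 2].
    base_env defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_training(gpu_ids, script="script/training.sh"):
    """Run the training script with only the given GPUs visible (hypothetical helper)."""
    return subprocess.run(["bash", script], env=build_env(gpu_ids), check=True)
```

Calling run_training([0]) would then run the script with only device 0 visible to CUDA.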

Evaluating

Evaluate on the test set for D2E, D2S2E and D2P2S2E, using an ensemble of the 5 trained models.

bash script/evaluate.sh

Data format

We use the RST-DT dataset preprocessed by Heilman's code. After our preprocessing, each record takes the following jsonl format. Sample files produced by our preprocessing are in data/sample/.

"doc_id": "wsj_****"
"rst_tree": "(ROOT (nucleus:Span (text 0) (satellite:Elaboration (text 1))))"
"labelled_attachment_tree": "(nucleus-satellite:Elaboration (text 0) (text 1))"
"tokenized_strings": ["first sentence corresponding to text 1 .", "and this is second sentence ."]
"raw_tokenized_strings": ["first", "sentence", "corresponding", "to", "text", "1", ".", "and", "this", "is", "second", "sentence", "."]
"starts_sentence": [true, true]
"starts_paragraph": [true, false]
"parent_label": null
"granularity_type": "D2E"
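
As a rough illustration of how a record in this format might be consumed, the sketch below loads one JSON line, parses the bracketed labelled_attachment_tree string, and looks up the EDU text for each leaf. The helpers parse_tree and leaf_indices are hypothetical and not part of this repository. The record is assembled from the sample fields above, assuming each line of a jsonl file is one JSON object with these keys.

```python
import json

# One record written as a single JSON line. Values are copied from the sample
# fields above; the elided doc_id "wsj_****" is kept as-is.
SAMPLE_LINE = json.dumps({
    "doc_id": "wsj_****",
    "labelled_attachment_tree": "(nucleus-satellite:Elaboration (text 0) (text 1))",
    "tokenized_strings": [
        "first sentence corresponding to text 1 .",
        "and this is second sentence .",
    ],
    "starts_sentence": [True, True],
    "starts_paragraph": [True, False],
    "parent_label": None,
    "granularity_type": "D2E",
})


def parse_tree(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    A leaf such as "(text 0)" becomes ("text", [0]).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                # Bare tokens inside a node are treated as integer EDU indices.
                children.append(int(tokens[pos]))
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaf_indices(tree):
    """Collect the EDU indices of all "text" leaves, left to right."""
    label, children = tree
    if label == "text":
        return list(children)
    indices = []
    for child in children:
        indices.extend(leaf_indices(child))
    return indices


record = json.loads(SAMPLE_LINE)
tree = parse_tree(record["labelled_attachment_tree"])
edus = [record["tokenized_strings"][i] for i in leaf_indices(tree)]
```

For a real file, the same steps would be applied to each line read from one of the files in data/sample/.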

Reference

@inproceedings{Kobayashi2020TopDownRP,
  title={Top-Down RST Parsing Utilizing Granularity Levels in Documents},
  author={Naoki Kobayashi and Tsutomu Hirao and Hidetaka Kamigaito and Manabu Okumura and Masaaki Nagata},
  booktitle={Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI)},
  month={sep},
  year={2020},
  pages={8099--8106}
}

LICENSE

This software is released under the NTT License, see LICENSE.txt.

Under the terms of the license, pull requests are not allowed. Please feel free to open issues instead.