jeisner / treebank-scripts

Suite of scripts for preprocessing the Penn Treebank, primarily to extract lexical subcategorization frames and dependencies.
MIT License

treebank-scripts

Author: Jason Eisner <jason@cs.jhu.edu>

These scripts are filters that process the Penn Treebank. They can be pipelined together in various combinations. Their main purpose is to extract lexical subcategorization frames or lexical dependencies from the Penn Treebank. In particular, they can mark the head child of a constituent. They also convert empty categories into slashed nonterminals.

Overview

The scripts read and write files in a common format. To convert from the original Penn Treebank to this format, use the oneline script. Some features of the format:

Usage

Each script is documented through initial comments. For an explanation of how some of these scripts are pipelined in a typical case, see section 6.2 ("Data Preparation") of Jason Eisner's Ph.D. thesis.

If you plan to run the scripts from the command line, then you may want to add the script directory to your PATH environment variable. The stamp.inc script needs to be in a directory that is listed in the PERL5LIB or PERLLIB environment variable.
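The environment setup described above might look like the following in a POSIX shell. This is only a sketch: the `TREEBANK_SCRIPTS` variable and the default location `$HOME/treebank-scripts` are assumptions made for illustration, not names used by the scripts themselves, so substitute the directory where you actually unpacked them.

```shell
# Hypothetical install location; change this to wherever the scripts live.
TREEBANK_SCRIPTS="${TREEBANK_SCRIPTS:-$HOME/treebank-scripts}"

# Let the shell find the scripts by name.
export PATH="$TREEBANK_SCRIPTS:$PATH"

# Let Perl find stamp.inc; keep any existing PERL5LIB entries after ours.
export PERL5LIB="$TREEBANK_SCRIPTS${PERL5LIB:+:$PERL5LIB}"
```

Placing these lines in a shell startup file such as `~/.profile` would make the settings persist across sessions.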

Alternatively, you can run the scripts by invoking Perl directly, e.g.,

perl -I/path/to/treebank-scripts /path/to/treebank-scripts/oneline

Documentation Files

Scripts

Data Files

History

These materials are primarily by Jason Eisner <jason@cs.jhu.edu>, with some later improvements by Martin Cmejrek <cmejrek@ufal.ms.mff.cuni.cz>.