"THE BERKELEY PARSER" release 1.1 migrated from Google Code to GitHub July 2015
This package contains the Berkeley Parser as described in
"Learning Accurate, Compact, and Interpretable Tree Annotation" Slav Petrov, Leon Barrett, Romain Thibaux and Dan Klein in COLING-ACL 2006
and
"Improved Inference for Unlexicalized Parsing" Slav Petrov and Dan Klein in HLT-NAACL 2007
If you use this code in your research and would like to acknowledge it, please refer to one of those publications. Note that the jar-archive also contains all source files. For questions please contact Slav Petrov (petrov@cs.berkeley.edu).
java -jar berkeleyParser.jar -gr
The parser can produce k-best lists and parse in parallel using multiple threads. Several additional options are available (return binarized and/or annotated trees, produce an image of the parse tree, tokenize the input, run in fast/accurate mode, print out tree likelihoods, etc.). Starting the parser without supplying a grammar file prints a list of all options.
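As an illustration, the parser can be driven from a script. The following is a minimal sketch, not part of the package. It assumes the parser's default behavior of reading one tokenized sentence per line from STDIN and writing one tree per line to STDOUT. The helper names build_command and parse_sentences are hypothetical, and GRAMMAR_FILE is a placeholder for whichever grammar you use. Only the -gr option from the command line above is used.

```python
import subprocess


def build_command(jar_path, grammar_path):
    """Return the argument list for the basic parser invocation (-gr only)."""
    return ["java", "-jar", jar_path, "-gr", grammar_path]


def parse_sentences(sentences, jar_path="berkeleyParser.jar",
                    grammar_path="GRAMMAR_FILE"):
    """Send sentences to the parser on STDIN and return its STDOUT lines.

    Assumption: the parser reads one tokenized sentence per line and
    prints one result per line. GRAMMAR_FILE is a placeholder path.
    """
    stdin_text = "\n".join(sentences) + "\n"
    result = subprocess.run(
        build_command(jar_path, grammar_path),
        input=stdin_text,
        capture_output=True,  # collect STDOUT and STDERR
        text=True,            # work with str instead of bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.splitlines()
```

Calling parse_sentences(["The dog barks ."], grammar_path=...) with a real grammar path should return the parser's output lines. Starting one Java process per batch rather than per sentence avoids reloading the grammar each time.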
java -cp berkeleyParser.jar edu.berkeley.nlp.PCFGLA.TreeLabeler -gr
This tool reads in parse trees from STDIN, annotates them as specified and prints them out to STDOUT. You can use
java -cp berkeleyParser.jar edu.berkeley.nlp.PCFGLA.TreeScorer -gr
to compute the (log-)likelihood of a parse tree.
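When post-processing trees read or written by these tools, it can help to load them into a simple data structure. The sketch below is not part of the package. It assumes the trees are in Penn Treebank bracketed form, possibly wrapped in an unlabeled root such as "( (S ...) )". The function names read_tree and leaves are hypothetical.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other
# non-space characters (a label or a word).
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def read_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are plain strings. An unlabeled wrapper node gets the label "".
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = parse()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def leaves(tree):
    """Return the words of the tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words
```

For example, leaves(read_tree("( (S (NP (DT The) (NN dog)) (VP (VBZ barks)) (. .)) )")) recovers the original sentence tokens, which is a quick sanity check that a tree file lines up with its input.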
GRAMMARS
Included are grammars for English, German and Chinese. For parsing English text that is not from the Wall Street Journal, we recommend the English grammar trained with 5 split&merge iterations, as experiments suggest that the grammars trained with 6 split&merge iterations overfit the Wall Street Journal. Because of the coarse-to-fine method used by the parser, there is essentially no difference in parsing time between the different grammars.
LEARNING NEW GRAMMARS
You will need a treebank in order to learn new grammars. The package contains code for reading in some of the standard treebanks. To learn a grammar from the Wall Street Journal section of the Penn Treebank, you can execute
java -cp berkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer -path
To learn a grammar from trees that are contained in a single file use the -treebank option, e.g.:
java -cp berkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer -path
This will read in the WSJ training set and perform 6 iterations of split, merge and smooth. An intermediate grammar file will be written to disk periodically, and you can expect the final grammar to be written to
java -cp berkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester -path
java -cp berkeleyParser.jar edu.berkeley.nlp.PCFGLA.WriteGrammarToTextFile
This will create three text files: outname.grammar and outname.lexicon contain the respective rule scores, and outname.words should be used with the included Perl script to map words to their signatures.