syllog1sm / redshift

Transition-based statistical parser

Redshift

This library is research code, and is in maintenance mode.

For my actively developed, commercially-focussed NLP library, see http://honnibal.github.io/spaCy/

Redshift is a natural-language syntactic dependency parser. The current release features fast and accurate parsing, but requires the text to be pre-processed. Future releases will integrate tokenisation and part-of-speech tagging, and have special features for parsing informal text.

If you don't know what a syntactic dependency is, read this: http://googleresearch.blogspot.com.au/2013/05/syntactic-ngrams-over-time.html
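
As a quick illustration, a dependency parse can be represented by giving each token the index of its head word and a label for the relation. The analysis below is hand-written for the example sentence used later in this README, with Stanford-style labels chosen as an assumption; it is not output from redshift and the names `heads`, `labels` and `arcs` are hypothetical.

```python
# Illustrative only: a hand-written analysis, not output from redshift.
tokens = ['A', 'list', 'of', 'tokens', 'is', 'required', '.']
# heads[i] is the index of the word that token i attaches to; -1 marks the root.
heads = [1, 5, 1, 2, 5, -1, 5]
# Assumed Stanford-style relation labels, one per token.
labels = ['det', 'nsubjpass', 'prep', 'pobj', 'auxpass', 'root', 'punct']


def arcs(tokens, heads, labels):
    """Return (dependent, label, head) triples; the root word's head is 'ROOT'."""
    triples = []
    for i, word in enumerate(tokens):
        head = 'ROOT' if heads[i] < 0 else tokens[heads[i]]
        triples.append((word, labels[i], head))
    return triples


for dependent, label, head in arcs(tokens, heads, labels):
    print('%s --%s--> %s' % (dependent, label, head))
```

Every token has exactly one head, so the arcs form a tree rooted at the main verb.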

Main features:

Key techniques:

Example usage

Here is an example of how the parser is called from Python, once you have trained a model:

::

>>> import redshift.parser
>>> from redshift.sentence import Input
>>> parser = redshift.parser.Parser(<model directory>)
>>> sentence = Input.from_untagged(['A', 'list', 'of', 'tokens', 'is', 'required', '.'])
>>> parser.parse(sentence)
>>> print sentence.to_conll()

The command-line interfaces have a lot of options, most of which exist for my current research and are probably confusing. The main scripts I use are scripts/train.py, scripts/parse.py, and scripts/evaluate.py. All print usage information, and require the plac library.

From a Unix/OSX terminal, after compilation, and within the "redshift" directory:

::

$ export PYTHONPATH=`pwd`
$ ./scripts/train.py # Use -h or --help for more detailed info. Most of these are research flags.
usage: train.py [-h] [-a static] [-i 15] [-k 1] [-f 10] [-r] [-d] [-u] [-n 0] [-s 0] train_loc model_loc
train.py: error: too few arguments
$ ./scripts/train.py -k 16  <CoNLL formatted training data> <output model directory>
$ ./scripts/parse.py <model directory produced by train.py> <input> <output_dir>
$ ./scripts/evaluate.py output_dir/parses <gold file>
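
The three steps above (train, parse, evaluate) can also be driven from Python. The following is a minimal sketch that only builds the argument lists, mirroring the commands shown above. The function name `pipeline_commands` and all file names in the usage are hypothetical, and the meaning of `-k` is not assumed beyond what the usage line prints.

```python
import os.path


def pipeline_commands(train_loc, model_loc, input_loc, output_dir, gold_loc, k=16):
    """Build the train / parse / evaluate command lines shown above.

    The value of k is passed through to train.py's -k option unchanged.
    """
    train = ['./scripts/train.py', '-k', str(k), train_loc, model_loc]
    parse = ['./scripts/parse.py', model_loc, input_loc, output_dir]
    # evaluate.py is given the "parses" file inside the parser's output directory.
    evaluate = ['./scripts/evaluate.py', os.path.join(output_dir, 'parses'), gold_loc]
    return [train, parse, evaluate]


# Hypothetical file names, for illustration only.
for command in pipeline_commands('train.conll', 'model', 'dev.txt', 'output_dir', 'dev.conll'):
    print(' '.join(command))
```

To actually run the steps, each list can be passed to `subprocess.check_call` from within the "redshift" directory, with PYTHONPATH set as above.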

In more detail:

Installation

The following commands will set up a virtualenv with Python 2.7.5, the parser, and its core dependencies from scratch::

$ git clone https://github.com/syllog1sm/redshift.git
$ cd redshift
$ git checkout develop

**EITHER**
a) $ virtualenv .env
**OR**
b) $ ./make_virtualenv.sh # Downloads Python 2.7.5 and virtualenv

$ source .env/bin/activate
$ pip install distribute
$ pip install cython
$ pip install thinc
$ pip install -r requirements.txt
$ export PYTHONPATH=`pwd`:$PYTHONPATH # ...and set PYTHONPATH.
$ fab make test

The make_virtualenv.sh script downloads and compiles Python 2.7.5, and uses it to create a virtualenv. This is one way to use a version of Python that isn't system-wide, or to control the compiler that Cython will use. You may not need to do this, or you may wish to do it manually --- it's up to you.

virtualenv is not a requirement, although it's useful. If a virtualenv is not active (i.e. if the $VIRTUAL_ENV environment variable is not set), you'll need to ensure that the setup.py file knows where to find the C headers that the murmurhash dependency installs.
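
To check whether a virtualenv is active before building, you can look for the VIRTUAL_ENV variable that virtualenv's activate script exports. This is a small sketch; the helper name `active_virtualenv` is hypothetical and is not part of redshift's setup.py.

```python
import os


def active_virtualenv(environ=None):
    """Return the path of the active virtualenv, or None if none is active."""
    environ = os.environ if environ is None else environ
    # An unset or empty VIRTUAL_ENV both count as "not active".
    return environ.get('VIRTUAL_ENV') or None


path = active_virtualenv()
if path is None:
    print('No virtualenv active: setup.py will need to be told where the headers are.')
else:
    print('Using virtualenv at %s' % path)
```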

Installation requires a recent version of pip, which is provided by the version of virtualenv that the make_virtualenv.sh script downloads. If you don't use the make_virtualenv.sh script, ensure you're using a recent version of pip.

Cython

redshift is written almost entirely in Cython, a superset of the Python language that additionally supports calling C/C++ functions and declaring C/C++ types on variables and class attributes. This allows the compiler to generate very efficient C/C++ code from Cython code. Many popular Python packages, such as numpy, scipy and lxml, rely heavily on Cython code.

A Cython source file such as redshift/parser.pyx is compiled into redshift/parser.cpp and redshift/parser.so by the project's setup.py file. The module can then be imported by standard Python code, although only the Python-visible functions (declared with "def" or "cpdef", rather than "cdef") will be accessible.
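
One way to see what a compiled module exposes to Python is to list its public callables. The sketch below uses the standard library's `json` module as a stand-in so that it runs without redshift installed. Applying it to `redshift.parser` after compilation should work the same way, though that is an assumption rather than something tested here. The helper name `python_visible_callables` is hypothetical.

```python
import inspect
import json


def python_visible_callables(module):
    """Return the sorted names of public callables reachable from Python code."""
    return sorted(
        name
        for name, obj in inspect.getmembers(module, callable)
        if not name.startswith('_')
    )


# json is only a stand-in; substitute a compiled Cython module to inspect it.
print(python_visible_callables(json))
```

Functions declared with "cdef" are not module attributes, so they will not appear in such a listing.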

The parser currently has Cython as a requirement, instead of distributing the "compiled" .cpp files as part of the release (against Cython's recommendation). This could change in future, but currently it feels strange to have a "source" release that users wouldn't be able to modify.

LICENSE

This software is available for non-commercial use only. You may download, run and modify the code for research purposes, personal interest, education, teaching, etc. My commercial NLP suite is spaCy: http://spacy.io

::

Copyright (C) 2014 Matthew Honnibal