syllog1sm / redshift

Transition-based statistical parser


This library is research code, and is in maintenance mode.

For my actively developed, commercially-focussed NLP library, see spaCy.

Redshift is a natural-language syntactic dependency parser. The current release features fast and accurate parsing, but requires the text to be pre-processed. Future releases will integrate tokenisation and part-of-speech tagging, and have special features for parsing informal text.

If you don't know what a syntactic dependency is, read this:

Main features:

Key techniques:

Example usage

Here is an example of how the parser is called from Python, once you have trained a model:


>>> import redshift.parser
>>> from redshift.sentence import Input
>>> parser = redshift.parser.Parser(<model directory>)
>>> sentence = Input.from_untagged(['A', 'list', 'of', 'tokens', 'is', 'required', '.'])
>>> parser.parse(sentence)
>>> print sentence.to_conll()
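Because the parser expects a list of tokens, raw text has to be split beforehand. The following is only a minimal sketch of that pre-processing step: `split_tokens` is a hypothetical helper, not part of redshift, and its regular expression is a rough approximation rather than a real tokeniser.

```python
import re


def split_tokens(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only; not part of redshift.
    """
    # A word (optionally with an apostrophe suffix such as "don't"),
    # or any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


tokens = split_tokens("A list of tokens is required.")
```

The resulting list could then be passed to `Input.from_untagged`, as in the example above.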

The command-line interfaces expose many options that exist for my current research and are probably confusing. The main scripts I use are scripts/, scripts/, and scripts/. All of them print usage information and require the plac library.

From a Unix/OSX terminal, after compilation, and within the "redshift" directory:


$ export PYTHONPATH=`pwd`
$ ./scripts/ # Use -h or --help for more detailed info. Most of these are research flags.
usage: [-h] [-a static] [-i 15] [-k 1] [-f 10] [-r] [-d] [-u] [-n 0] [-s 0] train_loc model_loc
error: too few arguments
$ ./scripts/ -k 16  <CoNLL formatted training data> <output model directory>
$ ./scripts/ <model directory produced by> <input> <output_dir>
$ ./scripts/ output_dir/parses <gold file>
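The last command presumably compares the parses against a gold-standard file. As a rough illustration of what such a comparison involves, here is a sketch that computes unlabelled and labelled attachment scores. It is not redshift's own evaluation script. It assumes the ten-column, tab-separated CoNLL-X layout (word form in column 2, head in column 7, dependency label in column 8), which may differ from the columns redshift reads and writes. `read_conll`, `attachment_scores` and `make_row` are hypothetical names.

```python
def read_conll(text):
    """Parse CoNLL-X style text into sentences of (form, head, label) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split('\t')
        # Assumed columns (0-indexed): 1 = FORM, 6 = HEAD, 7 = DEPREL.
        current.append((cols[1], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(gold, predicted):
    """Return (UAS, LAS) over all tokens of two aligned lists of sentences."""
    total = unlabelled = labelled = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        for (_, g_head, g_label), (_, p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                unlabelled += 1
                # A labelled match needs both the head and the label to agree.
                if g_label == p_label:
                    labelled += 1
    if total == 0:
        return 0.0, 0.0
    return float(unlabelled) / total, float(labelled) / total


def make_row(idx, form, head, label):
    """Build one ten-column row, leaving unused fields as underscores."""
    return '\t'.join([str(idx), form, '_', '_', '_', '_', str(head), label, '_', '_'])


# Toy data: the prediction has both heads right but one label wrong.
GOLD = '\n'.join([make_row(1, 'Dogs', 2, 'nsubj'), make_row(2, 'bark', 0, 'root')]) + '\n'
PRED = '\n'.join([make_row(1, 'Dogs', 2, 'dobj'), make_row(2, 'bark', 0, 'root')]) + '\n'

uas, las = attachment_scores(read_conll(GOLD), read_conll(PRED))
```

On the toy data every head is correct but one of the two labels is wrong, so the two scores differ.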

In more detail:


The following commands will set up a virtualenv with Python 2.7.5, the parser, and its core dependencies from scratch::

$ git clone
$ cd redshift
$ git checkout develop

a) $ virtualenv .env
b) $ ./ # Downloads Python 2.7.5 and virtualenv

$ source .env/bin/activate
$ pip install distribute
$ pip install cython
$ pip install thinc
$ pip install -r requirements.txt
$ export PYTHONPATH=`pwd`:$PYTHONPATH # ...and set PYTHONPATH.
$ fab make test

The script downloads and compiles Python 2.7.5, and uses it to create a virtualenv. This is one way to use a version of Python that isn't system-wide, or to control the compiler that Cython will use. You may not need to do this, or you may wish to do it manually --- it's up to you.

virtualenv is not a requirement, although it's useful. If a virtualenv is not active (i.e. if the $VIRTUAL_ENV environment variable is not set), you'll need to ensure that the file knows where to find the C headers that the murmurhash dependency installs.
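A quick way to see which case applies is to look at the environment variable that virtualenv's activate script sets, `VIRTUAL_ENV`. The sketch below only reports whether a virtualenv is active. `active_virtualenv` is a hypothetical helper, and the sketch makes no assumption about where the murmurhash headers are installed.

```python
import os


def active_virtualenv(environ=None):
    """Return the path of the active virtualenv, or None if none is active."""
    if environ is None:
        environ = os.environ
    # virtualenv's activate script exports VIRTUAL_ENV; treat an empty value as unset.
    return environ.get('VIRTUAL_ENV') or None


if __name__ == '__main__':
    venv = active_virtualenv()
    if venv is None:
        print('No virtualenv active; header locations must be configured by hand.')
    else:
        print('Using virtualenv at %s' % venv)
```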

Installation requires a recent version of pip, which is provided by the version of virtualenv that the script downloads. If you don't use the script, ensure you're using a recent version of pip.


redshift is written almost entirely in Cython, a superset of the Python language that additionally supports calling C/C++ functions and declaring C/C++ types on variables and class attributes. This allows the compiler to generate very efficient C/C++ code from Cython code. Many popular Python packages, such as numpy, scipy and lxml, rely heavily on Cython code.

A Cython source file such as redshift/parser.pyx is compiled into redshift/parser.cpp and redshift/ by the project's file. The module can then be imported by standard Python code, although only the functions visible to Python (declared with "def" or "cpdef", rather than "cdef") will be accessible.

The parser currently requires Cython at install time, rather than distributing the "compiled" .cpp files as part of the release, which goes against Cython's recommendation. This could change in future, but currently it feels strange to have a "source" release that users wouldn't be able to modify.


This software is available for non-commercial use only. You may download, run and modify the code for research purposes, personal interest, education, teaching, etc. My commercial NLP suite is spaCy.


Copyright (C) 2014 Matthew Honnibal