jacky168 / dualist

Automatically exported from code.google.com/p/dualist
Other
0 stars 0 forks source link

DUALIST: Utility for Active Learning with Instances and Semantic Terms

Burr Settles Carnegie Mellon University bsettles@cs.cmu.edu

Version 0.3 March 08, 2012

DUALIST is an interactive machine learning system for building classifiers quickly. It does so by asking "questions" of the user in the form of both data instances (e.g., text documents) and features (e.g., words or phrases). It utilizes active and semi-supervised learning to quickly train a multinomial naive Bayes classifier for this setting.

NOTICE: This is currently "research-grade" code. It is provided AS-IS without any warranties of any kind, expressed or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose and those arising by statute or otherwise in law or from a course of dealing or usage of trade. Whew!

See LICENSE.txt for licensing information. See CHANGELOG.txt for a history of updates.

Citation information and technical details:

B. Settles. Closing the Loop: Fast, Interactive Semi-Supervised Annotation 
With Queries on Features and Instances. In Proceedings of the Conference 
on Empirical Methods in Natural Language Processing (EMNLP), to appear. 
ACL Press, 2011.

PURPOSE & GOAL

The purpose of DUALIST is threefold:

  1. A practical tool to expedite annotation/learning in NLP tasks.

  2. A framework to facilitate research in interactive and multi-modal active learning. This includes enabling actual user experiments with the GUI (as opposed to simulated experiments, which are pervasive in the literature but sometimes practically inconclusive) as well as developing more advanced dual supervision strategies which are fast enough to be interactive, accurate enough to be useful, and perhaps make more appropriate modeling assumptions than the multinomial naive Bayes classifier currently used.

  3. A starting point for more sophisticated interactive learning scenarios that combine multiple "beyond supervised learning" strategies. This ICML workshop is related: https://sites.google.com/site/comblearn/

INTALLATION + RUNNING THE GUI

DUALIST requires Java 1.6 and Python 2.5 to work properly. It ships with most of the dependencies it needs to work, the only exception being the Play! web framework for Java v1.1+, which can be downloaded here:

http://download.playframework.org/releases/play-1.1.zip

Download and install Play! wherever you want on your system (follow the instructions on their website), and make sure that the "play" command is in your $PATH. Once that is done, all you need to do to run DUALIST is:

$ cd <path-to>/dualist
$ dualist gui

This will launch a web server on your machine, which you can access by pointing your favorite browser to:

http://localhost:8080/

And follow the instructions on the screen. DUALIST has only been tested on Mac OS 10.6 and Ubuntu Linux, but it should be platform-independent and work in any unix-like environment (and even Windows). Make sure you don't have any other processes listening on the 8080 port of your machine.

NOTE: DUALIST is written to run on a single computer and loads all data into memory. It is robust for hundreds of thousands of instances and features on modern hardware, but may be difficult to use beyond that.

LOGS AND OUTPUT

DUALIST writes a log of user actions in the "results/" directory. Trained models are archived as learning progresses in the "models/" directory. Web server system output is written to "application.log" in the root directory.

In "Explore" mode, you can click the "predict" button at the bottom of the page at any time to get the current model's label predictions, followed by the set of labeled instances and features/terms (prepended by the '#' character).

USING TRAINED MODELS

Trained models are stored in the "models/" directory. There are two utilities for using these models to apply or evaluate these models on data:

$ dualist classify [model] [documents...]

This takes a model file and any number of either raw-text or ZIP archive files in the appropriate data format (see data file formats section below). DUALIST will then output predictions to STDOUT in a tab-delimted format:

textID  label1  prob1   label2  prob2   ... text-summary

The label predictions are output in rank order, thus column #2 corresponds to the model's most likely prediction, and column #3 is its posterior probability, and so on. The text summary in the final columns is a snippet of the first 150 characters in the instance.

The other utility, for evaluation, is:

$ dualist test [model] [test-set]

This will produce various statistics about the model and data set, as well as the model's accuracy compared to a 10-fold cross-validation baseline using the same test set.

DATA FILE FORMATS

In either explore or experiment mode, DUALIST accepts data sets as a single ZIP file. In "explore" mode, data files can be an arbitrary structure within the archive, it is only required that they be zipped. You define the class labels yourself in the setup for explore mode.

In "experiment" mode, instances must have labels which are defined by subdirectories within the archive. For example, for a classification task with two labels "foo" and "bar," the ZIP archive structure would look like this:

foo/foo-file1.txt
foo/foo-file2.txt
...
bar/bar-file1.txt
bar/bar-file2.txt
...

DUALIST comes with four built-in data processing setups:

Documents: Each document (e.g., foo-file1.txt above) is its own instance. The default feature representation is bag-of-unigrams, lowercased, with stopwords removed.

Simple Lines: Each line of text is an instance (thus the archive can be composed of a single file). The feature representation is the same as "Documents," plus bigrams.

Tweets: The same as "Simple Lines," plus features for "emoticons" :) and twitter-specific semantics (e.g., @username, http://links, and #hashtags).

Entities: One line per instances, in tab-delimited format. The instance name (i.e., a noun phrase) is represented by the first element of the line, and each subsequent element is a contextual feature, represented by "feature||value" (using "||" as a delimiter). Orthographic features (word shape, affixes, etc.) are induced automatically.

CUSTOMIZATION

To create your own data processing pipelines, follow these steps:

1. Familiarize yourself with the "cc.mallet.pipe" package API
(http://mallet.cs.umass.edu/api/)

2. Implement a new pipe in the "dualist.pipes" package of the DUALIST
codebase (use "DocumentPipe.java" as an example).

3. Edit the following files to incorporate the new pipeline into the 
web-based user interface:
    core/src/dualist/tui/Util.java (the "getPipe" method)
    gui/app/views/Applications/experiment.html
    gui/app/views/Applications/explore.html

4. Changes made to the "core/" section of the codebase must be manually 
compiled by typing the "ant" command. You may need to stop and restart the 
GUI in this case.

5. Changes made to the "gui/" section of the codebase are re-compiled on 
the fly by the Play! web framework.

6. For more advanced deployment of the web-based GUI, you will probably 
need to edit the file "gui/app/conf/application.conf". Refer the the Play! 
documentation for more details: 
http://www.playframework.org/documentation/1.1/production

Good luck, and have fun!