Java code from the 2008 EMNLP paper "Bayesian Unsupervised Topic Segmentation" by Eisenstein and Barzilay
Copyright (C) 2008 Massachusetts Institute of Technology
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
The directory contents of this distribution are as follows:
This is a Java-based, platform-independent implementation. The class files provided here require Java Runtime Environment (JRE) 6.0 or higher. If you have an older version, you can recompile by running "ant rebuild". The code has been tested to run when compiled for JRE 5.0.
This system contains code and data necessary to reproduce the "textbook" results from the paper. To evaluate a system, type
./eval config/CONFIGFILE
CONFIGFILE indicates the name of the configuration file. A separate configuration file is included for each segmenter, as described above.
The command will evaluate the segmenter on the following files from the textbook dataset: 001.ref, 101.ref, and 201.ref. You can modify the "eval" script to evaluate on other sets of files.
The system will output:
A. The set of all options
B. The location of all data files
C. Information specific to the segmenter.
D. The Pk and WindowDiff for each data file
E. The average Pk and WindowDiff
See the javadoc for edu.mit.nlp.segmenter.mcmc.CuCoSeg.printStatus() for details on the status output of the MCMC segmenter.
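For readers unfamiliar with the two metrics in items D and E, the following is a minimal sketch of how Pk and WindowDiff are commonly defined. It is not the evaluation code shipped in this distribution. The class name SegEvalSketch, its method names, and the per-sentence label representation are all hypothetical, and the window size (half the mean reference segment length) is the usual convention, assumed here rather than taken from the "eval" script.

```java
import java.util.Arrays;

/** Illustrative sketch only; all names here are hypothetical, not part of this distribution. */
final class SegEvalSketch {

    /** Turns segment lengths (in sentences) into one segment id per sentence: 0,0,...,1,1,... */
    static int[] labels(int... segmentLengths) {
        int[] ids = new int[Arrays.stream(segmentLengths).sum()];
        int pos = 0;
        for (int s = 0; s < segmentLengths.length; s++) {
            for (int j = 0; j < segmentLengths[s]; j++) {
                ids[pos++] = s;
            }
        }
        return ids;
    }

    /** Assumed convention: window = half the mean reference segment length, at least 1. */
    static int defaultWindow(int[] ref) {
        int numSegments = ref[ref.length - 1] + 1; // ids are consecutive and start at 0
        return Math.max(1, (int) Math.round(ref.length / (2.0 * numSegments)));
    }

    /** Pk: fraction of windows where ref and hyp disagree on "same segment?" for the two ends. */
    static double pk(int[] ref, int[] hyp, int k) {
        check(ref, hyp, k);
        int errors = 0;
        int windows = ref.length - k;
        for (int i = 0; i < windows; i++) {
            boolean sameInRef = ref[i] == ref[i + k];
            boolean sameInHyp = hyp[i] == hyp[i + k];
            if (sameInRef != sameInHyp) {
                errors++;
            }
        }
        return (double) errors / windows;
    }

    /** WindowDiff: fraction of windows where the number of boundaries inside the window differs. */
    static double windowDiff(int[] ref, int[] hyp, int k) {
        check(ref, hyp, k);
        int errors = 0;
        int windows = ref.length - k;
        for (int i = 0; i < windows; i++) {
            // With consecutive ids, the id difference equals the number of boundaries crossed.
            int refBoundaries = ref[i + k] - ref[i];
            int hypBoundaries = hyp[i + k] - hyp[i];
            if (refBoundaries != hypBoundaries) {
                errors++;
            }
        }
        return (double) errors / windows;
    }

    private static void check(int[] ref, int[] hyp, int k) {
        if (ref.length != hyp.length || k < 1 || k >= ref.length) {
            throw new IllegalArgumentException("need equal lengths and 1 <= k < length");
        }
    }
}
```

With this sketch, a call such as SegEvalSketch.pk(ref, hyp, SegEvalSketch.defaultWindow(ref)) scores one hypothesized segmentation against its reference. Both metrics are error rates, so lower values are better and an exact match scores 0.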
To segment a text, provide it through stdin:
cat filename | ./segment config/CONFIGFILE
For example, the command
cat data/books/clinical/050.ref | ./segment config/dp.config
will run the Bayesian cohesion segmenter with dynamic programming on the text file 050.ref. It will output the index of the last sentence in each topic segment.
The number of segments is read from the file itself. The "segment" script shows how to change the desired number of segments, and how to get debug output. Note that the MCMC cue phrase segmenter was not really intended to be run on individual documents, and may not work well for this purpose.
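Since the segmenter reports only the index of the last sentence in each segment, a caller usually has to regroup the sentences afterwards. The sketch below shows one way to do that. The class SegmentGrouper and its method are hypothetical and not part of this distribution, and the code assumes the indices are 0-based and sorted in ascending order, which this README does not specify; adjust the offset if the actual output is 1-based.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; SegmentGrouper is a hypothetical helper, not part of this distribution. */
final class SegmentGrouper {

    /**
     * Groups sentences into segments, given the index of the last sentence of each segment.
     * ASSUMPTION: indices are 0-based and ascending; the real output format may differ.
     */
    static List<List<String>> group(List<String> sentences, int[] lastIndices) {
        List<List<String>> segments = new ArrayList<>();
        int start = 0;
        for (int last : lastIndices) {
            if (last < start || last >= sentences.size()) {
                throw new IllegalArgumentException("bad boundary index: " + last);
            }
            // subList's upper bound is exclusive, so add 1 to include the last sentence.
            segments.add(new ArrayList<>(sentences.subList(start, last + 1)));
            start = last + 1;
        }
        // Keep any sentences after the final listed boundary as one trailing segment.
        if (start < sentences.size()) {
            segments.add(new ArrayList<>(sentences.subList(start, sentences.size())));
        }
        return segments;
    }
}
```

For a five-sentence document and the boundary indices 1 and 4, this yields two segments: sentences 0 to 1 and sentences 2 to 4.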
The build system uses the Jakarta Ant framework (http://jakarta.apache.org/ant/).
To build on Unix, change to the root directory of the distribution, where the file "build.xml" is located, and enter the command "ant build". Other targets (clean, save or docs) can be given in place of "build".
This code is copyright 2008 by the Massachusetts Institute of Technology.
The hierarchical segmenter from the paper J. Eisenstein, "Hierarchical text segmentation from multi-scale lexical cohesion," in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Boulder, CO, 2009,
is available as a release at https://github.com/jacobeisenstein/bayes-seg/releases/tag/v1.0