neubig / travatar

This is a repository for the Travatar forest-to-string translation decoder
GNU Lesser General Public License v3.0

Compound splitting #2

Closed kevinduh closed 11 years ago

kevinduh commented 11 years ago

Implemented compound splitting by subclassing WordSplitter into WordSplitterRegex (which keeps the original WordSplitter functionality) and WordSplitterCompound. WordSplitterCompound works similarly to Moses compound-splitter.perl: it compares the unigram probability of the whole word (e.g. "autobahn") with the mean unigram probability of its subwords (e.g. "auto" and "bahn") and keeps whichever scores higher. It also considers filler letters between subwords and deletes them where necessary (e.g. "arbeitstier" splits into "arbeit" + (filler=s) + "tier").
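To make the comparison concrete, here is a minimal Python sketch of the idea, not the actual WordSplitterCompound code. It makes several assumptions: the unigram table and its log-probability values are invented, the "mean" is taken as the average of log probabilities (i.e. a geometric mean), only two-part splits are tried, and the names best_split, logprob and OOV_LOGPROB are hypothetical.

```python
# Toy unigram log10 probabilities (hypothetical values, not from a real LM).
UNIGRAM_LOGPROB = {
    "autobahn": -6.0,
    "auto": -3.5,
    "bahn": -4.0,
    "arbeitstier": -7.5,
    "arbeit": -3.0,
    "tier": -3.8,
}
OOV_LOGPROB = -10.0  # assumed score for words missing from the table


def logprob(word):
    """Look up a unigram log probability, falling back to the OOV score."""
    return UNIGRAM_LOGPROB.get(word, OOV_LOGPROB)


def best_split(word, fillers=("es", "s", "e"), min_char=3):
    """Return the best-scoring two-part split, or [word] if no split wins."""
    best_parts, best_score = [word], logprob(word)
    for i in range(min_char, len(word) - min_char + 1):
        left, right = word[:i], word[i:]
        # Try the left part as-is, and with each trailing filler removed.
        candidates = [left]
        for filler in fillers:
            if left.endswith(filler) and len(left) - len(filler) >= min_char:
                candidates.append(left[: -len(filler)])
        for cand in candidates:
            # Mean of log probabilities (assumption: geometric mean).
            score = (logprob(cand) + logprob(right)) / 2.0
            if score > best_score:
                best_parts, best_score = [cand, right], score
    return best_parts


print(best_split("autobahn"))
print(best_split("arbeitstier"))
```

With the toy table above, "autobahn" is split into "auto" and "bahn", and "arbeitstier" into "arbeit" and "tier" with the filler "s" dropped; a word with no better-scoring split is returned unchanged.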

Example use: bin/tree-converter -compoundsplit LMfile -compoundsplit_filler "es:s:e" parsed.de > parsed.split.de

The language model (LMfile) provides the unigram statistics and should be trained beforehand on text tokenized the same way as parsed.de. Since the model is loaded with KenLM, a bigram or higher-order model is assumed, even though the algorithm only looks at unigrams. The fillers are specified as a colon-delimited (:) list, e.g. "es:s:e".

Additional options for this class are compoundsplit_threshold and compoundsplit_minchar, which determine which words are candidates for splitting. Usually we don't want to consider a word for splitting if its unigram probability is above some high threshold, or if its subwords would be too short. The default values are probably fine. Using the "-debug 1" option will generate statistics on the number of words split.
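The candidate check described above could look roughly like the following Python sketch. This is an illustration under assumptions, not the code in this commit: the function name is_split_candidate is hypothetical, the threshold is assumed to be a log probability, and the example values are made up rather than the real defaults.

```python
def is_split_candidate(word, word_logprob, threshold, min_char):
    """Decide whether a word should be considered for splitting at all."""
    # Frequent words (probability above the threshold) are left alone.
    if word_logprob > threshold:
        return False
    # The word must be long enough to hold two subwords of min_char each.
    return len(word) >= 2 * min_char


# Hypothetical settings: threshold as a log10 probability, min_char in letters.
print(is_split_candidate("autobahn", -6.0, threshold=-4.0, min_char=3))
print(is_split_candidate("auto", -3.5, threshold=-4.0, min_char=3))
```

Filtering first keeps common words intact and avoids scoring splits whose parts would be shorter than the minimum length.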

neubig commented 11 years ago

Great commit!!! Thanks.