Implemented compound splitting by first subclassing WordSplitter into WordSplitterRegex (which keeps the original WordSplitter functionality) and WordSplitterCompound. WordSplitterCompound works similarly to the Moses compound-splitter.perl script: it compares the unigram probability of a word (e.g. "autobahn") with the mean unigram probability of its subwords (e.g. "auto" and "bahn") and keeps whichever scores higher. It also considers fillers between subwords and deletes them where necessary (e.g. "arbeitstier" splits into "arbeit" + (filler=s) + "tier").
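The comparison described above can be sketched roughly as follows. This is an illustration in Python, not the WordSplitterCompound code itself: the toy log-probability table, the UNKNOWN score, the best_split name, and the restriction to two-way splits are all assumptions made for the example, and "mean" is taken here as the arithmetic mean of log probabilities.

```python
# Toy unigram log10 probabilities; hypothetical values for illustration only.
TOY_LOGPROB = {
    "autobahn": -6.0,
    "auto": -3.0,
    "bahn": -3.5,
    "arbeit": -3.0,
    "tier": -3.5,
}
UNKNOWN = -100.0  # assumed stand-in score for strings missing from the table


def logprob(word):
    """Look up a unigram log probability, falling back to UNKNOWN."""
    return TOY_LOGPROB.get(word, UNKNOWN)


def best_split(word, fillers=("es", "s", "e"), min_char=3):
    """Return the best-scoring two-way split, or [word] if no split wins."""
    best_parts, best_score = [word], logprob(word)
    for i in range(min_char, len(word) - min_char + 1):
        left, rest = word[:i], word[i:]
        # Try no filler first, then each configured filler.
        for filler in ("",) + tuple(fillers):
            if not rest.startswith(filler):
                continue
            right = rest[len(filler):]  # the filler itself is dropped
            if len(right) < min_char:
                continue
            # Mean of the subword scores, compared against the current best.
            score = (logprob(left) + logprob(right)) / 2.0
            if score > best_score:
                best_parts, best_score = [left, right], score
    return best_parts
```

With this toy table, best_split("autobahn") yields "auto" and "bahn", and best_split("arbeitstier") yields "arbeit" and "tier" with the filler "s" removed. A word whose parts are all unknown is returned unsplit.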
Example use:
bin/tree-converter -compoundsplit LMfile -compoundsplit_filler "es:s:e" parsed.de > parsed.split.de
The language model (LMfile) provides the unigram statistics and should be trained beforehand on text tokenized the same way as parsed.de. Since we use KenLM, a model of bigram order or higher is assumed, even though the algorithm only looks at unigrams. The fillers are specified as a colon-delimited (:) list.
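As a small illustration of the filler format, a colon-delimited string such as "es:s:e" could be read like this. The parse_fillers helper is hypothetical and written in Python only for the sake of the example; skipping empty entries is an assumption, not documented behavior.

```python
def parse_fillers(spec):
    """Split a colon-delimited filler specification into a list of strings."""
    # Empty entries (e.g. from a trailing colon) are skipped in this sketch.
    return [filler for filler in spec.split(":") if filler]
```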
Additional options for this class are compoundsplit_threshold and compoundsplit_minchar, which determine which words are candidates for splitting. Usually we don't want to consider a word for splitting if its unigram probability is above some high threshold, or if its subwords would be too short. The default values are probably fine. Using the "-debug 1" option will print statistics on the number of words split.
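The candidate filter implied by these two options might look like the sketch below. The is_candidate name and the default values are placeholders chosen for the example, not the tool's actual defaults, and treating the threshold as a log probability is an assumption.

```python
def is_candidate(word, word_logprob, threshold=-3.0, min_char=3):
    """Decide whether a word should be considered for splitting at all."""
    # Two subwords of at least min_char characters each must fit in the word.
    if len(word) < 2 * min_char:
        return False
    # Words that are already frequent (score above the threshold) are left alone.
    return word_logprob <= threshold
```

For example, a rare eight-letter word passes the filter, while a four-letter word or a very frequent word does not.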