Special tokens (class-based emission probabilities) are an important feature
of hunpos and TnT.
For each of the following regular expressions, hunpos learns a separate tag
distribution from the training corpus. This gives more reliable estimates for
open-class items, such as numbers, that were unseen during training:
^[0-9]+$
^[0-9]+\.$
^[0-9.,:-]+[0-9]+$
^[0-9]+[a-zA-Z]{1,3}$
At tagging time, if a word is not found in the lexicon (numerals are added
to the lexicon like all other items), hunpos checks whether the unseen word
matches one of the regexps and, if so, uses the tag distribution learned for
that regexp to guess the tag.
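
The behaviour described above can be sketched roughly as follows. This is an illustrative Python sketch, not the OCaml code in special_tokens.ml. The class names, the function names, the first-match-wins order, and the plain relative-frequency estimate are all assumptions made for the example; only the four patterns come from the list above.

```python
import re
from collections import Counter, defaultdict

# Patterns copied from the list above. The class names are hypothetical labels.
SPECIAL_TOKENS = [
    ("digits", re.compile(r"^[0-9]+$")),
    ("digits_dot", re.compile(r"^[0-9]+\.$")),
    ("digits_punct", re.compile(r"^[0-9.,:-]+[0-9]+$")),
    ("digits_suffix", re.compile(r"^[0-9]+[a-zA-Z]{1,3}$")),
]


def token_class(word):
    """Return the name of the first pattern matching word, or None.

    Assumption: the first matching pattern wins.
    """
    for name, pattern in SPECIAL_TOKENS:
        if pattern.fullmatch(word):
            return name
    return None


def train(tagged_corpus):
    """Count tags per word (the lexicon) and per special-token class."""
    lexicon = defaultdict(Counter)
    class_tags = defaultdict(Counter)
    for word, tag in tagged_corpus:
        # Numerals go into the lexicon like every other item.
        lexicon[word][tag] += 1
        cls = token_class(word)
        if cls is not None:
            class_tags[cls][tag] += 1
    return lexicon, class_tags


def tag_distribution(word, lexicon, class_tags):
    """Return relative tag frequencies for word, or None if nothing applies.

    A word seen in training uses its own counts. An unseen word falls back
    to the counts of the special-token class it matches.
    """
    if word in lexicon:
        counts = lexicon[word]
    else:
        cls = token_class(word)
        if cls is None or cls not in class_tags:
            return None
        counts = class_tags[cls]
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


# Toy corpus with made-up tags, for illustration only.
corpus = [("1984", "CD"), ("12", "CD"), ("3.", "OD"), ("dog", "NN")]
lexicon, class_tags = train(corpus)
print(tag_distribution("2007", lexicon, class_tags))  # {'CD': 1.0}
print(tag_distribution("cat", lexicon, class_tags))   # None
```

Here "2007" never occurs in the toy corpus, so its tag is guessed from the other all-digit tokens, while "cat" matches no pattern and would be left to whatever other unknown-word handling the tagger has.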
Currently these regexps are hardcoded in the special_tokens.ml file. We need
some very fast regexp matching, or something like transducers.
Original issue reported on code.google.com by hala...@gmail.com on 30 Jun 2007 at 11:54