sfedia / hunpos

Automatically exported from code.google.com/p/hunpos

refactoring of special tokens #4

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Special tokens (class-based emission probabilities) are an important
feature of hunpos and TnT.

For each of the following regular expressions, hunpos learns a separate tag
distribution from the training corpus. This gives more reliable estimates
for open-class items, such as numbers, that were not seen during training:

^[0-9]+$ 
^[0-9]+\.$      
^[0-9.,:-]+[0-9]+$
^[0-9]+[a-zA-Z]{1,3}$ 
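
As a rough illustration of the training side, here is a minimal Python sketch (not the hunpos OCaml code). The class names, the first-match-wins ordering, and the toy corpus with its tags are assumptions made for the example. Note that the patterns overlap: "1984" matches both the first and the third regexp, so some ordering rule is needed.

```python
import re
from collections import Counter, defaultdict

# The four patterns from this issue, in the order listed. The class names are
# hypothetical labels, not identifiers taken from special_tokens.ml.
SPECIAL_CLASSES = [
    ("digits", re.compile(r"^[0-9]+$")),
    ("digits_dot", re.compile(r"^[0-9]+\.$")),
    ("digits_punct", re.compile(r"^[0-9.,:-]+[0-9]+$")),
    ("digits_suffix", re.compile(r"^[0-9]+[a-zA-Z]{1,3}$")),
]


def special_class(token):
    """Return the name of the first matching class, or None.

    Assumption: the first matching pattern wins. fullmatch is used so that
    a trailing newline cannot slip past `$`.
    """
    for name, pattern in SPECIAL_CLASSES:
        if pattern.fullmatch(token):
            return name
    return None


def learn_class_distributions(tagged_corpus):
    """Count tags per special class over (token, tag) pairs."""
    counts = defaultdict(Counter)
    for token, tag in tagged_corpus:
        name = special_class(token)
        if name is not None:
            counts[name][tag] += 1
    return counts


# Toy training data; tokens and tags are invented for the example.
corpus = [
    ("1984", "CD"), ("42", "CD"), ("3.", "LS"),
    ("12:30", "CD"), ("10th", "JJ"), ("house", "NN"),
]
dists = learn_class_distributions(corpus)
```

With this toy corpus, `dists["digits"]` holds the two `CD` counts from "1984" and "42", while "house" contributes to no class.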

Then, at tagging time, if a word is not found in the lexicon
(numerals are added to the lexicon like all other items), hunpos checks
whether the unseen word matches one of the regexps and, if so, uses the
distribution learned for that regexp to guess the tag.
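
The tagging-time lookup described above could look roughly like the following Python sketch. The function name, the dictionary-based lexicon, and the example counts are all hypothetical. Returning `None` stands in for whatever unknown-word handling the tagger otherwise uses.

```python
import re
from collections import Counter

# Same four patterns as in this issue; class names are hypothetical labels.
PATTERNS = [
    ("digits", re.compile(r"^[0-9]+$")),
    ("digits_dot", re.compile(r"^[0-9]+\.$")),
    ("digits_punct", re.compile(r"^[0-9.,:-]+[0-9]+$")),
    ("digits_suffix", re.compile(r"^[0-9]+[a-zA-Z]{1,3}$")),
]


def guess_tag_counts(word, lexicon, class_dists):
    """Return tag counts for `word`.

    1. Words seen in training (numerals included) come from the lexicon.
    2. Unseen words that match a special-token pattern get the counts
       learned for that pattern's class (assumption: first match wins).
    3. Anything else returns None, so the caller can fall back to its
       usual unknown-word handling.
    """
    if word in lexicon:
        return lexicon[word]
    for name, pattern in PATTERNS:
        if pattern.fullmatch(word) and name in class_dists:
            return class_dists[name]
    return None


# Invented example data: one numeral seen in training, one class learned.
lexicon = {"1984": Counter({"NNP": 1})}
class_dists = {"digits": Counter({"CD": 9, "NNP": 1})}

seen = guess_tag_counts("1984", lexicon, class_dists)    # lexicon entry
unseen = guess_tag_counts("2007", lexicon, class_dists)  # class counts
other = guess_tag_counts("house", lexicon, class_dists)  # no special class
```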

These regexps are currently hardcoded in the special_tokens.ml file. We need
some very fast regexp matching, or something like transducers.
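
One possible direction, sketched here in Python purely to show the logic: the four patterns are simple enough to be recognised by a hand-written character scan with no regexp engine. The function and class names are hypothetical, and first-match-wins ordering is assumed. The same loop structure should carry over to OCaml, but no performance claim is made for this version.

```python
import re
import string

DIGITS = frozenset("0123456789")
NUMERIC_PUNCT = frozenset(".,:-")
LETTERS = frozenset(string.ascii_letters)


def special_class_scan(token):
    """Classify a token without a regexp engine."""
    n = len(token)
    i = 0
    while i < n and token[i] in DIGITS:  # length of the leading digit run
        i += 1
    if i > 0:
        if i == n:
            return "digits"              # ^[0-9]+$
        if i == n - 1 and token[i] == ".":
            return "digits_dot"          # ^[0-9]+\.$
        rest = token[i:]
        if len(rest) <= 3 and all(c in LETTERS for c in rest):
            return "digits_suffix"       # ^[0-9]+[a-zA-Z]{1,3}$
    # ^[0-9.,:-]+[0-9]+$ : at least two characters, all digits or numeric
    # punctuation, with a digit in the last position.
    if (n >= 2 and token[-1] in DIGITS
            and all(c in DIGITS or c in NUMERIC_PUNCT for c in token)):
        return "digits_punct"
    return None


# Regexp-based reference, used only to cross-check the scanner.
REFERENCE = [
    ("digits", re.compile(r"^[0-9]+$")),
    ("digits_dot", re.compile(r"^[0-9]+\.$")),
    ("digits_punct", re.compile(r"^[0-9.,:-]+[0-9]+$")),
    ("digits_suffix", re.compile(r"^[0-9]+[a-zA-Z]{1,3}$")),
]


def special_class_regex(token):
    for name, pattern in REFERENCE:
        if pattern.fullmatch(token):
            return name
    return None


samples = ["", "7", "1984", "3.", "3..", "12:30", "1,000.50", "-5", ".5",
           "5-", "10th", "5kg", "5kilo", "a1", "house", "1a2", "..", "1.a"]
mismatches = [t for t in samples
              if special_class_scan(t) != special_class_regex(t)]
```

On these samples the scanner and the regexp reference agree, so `mismatches` is empty. A transducer or lexer-generator approach would be an alternative if the set of patterns is expected to grow.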

Original issue reported on code.google.com by hala...@gmail.com on 30 Jun 2007 at 11:54