Open niranjanb opened 11 years ago
Made a hack that adjusts the token intervals to better match up with Stanford's tokenization.
Offset adjustments yield about 20% more type assignments on the development set.
Michael: A better fix is to reconstruct the text from OpenNLP's tokens, so that the assigned token offsets match up with the source sentence offsets.
Stanford's NE span can be a subset of a token span, e.g. in "mid-80s'" only "80s" is recognized as a time unit.
Type assignment now reconstructs the text from the tokens, and the reconstructed text is then submitted to Stanford's NER.
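For illustration, a minimal Python sketch of the reconstruction idea. It is not Ollie's actual code: the function name reconstruct and the single-space joining are assumptions. Since the text is built from the tokens themselves, every token's character offsets into that text are known by construction.

```python
from typing import List, Tuple

def reconstruct(tokens: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join tokens with single spaces and record each token's (start, end) span.

    Hypothetical helper: the single-space separator is an assumption.
    """
    offsets = []
    pos = 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # skip the separator space that follows each token
    return " ".join(tokens), offsets

# Example tokens (made up for illustration, not taken from the development set)
tokens = ["He", "left", "in", "the", "mid-80s", "."]
text, offsets = reconstruct(tokens)

# Each recorded span slices the reconstructed text back to its token
for tok, (start, end) in zip(tokens, offsets):
    assert text[start:end] == tok
```

The text handed to the NER is then the same text the offsets were computed against, so special characters in the original sentence can no longer shift them.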
Stanford NE types whose spans cover only part of a token (sub-token spans) are also allowed.
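One way to allow sub-token spans, sketched in Python under assumptions (the helper name tokens_overlapping and the half-open character intervals are hypothetical, not taken from Ollie): treat an NE span as belonging to every token it overlaps, rather than requiring its boundaries to coincide with token boundaries.

```python
from typing import List, Tuple

def tokens_overlapping(offsets: List[Tuple[int, int]], start: int, end: int) -> List[int]:
    """Return indices of tokens whose [s, e) span overlaps the NE span [start, end)."""
    return [i for i, (s, e) in enumerate(offsets) if s < end and start < e]

# Character spans of the tokens in "He left in the mid-80s ." (single-space joined)
offsets = [(0, 2), (3, 7), (8, 10), (11, 14), (15, 22), (23, 24)]

# Assumed NER output: "80s" at characters 19..22, strictly inside the token "mid-80s"
covered = tokens_overlapping(offsets, 19, 22)
```

With an exact-boundary match this span would be dropped, because no token starts at character 19. The overlap test assigns it to token 4 instead.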
Stanford's NE tagger tokenization differs from the OpenNLP tokenization used by Ollie, so the offsets don't line up for sentences that contain special characters (e.g. , $, -, ', '').