knowitall / relgrams

Relgrams -- Tool for computing relational co-occurrences.

Named entity class assignment fails due to tokenization differences. #7

Open niranjanb opened 11 years ago

niranjanb commented 11 years ago

Stanford's NE tagger tokenization differs from the OpenNLP tokenization used by Ollie. The token offsets don't line up for sentences that contain special characters (e.g. $, -, ', '').

niranjanb commented 11 years ago

Made a hack that adjusts the token intervals to better match up with Stanford's tokenization.

Offset adjustments yield about 20% more type assignments on the development set.
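
For illustration, one way to line up two tokenizations of the same sentence is to compare character offsets with whitespace ignored. This is a minimal Python sketch, not the hack referred to above. The function names `char_spans` and `align` and the two example tokenizations are hypothetical, and it assumes both tokenizers keep the original characters (no quote or bracket normalization) and emit no empty tokens.

```python
from typing import List, Tuple


def char_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of each token in the whitespace-free text."""
    spans = []
    pos = 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def align(src_tokens: List[str], tgt_tokens: List[str]) -> List[Tuple[int, int]]:
    """For each source token, return the half-open range of target token
    indices whose characters overlap it."""
    src = char_spans(src_tokens)
    tgt = char_spans(tgt_tokens)
    if "".join(src_tokens) != "".join(tgt_tokens):
        raise ValueError("tokenizations do not cover the same characters")
    ranges = []
    for s_start, s_end in src:
        overlapping = [
            j for j, (t_start, t_end) in enumerate(tgt)
            if t_start < s_end and s_start < t_end
        ]
        ranges.append((overlapping[0], overlapping[-1] + 1))
    return ranges


# Two hypothetical tokenizations of the same sentence.
tokens_a = ["It", "cost", "$", "5", "in", "the", "mid-80s", "."]
tokens_b = ["It", "cost", "$5", "in", "the", "mid", "-", "80s", "."]

for tok, rng in zip(tokens_a, align(tokens_a, tokens_b)):
    print(tok, rng)
```

Here the single token `mid-80s` in the first tokenization maps to three tokens in the second, which is the kind of interval shift that has to be accounted for.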

niranjanb commented 11 years ago

Michael: A better fix is to reconstruct the text from OpenNLP's tokens so that the assigned token offsets match up with the source sentence offsets.
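
A rough Python sketch of that idea, assuming the tokens are simply joined with single spaces (the separator actually used in the project may differ). The function name `reconstruct` is hypothetical. Because the text is built from the tokens, every token's character offsets into that text are known exactly.

```python
from typing import List, Tuple


def reconstruct(tokens: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join tokens with single spaces and return the text together with
    each token's (start, end) character offsets in that text."""
    offsets = []
    pos = 0
    for i, tok in enumerate(tokens):
        if i > 0:
            pos += 1  # account for the joining space
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    return " ".join(tokens), offsets


# Hypothetical token sequence for illustration.
tokens = ["It", "cost", "$", "5", "."]
text, offsets = reconstruct(tokens)

# Each offset pair slices the reconstructed text back to its token.
for tok, (start, end) in zip(tokens, offsets):
    assert text[start:end] == tok
```

The reconstructed `text` is what would be handed to the NE tagger, so any character span it reports can be compared directly against `offsets`.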

niranjanb commented 11 years ago

Stanford's NE span can be a subset of a token span, e.g.

mid-80s' -- only 80s is recognized as a time unit.

niranjanb commented 11 years ago

Type assignment now reconstructs the text from tokens, which is then submitted to Stanford's NER.

Stanford NE types with sub-token spans are also allowed.
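
A minimal Python sketch of what allowing sub-token spans could look like, assuming tokens are joined with single spaces and entities arrive as character spans over that text. The names `token_offsets` and `assign_types`, the `allow_subtoken` flag, and the `TIME` label are illustrative assumptions. The entity span below is written by hand, not taken from actual tagger output.

```python
from typing import Dict, List, Tuple


def token_offsets(tokens: List[str]) -> List[Tuple[int, int]]:
    """Character offsets of each token in the space-joined text."""
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # +1 for the joining space
    return offsets


def assign_types(
    tokens: List[str],
    entities: List[Tuple[int, int, str]],
    allow_subtoken: bool = True,
) -> Dict[int, str]:
    """Map token index -> entity type.

    Strict mode types a token only if it lies fully inside an entity span.
    With allow_subtoken, any character overlap is enough, so an entity
    covering only part of a token still types that token."""
    types: Dict[int, str] = {}
    for ne_start, ne_end, ne_type in entities:
        for i, (start, end) in enumerate(token_offsets(tokens)):
            inside = ne_start <= start and end <= ne_end
            overlaps = start < ne_end and ne_start < end
            if inside or (allow_subtoken and overlaps):
                types[i] = ne_type
    return types


tokens = ["in", "the", "mid-80s'"]
# Hand-written span covering just "80s" inside the last token.
entities = [(11, 14, "TIME")]

print(assign_types(tokens, entities, allow_subtoken=False))
print(assign_types(tokens, entities, allow_subtoken=True))
```

With strict matching the last token gets no type, because the entity covers only part of it. With sub-token spans allowed, the token is typed.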