Closed CraigMiloRogers closed 8 years ago
The tagged phrase isolation code needs to distinguish between tagged words and non-tagged words. I haven't seen documentation on how this should be done. There are two possible indicators: the tagIndex and the tagName.
In the hair/eyes training model, the untagged words were given tagIndex 0, with tagName 'O'. I programmed the code to look for tagIndex 0.
In the name/ethnic training model, untagged words had tagIndex 4, with tagName 'O'.
I switched the code to check for tagName != 'O' so that it looks for explicitly tagged words. However, I'd like to find some documentation or code that supports this decision; otherwise the code is more likely to break when run with a future training model.
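For reference, here is a minimal sketch of the tagName-based check, not the actual project code. It assumes each token arrives as a `(word, tagIndex, tagName)` tuple and that 'O' always marks an untagged word, which is exactly the assumption I have not been able to confirm in documentation. The function name and the tag names `hair`, `eyes`, and `name` are hypothetical.

```python
from itertools import groupby

# Assumption: untagged words always carry the tag name 'O',
# whatever tag index the training model happens to assign them.
OUTSIDE_TAG = "O"


def isolate_tagged_phrases(tokens):
    """Group consecutive tokens that share a non-'O' tag name into phrases.

    tokens: iterable of (word, tag_index, tag_name) tuples (hypothetical layout).
    Returns a list of (tag_name, phrase) pairs in document order.
    The tag index is deliberately ignored.
    """
    phrases = []
    for tag_name, group in groupby(tokens, key=lambda t: t[2]):
        if tag_name == OUTSIDE_TAG:
            continue  # skip runs of untagged words
        phrase = " ".join(word for word, _, _ in group)
        phrases.append((tag_name, phrase))
    return phrases


# 'O' has tagIndex 0 here, as in the hair/eyes model.
hair_eyes_tokens = [
    ("long", 1, "hair"), ("blonde", 1, "hair"),
    ("and", 0, "O"),
    ("blue", 2, "eyes"), ("eyes", 2, "eyes"),
]

# 'O' has tagIndex 4 here, as in the name/ethnic model.
name_ethnic_tokens = [
    ("my", 4, "O"), ("name", 4, "O"), ("is", 4, "O"),
    ("Anna", 0, "name"),
]

print(isolate_tagged_phrases(hair_eyes_tokens))
print(isolate_tagged_phrases(name_ethnic_tokens))
```

A check on tagIndex 0 would treat "Anna" as untagged in the second example, which matches the kind of failure seen with the name/ethnic model. Checking the tag name avoids that, but only as long as 'O' stays the untagged label.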
I'm closing this for now, but the underlying problem (mysterious and undocumented behaviour from CRF++) remains unresolved.
When run with the name-ethnic model, CRF extraction returned the wrong phrases.