usc-isi-i2 / dig-crf

CRF++ extraction for DIG
Apache License 2.0

Gibberish with Name-Ethnic Extraction #7

Closed CraigMiloRogers closed 8 years ago

CraigMiloRogers commented 8 years ago

When run with the name-ethnic model, CRF extraction returned the wrong phrases.

CraigMiloRogers commented 8 years ago

The tagged phrase isolation code needs to distinguish between tagged words and non-tagged words. I haven't seen documentation on how this should be done. There are two possible indicators: the tagIndex and the tagName.

In the hair/eyes training model, the untagged words were given tagIndex 0, with tagName 'O'. I programmed the code to look for tagIndex 0.

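In the name/ethnic training model, untagged words had tagIndex 4, with tagName 'O'. The tagIndex 0 check therefore treated the wrong words as tagged, which produced the gibberish phrases.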

I switched the code to check for tagName != 'O' to look for explicitly tagged words. However, I'd like to find some documentation or code that supports this decision; otherwise the code is more likely to break when run with a future training model.
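For illustration, here is a minimal sketch of the tagName-based check described above. It is not the dig-crf code: it assumes the tagger output has already been collected as (word, tagName) pairs, and the function name `isolate_tagged_phrases` and the tag names `hairType` and `eyeColor` are hypothetical. It also relies on the assumption that untagged words are always labelled 'O', which is exactly the point that is not yet confirmed by documentation.

```python
from typing import Iterable, List, Tuple

# Assumption: the training model labels untagged words with the name 'O'.
# The numeric tag index is deliberately ignored, since it differed
# between the two models (0 in hair/eyes, 4 in name/ethnic).
OUTSIDE_TAG = "O"


def isolate_tagged_phrases(
    tagged_words: Iterable[Tuple[str, str]],
    outside_tag: str = OUTSIDE_TAG,
) -> List[Tuple[str, str]]:
    """Group consecutive words sharing the same non-outside tag name.

    Returns a list of (tagName, phrase) pairs in input order.
    """
    phrases: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_tag = None

    def flush() -> None:
        # Emit the phrase collected so far, if any.
        if current_words:
            phrases.append((current_tag, " ".join(current_words)))
            current_words.clear()

    for word, tag_name in tagged_words:
        if tag_name == outside_tag:
            # An untagged word ends any phrase in progress.
            flush()
            current_tag = None
        elif tag_name == current_tag:
            # Same tag as the previous word: extend the phrase.
            current_words.append(word)
        else:
            # A different tag starts a new phrase.
            flush()
            current_tag = tag_name
            current_words.append(word)

    flush()
    return phrases


example = [
    ("long", "hairType"),
    ("blonde", "hairType"),
    ("hair", "O"),
    ("and", "O"),
    ("blue", "eyeColor"),
    ("eyes", "O"),
]
result = isolate_tagged_phrases(example)
```

Because the comparison uses only the tag name, the same function gives the same grouping whether the model assigns 'O' the index 0 or 4. Keeping the outside tag as a parameter leaves one place to change if a future training model uses a different label for untagged words.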

CraigMiloRogers commented 8 years ago

I'm closing this for now, but the underlying problem (mysterious and undocumented behaviour from CRF++) remains unresolved.