nytimes / ingredient-phrase-tagger

Extract structured data from ingredient phrases using conditional random fields
http://open.blogs.nytimes.com/2016/04/27/structured-ingredients-data-tagging/

BIO tagging/chunking bug #12

Open tisdall opened 7 years ago

tisdall commented 7 years ago

The first entry for the test set looks like this:

1   I1  L12 NoCAP   NoPAREN B-QTY
boneless    I2  L12 NoCAP   NoPAREN I-COMMENT
pork    I3  L12 NoCAP   NoPAREN B-NAME
tenderloin  I4  L12 NoCAP   NoPAREN I-NAME
,   I5  L12 NoCAP   NoPAREN B-COMMENT
about   I6  L12 NoCAP   NoPAREN I-COMMENT
1   I7  L12 NoCAP   NoPAREN B-QTY
pound   I8  L12 NoCAP   NoPAREN I-COMMENT

The corresponding CSV entry is:

20000,"1 boneless pork tenderloin, about 1 pound",pork tenderloin,1.0,0.0,,"boneless, about 1 pound"

The second token should be labelled "B-COMMENT" because there's no COMMENT token preceding it.

The issue is with addPrefixes and bestTag. addPrefixes determines that '1' is both the QTY and part of the entry's comment, so it says the possible tags are ['B-COMMENT', 'B-QTY']. It then goes to the next token, determines that it's a COMMENT, and tags it as I-COMMENT because the previous token has B-COMMENT as a possible tag. bestTag then picks anything over a COMMENT, so it assigns B-QTY to the '1', and 'boneless' is left incorrectly tagged as I-COMMENT.
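To make the failure concrete, here is a minimal sketch that reproduces the behaviour as described above. It is a hypothetical reconstruction from this description, not the repository's code: the function names `add_prefixes_two_pass` and `best_tag`, and the candidate labels for each token, are assumptions made for illustration.

```python
# Hypothetical reconstruction of the two-pass behaviour described in this
# issue; names and logic are assumptions, not the repository's actual code.

def add_prefixes_two_pass(candidates):
    """Prefix each candidate label with B- or I- by looking at the previous
    token's *candidate* labels, not the label it finally receives."""
    prefixed = []
    prev = set()
    for labels in candidates:
        prefixed.append([("I-" if lab in prev else "B-") + lab for lab in labels])
        prev = set(labels)
    return prefixed


def best_tag(tags):
    """Prefer any tag that is not a COMMENT tag; fall back to the first tag."""
    for tag in tags:
        if not tag.endswith("-COMMENT"):
            return tag
    return tags[0]


# Assumed candidate labels for the tokens "1 boneless pork tenderloin".
candidates = [["COMMENT", "QTY"], ["COMMENT"], ["NAME"], ["NAME"]]
tags = [best_tag(t) for t in add_prefixes_two_pass(candidates)]
print(tags)  # ['B-QTY', 'I-COMMENT', 'B-NAME', 'I-NAME']
```

The I-COMMENT on the second token has no B-COMMENT before it, because the prefix was decided before the first token's final tag was known.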

Essentially, I think addPrefixes and bestTag should be combined into a single function since BIO chunking really needs to know what the previous tag is actually going to be.
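A rough sketch of what a combined function could look like follows. The name `assign_bio_tags` and the per-token candidate label lists are hypothetical, and the only selection rule carried over is the one mentioned in this issue (prefer anything over COMMENT); the real code may need more than this.

```python
# Hypothetical single-pass variant: choose the label first, then decide the
# B-/I- prefix from the label that was actually chosen for the previous token.

def assign_bio_tags(candidates):
    """Return one BIO tag per token from lists of unprefixed candidate labels."""
    tags = []
    prev_label = None
    for labels in candidates:
        # Assumed preference, as described in the issue: anything over COMMENT.
        label = next((lab for lab in labels if lab != "COMMENT"), labels[0])
        prefix = "I-" if label == prev_label else "B-"
        tags.append(prefix + label)
        prev_label = label
    return tags


# Assumed candidate labels for the tokens "1 boneless pork tenderloin".
candidates = [["COMMENT", "QTY"], ["COMMENT"], ["NAME"], ["NAME"]]
print(assign_bio_tags(candidates))  # ['B-QTY', 'B-COMMENT', 'B-NAME', 'I-NAME']
```

Because the prefix depends only on the previous final label, an I- tag can never appear without a matching B- or I- tag of the same type directly before it.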

Additionally, if the first instance of '1' is labelled QTY, it may be reasonable to label the second instance COMMENT, but that would be a separate issue from the BIO chunking.