nytimes / ingredient-phrase-tagger

Extract structured data from ingredient phrases using conditional random fields
http://open.blogs.nytimes.com/2016/04/27/structured-ingredients-data-tagging/

CRF Output #6

Open prakhar21 opened 8 years ago

prakhar21 commented 8 years ago

Hi, I am not able to understand what these tab-separated fields mean.

1            I1      L8      NoCAP  NoPAREN  B-QTY
cup          I2      L8      NoCAP  NoPAREN  B-UNIT
white        I3      L8      NoCAP  NoPAREN  B-NAME
wine         I4      L8      NoCAP  NoPAREN  I-NAME

Please help me out.

Thanks

ericagreene commented 8 years ago

@prakhar21 That is a list of the tokens (words) and their associated features. The associated code is here. The column on the right is the tag that we're trying to predict.
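To make the layout concrete, here is a minimal sketch that splits each row into the token (first column), the features (middle columns), and the tag (last column). The names `CrfRow`, `parse_crf_row`, and `split_tag` are hypothetical helpers written for this illustration, not part of the repository. Two assumptions are worth flagging: the rows are split on any whitespace (tabs or spaces), and the `B-`/`I-` prefixes are read in the usual BIO sense, where `B-` begins a labeled span and `I-` continues it. The feature names (`I1`, `L8`, `NoCAP`, `NoPAREN`) look like token position, a length bucket, and capitalization and parenthesis flags, but that reading is a guess from the names; the linked code is the authority.

```python
from typing import List, NamedTuple, Tuple


class CrfRow(NamedTuple):
    """One row of the feature file (hypothetical container for illustration)."""
    token: str
    features: List[str]
    tag: str


def parse_crf_row(line: str) -> CrfRow:
    """Split a row into token (first), features (middle), tag (last)."""
    # Assumption: fields are separated by tabs or spaces and contain no whitespace.
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected at least a token and a tag, got: {line!r}")
    return CrfRow(token=fields[0], features=fields[1:-1], tag=fields[-1])


def split_tag(tag: str) -> Tuple[str, str]:
    """Split a BIO-style tag such as 'B-QTY' into ('B', 'QTY')."""
    prefix, sep, label = tag.partition("-")
    if not sep:
        # Tags without a prefix are returned with an empty prefix.
        return "", tag
    return prefix, label


sample = """\
1            I1      L8      NoCAP  NoPAREN  B-QTY
cup          I2      L8      NoCAP  NoPAREN  B-UNIT
white        I3      L8      NoCAP  NoPAREN  B-NAME
wine         I4      L8      NoCAP  NoPAREN  I-NAME
"""

rows = [parse_crf_row(line) for line in sample.splitlines() if line.strip()]

for row in rows:
    prefix, label = split_tag(row.tag)
    print(f"{row.token:<6} features={row.features} tag={label} ({prefix})")
```

Under that BIO reading, "white" starts the NAME span and "wine" continues it, so the two tokens together form the ingredient name "white wine".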

Does that answer your question?

prakhar21 commented 8 years ago

@ericagreene Thanks, that answers my question. There is one more thing I wanted to clarify. When I train on all 180k examples and then validate on my own dataset, why are the predictions from the model trained on 20k examples more accurate than those from the model trained on 180k? This seems to go against model training principles. My understanding is that more data is always better for training. Please share your thoughts on this.

Thanks