scrapinghub / python-crfsuite

A python binding for crfsuite
MIT License
769 stars 223 forks

Handling unigram and bigram features at the same time in word2features #137

Open AbhishekBose opened 2 years ago

AbhishekBose commented 2 years ago

Hello, I am trying to perform an NER experiment on a custom dataset containing a lot of food items. I have labels for certain unigrams and bigrams for my training data.

My label corpus contains "green chilli" = "vegetable", but I don't have "chilli" as a label on its own. I am using this label list to annotate sentences for NER.

For example:

A sentence might contain a bigram such as "green chilli" with its associated label = "vegetable".

Currently, while generating the features, I am marking both "green" and "chilli" as "vegetable". My annotation pipeline is as follows:

As a result of point number 4, both "green" and "chilli" get marked as "vegetable".

So when I train my model and run inference on a test sentence containing "green chilli", I get "vegetable" twice, once for each token.
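To make the behaviour concrete, here is a minimal sketch of lexicon-based annotation that gives every token of a matched phrase the phrase's label. It is an illustration only, not the actual pipeline: the `annotate` helper, the tuple-keyed `lexicon`, and the `"O"` default label are all assumptions.

```python
def annotate(tokens, lexicon, default="O"):
    """Label tokens by longest-match lookup in a phrase lexicon.

    `lexicon` maps tuples of lowercased tokens to a label (hypothetical format).
    Every token inside a matched phrase receives the same label.
    """
    labels = [default] * len(tokens)
    max_n = max((len(key) for key in lexicon), default=0)
    i = 0
    while i < len(tokens):
        # Try the longest phrase first so bigrams win over unigrams.
        for n in range(max_n, 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if len(key) == n and key in lexicon:
                for j in range(i, i + n):
                    labels[j] = lexicon[key]
                i += n
                break
        else:
            # No phrase starts at this position; keep the default label.
            i += 1
    return labels


lexicon = {("green", "chilli"): "vegetable"}
tokens = "add one green chilli".split()
labels = annotate(tokens, lexicon)
```

With this lexicon, "green" and "chilli" each receive "vegetable", while a standalone "chilli" stays unlabeled because only the bigram is in the lexicon.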

What would be the best way to annotate this using word2features?
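For context, the sketch below shows one shape a `word2features` that looks at unigrams and bigrams together could take. It is a hedged illustration rather than a recommended answer: the feature names are hypothetical, and it assumes only that each token is described by a dict of features, which is one of the input formats python-crfsuite accepts.

```python
def word2features(sent, i):
    """Build a feature dict for token i from the token and its neighbours.

    `sent` is a list of token strings. Feature names are illustrative only.
    """
    word = sent[i].lower()
    features = {"bias": 1.0, "word.lower": word}

    if i > 0:
        prev = sent[i - 1].lower()
        features["-1:word.lower"] = prev
        # Bigram formed by the previous token and the current one.
        features["-1,0:bigram"] = prev + " " + word
    else:
        features["BOS"] = True  # beginning of sentence

    if i < len(sent) - 1:
        nxt = sent[i + 1].lower()
        features["+1:word.lower"] = nxt
        # Bigram formed by the current token and the next one.
        features["0,+1:bigram"] = word + " " + nxt
    else:
        features["EOS"] = True  # end of sentence

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


feats = sent2features(["green", "chilli"])
```

Here both tokens carry the bigram "green chilli" as a feature, so the model can see the phrase even though each token is still labeled individually.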