scrapinghub / python-crfsuite

A python binding for crfsuite
MIT License

Need more explanation about making features from word #63

Open karterotte opened 7 years ago

karterotte commented 7 years ago

I plan to do some Chinese named entity recognition, and I want to know how to build features for input like: "广东省中山市坦洲镇南坦路232号牡丹酒店"

In this sentence, each character should be treated the way a word would be in English. I have made training data like this:

广 B-Province 东 M-Province 省 E-Province 中 B-City 山 M-City 市 E-City ... 南 B-Road 坦 M-Road 路 E-Road 232 B-Number 号 E-Number

But I don't know how to convert it to pycrfsuite.ItemSequence. I want to add more connection features, like two words before | target word | two words after, and some position features like is_head and is_end.

Could you show a demo ? Thank you. ^ ^

kmike commented 7 years ago

crfsuite doesn't allow arbitrary CRFs; it implements only a linear-chain CRF model with 1st-order connections, i.e. there is a connection between the current (i-th) label and the previous ([i-1]) label, but there is no connection between the current label and the [i-2] label.
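For reference, a first-order linear-chain CRF of this kind can be written as below. This is the standard textbook form, not something copied from the crfsuite sources; the symbols (state features f_k with weights λ_k, transition features g_l with weights μ_l, normalizer Z) are notation assumed here for illustration.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Bigg(
    \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y_i, x, i)
  + \sum_{i=2}^{n} \sum_{l} \mu_l \, g_l(y_{i-1}, y_i)
\Bigg)
```

The second sum only ever pairs y_{i-1} with y_i, which is the "1st-order" restriction; the first sum may look at any position of x, which is why features from other words are allowed.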

You can work around this to some extent by using features from other words, just like in the existing NER example - it uses features from the previous and next words, but you can use features from any words in a sequence.

If you need arbitrary CRFs, you need another CRF package; I don't have experience with them, but factorie and pystruct look popular.
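A minimal sketch of the "features from other words" idea for the character-level case asked about above. The function names (char2features, sent2features), the feature keys, and the window size of 2 are all made up for illustration; whether these features actually help for Chinese NER is untested.

```python
def char2features(chars, i, window=2):
    """Build a feature dict for the i-th character of a sentence.

    Hypothetical feature set: the character itself, position flags,
    and the characters up to `window` positions to the left and right.
    """
    features = {
        "bias": 1.0,
        "char": chars[i],
        "char.isdigit": chars[i].isdigit(),
        "is_head": i == 0,               # first character of the sentence
        "is_end": i == len(chars) - 1,   # last character of the sentence
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Neighbours outside the sentence simply produce no feature.
        if 0 <= j < len(chars):
            features["%+d:char" % offset] = chars[j]
    return features


def sent2features(chars):
    """One feature dict per character, in sentence order."""
    return [char2features(chars, i) for i in range(len(chars))]


sentence = ["广", "东", "省", "中", "山", "市"]
X = sent2features(sentence)
```

Each element of X is a plain dict, which is one of the input shapes pycrfsuite.ItemSequence accepts; the label connections themselves stay 1st-order, only the observations look further away.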

karterotte commented 7 years ago

Thanks for your reply. One more thing I want to know is how to make features. Take a sentence like "ABC":

A label1 B label2 C label3

What are the features for A, B, and C in crfsuite? (A doesn't have a -1 label and C doesn't have a +1 label.) Could you show me a demo? Thanks!

kmike commented 7 years ago

@karterotte I'm not sure I understood your question correctly, but anyway :) Extracting features from A, B, C is up to you; I don't know Chinese, so I don't know which features are good for Chinese NER. For European languages you usually have features like "word=A", "word endswith ...", "word is in a dictionary of geo locations", etc., as well as the same features from some words around the current word.

There is an example in docs on how to extract such features: https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb (an alternative version: https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb).
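For the "ABC" case specifically, a rough sketch might look like this. It marks the sequence boundaries with BOS/EOS features, similar in spirit to the linked example; the helper names (token_features, train), the feature keys, and the model path "ner.crfsuite" are assumptions made up for illustration.

```python
def token_features(tokens, i):
    """Features for tokens[i]; boundaries get BOS/EOS instead of a neighbour."""
    feats = {"bias": 1.0, "token": tokens[i]}
    if i > 0:
        feats["-1:token"] = tokens[i - 1]
    else:
        feats["BOS"] = True   # A has no previous token
    if i < len(tokens) - 1:
        feats["+1:token"] = tokens[i + 1]
    else:
        feats["EOS"] = True   # C has no next token
    return feats


tokens = ["A", "B", "C"]
labels = ["label1", "label2", "label3"]
xseq = [token_features(tokens, i) for i in range(len(tokens))]


def train(xseq, yseq, model_path="ner.crfsuite"):
    """Train on a single sequence; model_path is a hypothetical file name."""
    import pycrfsuite  # imported here so the feature code runs without it

    trainer = pycrfsuite.Trainer(verbose=False)
    # ItemSequence wraps the list of feature dicts for one sentence.
    trainer.append(pycrfsuite.ItemSequence(xseq), yseq)
    trainer.train(model_path)
```

Calling train(xseq, labels) would fit a model on this one toy sequence. The missing -1 label for A and +1 label for C need no special handling in the features: the label-to-label transitions are learned by the CRF itself, and only the observation features are built by hand.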