karterotte opened this issue 7 years ago (status: Open)
crfsuite doesn't allow arbitrary CRFs; it implements only a linear-chain CRF model with first-order connections, i.e. there is a connection between the current (i-th) label and the previous ((i-1)-th) label, but there is no connection between the current label and the (i-2)-th label.
You can work around this to some extent by using features from other words, just like the existing NER example does: it uses features from the previous and next words, but you can use features from any words in the sequence.
If you need arbitrary CRFs, you need another CRF package; I don't have experience with them, but factorie and pystruct look popular.
Thanks for your reply. One more thing I want to know is how to make features. Take a sentence like "ABC":
A label1 B label2 C label3
What are the features for A, B, and C in crfsuite? (A has no previous (-1) label and C has no next (+1) label.) Could you show me a demo? Thanks!
@karterotte I'm not sure I understood your question correctly, but anyways :) Extracting features from A, B, C is up to you; I don't know Chinese and so don't know which features are good for Chinese NER. For European languages you usually have features like "word=A", "word endswith ...", "word is in a dictionary of geo locations", etc., as well as the same features from some words around the current word.
There is an example in docs on how to extract such features: https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb (an alternative version: https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb).
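For the "ABC" case, here is a minimal sketch in the spirit of the linked CoNLL 2002 notebook. The function name `token_features` and the feature names (`word=`, `-1:word=`, `BOS`, `EOS`) are only illustrative assumptions; crfsuite treats feature names as opaque strings, so any consistent naming works. The idea is that a token with no left or right neighbour simply gets a marker feature instead of a neighbour feature.

```python
def token_features(tokens, i):
    """Build a list of string features for tokens[i] (hypothetical helper)."""
    feats = ['bias', 'word=' + tokens[i]]

    if i > 0:
        # Feature taken from the previous token.
        feats.append('-1:word=' + tokens[i - 1])
    else:
        # No previous token: mark the beginning of the sequence.
        feats.append('BOS')

    if i < len(tokens) - 1:
        # Feature taken from the next token.
        feats.append('+1:word=' + tokens[i + 1])
    else:
        # No next token: mark the end of the sequence.
        feats.append('EOS')

    return feats


tokens = ['A', 'B', 'C']
labels = ['label1', 'label2', 'label3']

# One feature list per token; xseq and labels have the same length.
xseq = [token_features(tokens, i) for i in range(len(tokens))]

for token, feats, label in zip(tokens, xseq, labels):
    print(token, label, feats)
```

So A gets `BOS` instead of a `-1` feature and C gets `EOS` instead of a `+1` feature; the pair `(xseq, labels)` is what you would pass to a trainer.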
I plan to do some Chinese named entity recognition. I want to know how to make features in this case, for example: "广东省中山市坦洲镇南坦路232号牡丹酒店"
In this sentence, each char should be treated like "a word" in English. I have made training data like this:
But I don't know how to change it to `pycrfsuite.ItemSequence`. I want to add more connection features like `two words before | target word | two words after`, and some position features like `is_head` and `is_end`. Could you show a demo? Thank you. ^ ^
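A rough sketch of one way to do this, treating every character as a token. The helper names (`char_features`, `sentence_features`, `to_item_sequence`) and the feature names (`-2:char`, `is_head`, `is_end`, `is_digit`) are assumptions made up for illustration, not anything crfsuite prescribes. It relies on `pycrfsuite.ItemSequence` accepting a list of feature dicts, which is how python-crfsuite documents it; the import is kept inside the helper so the feature extraction itself runs without the package installed.

```python
def char_features(chars, i, window=2):
    """Features for chars[i]: the char itself, its position, and its neighbours."""
    feats = {
        'bias': 1.0,
        'char': chars[i],                 # target character
        'is_digit': chars[i].isdigit(),
        'is_head': i == 0,                # first character of the sentence
        'is_end': i == len(chars) - 1,    # last character of the sentence
    }
    # Characters up to `window` positions before and after the target.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(chars):
            feats['%+d:char' % offset] = chars[j]
    return feats


def sentence_features(sentence):
    """One feature dict per character of the sentence."""
    chars = list(sentence)
    return [char_features(chars, i) for i in range(len(chars))]


def to_item_sequence(sentence):
    """Wrap the feature dicts in a pycrfsuite.ItemSequence."""
    import pycrfsuite  # needs the python-crfsuite package
    return pycrfsuite.ItemSequence(sentence_features(sentence))


xseq = sentence_features("广东省中山市坦洲镇南坦路232号牡丹酒店")
print(xseq[0])    # first character: is_head is True, only +1/+2 neighbours
print(xseq[-1])   # last character: is_end is True, only -1/-2 neighbours
```

With a label list of the same length as the sentence, something like `trainer.append(to_item_sequence(sentence), labels)` on a `pycrfsuite.Trainer` should then work. Note that the window only adds features from neighbouring characters; the label connections stay first-order.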