scrapinghub / python-crfsuite

A python binding for crfsuite
MIT License

Question in the example (X_test declaration and Tagger's behavior). #49

Closed. jbkoh closed this issue 7 years ago.

jbkoh commented 7 years ago

Hi,

I am new to CRFs and am trying to use this library for an entire project. I have a question about the example (https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb).

When X_test is generated, the same feature generator (sent2features(sent)) that was used in the training stage is used. It includes features related to labels, which should be assumed to be unknown at the test stage. So I made another feature generator, sent2features_without_labels, that removes the features related to postags. It generally gives lower accuracy in the code, which means that the features related to labels are being used at the test stage.

Is this the correct behavior? I would like to know if I misunderstood anything. Also, in my understanding, shouldn't the raw sentence be the input to Tagger.tag() rather than feature vectors? (I checked the original CRFsuite (C++), and the author said the labels in the test dataset are ignored, but I haven't tested it myself.)

Thanks for sharing this project; I hope I can learn the entire package fast.

jbkoh commented 7 years ago

Otherwise, is optimal inference not implemented? (Should I add it?)

kmike commented 7 years ago

Hey @jbkoh,

> When X_test is generated, the same feature generator (sent2features(sent)) that was used in the training stage is used. It includes features related to labels, which should be assumed to be unknown at the test stage.

I think the example may be a bit confusing: POS tags are not labels, they are just additional features that are available in the dataset. The labels are the BIO tags for named entities; they are not used in X. We could have used a Spanish POS tagger instead of these precomputed POS tags, and that is what we should have done if the goal were to use this extractor for real-world tasks rather than to check how the method works on test data. But I believe the example is correct and there is no leak.
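To make the distinction concrete, here is a minimal sketch of a feature extractor over (word, POS tag, NER label) triples. It is not the notebook's actual code: the helper names (`word2features_sketch`, `sent2features_sketch`, `sent2labels_sketch`) and the hand-written sample sentence are assumptions for illustration. The point is only that the POS tag is read as an input feature, while the BIO label never enters X.

```python
# Illustrative sketch only: simplified stand-ins for the notebook's helpers.
# Each token is a (word, postag, label) triple in the CoNLL 2002 style.

def word2features_sketch(sent, i):
    """Build the feature list for token i; the NER label is never read."""
    word, postag, _label = sent[i]  # the label is deliberately unused
    features = [
        "bias",
        "word.lower=" + word.lower(),
        "word.istitle=%s" % word.istitle(),
        "postag=" + postag,  # POS tag is an observed input, not a target
    ]
    if i > 0:
        prev_word, prev_postag, _ = sent[i - 1]
        features.append("-1:word.lower=" + prev_word.lower())
        features.append("-1:postag=" + prev_postag)
    else:
        features.append("BOS")  # beginning-of-sentence marker
    return features


def sent2features_sketch(sent):
    """X for one sentence: one feature list per token."""
    return [word2features_sketch(sent, i) for i in range(len(sent))]


def sent2labels_sketch(sent):
    """y for one sentence: the BIO labels, kept separate from X."""
    return [label for _word, _postag, label in sent]


# Hand-written sample sentence (assumption, for illustration only).
sent = [("Melbourne", "NP", "B-LOC"), ("(", "Fpa", "O"), ("Australia", "NP", "B-LOC")]
X = sent2features_sketch(sent)
y = sent2labels_sketch(sent)
```

With this split, the same extractor can be applied to test sentences without touching their labels; only the POS tags have to be available, whether precomputed or produced by a tagger.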

> So I made another feature generator, sent2features_without_labels, that removes the features related to postags. It generally gives lower accuracy in the code, which means that the features related to labels are being used at the test stage.

If you remove a useful feature, quality gets lower; that's the point of using a feature :)

> Also, in my understanding, shouldn't the raw sentence be the input to Tagger.tag() rather than feature vectors? (I checked the original CRFsuite (C++), and the author said the labels in the test dataset are ignored, but I haven't tested it myself.)

Tagger.tag takes a sequence of feature vectors as input; why would it need raw sentences? Raw sentences are of no use to the tagger, because there are many ways to convert a raw sentence to features. I'm not sure I understand what you mean.

jbkoh commented 7 years ago

Thanks for the quick response, @kmike.

I see that I misunderstood the dataset: I thought the POS tags were labels. In that case, I should rephrase my question as follows:

In CRF learning, the previous word's label can be used in features like f_k(y_{i-1}, y_i, x_i). How can I add such label-related features to the Trainer? Also, how can I add such features to the inputs of Tagger.tag()? If Tagger.tag() only receives features whose values are already determined (numbers or booleans), then in my understanding it cannot express features related to labels.

Thanks again for the response and for sharing; I would like to learn more!

kmike commented 7 years ago

> In CRF learning, the previous word's label can be used in features like f_k(y_{i-1}, y_i, x_i). How can I add such label-related features to the Trainer? Also, how can I add such features to the inputs of Tagger.tag()? If Tagger.tag() only receives features whose values are already determined (numbers or booleans), then in my understanding it cannot express features related to labels.

CRFsuite doesn't support arbitrary f_k(y_{i-1}, y_i, x_i) features; it implements two kinds of features: I(y_i=a)*f(xseq) (called "state features") and I(y_i=a)*I(y_{i-1}=b) ("transition features"). f(xseq) is what you pass to the tagger/trainer; the I(y_i=a)*f(xseq) and I(y_i=a)*I(y_{i-1}=b) features are auto-generated for all possible labels a and b.
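As a rough illustration of that expansion, the pure-Python sketch below enumerates both feature kinds from a set of observation attributes and a label set, then scores a label sequence with them. It does not use or mirror CRFsuite's internals: `expand_features`, `sequence_score`, the attribute strings and the weights are all hypothetical.

```python
from itertools import product


def expand_features(attributes, labels):
    """Enumerate the two feature kinds for a given attribute and label set.

    attributes: observation attribute names, i.e. the f(xseq) you pass in.
    labels: the label set.
    """
    # State features: I(y_i=a) * f(xseq), one per (attribute, label) pair.
    state = [(attr, a) for attr, a in product(sorted(attributes), labels)]
    # Transition features: I(y_{i-1}=b) * I(y_i=a), one per label pair.
    transition = [(b, a) for b, a in product(labels, labels)]
    return state, transition


def sequence_score(xseq, yseq, state_w, trans_w):
    """Unnormalized linear-chain score of yseq given xseq."""
    score = 0.0
    for i, (attrs, a) in enumerate(zip(xseq, yseq)):
        # State part: depends on the observation attributes and the label.
        score += sum(state_w.get((attr, a), 0.0) for attr in attrs)
        # Transition part: depends on the label pair only, never on xseq.
        if i > 0:
            score += trans_w.get((yseq[i - 1], a), 0.0)
    return score


labels = ["B-LOC", "O"]
state, transition = expand_features({"postag=NP", "postag=Fpa"}, labels)

# Hypothetical weights; a trainer would learn these from data.
state_w = {("postag=NP", "B-LOC"): 1.5, ("postag=Fpa", "O"): 1.0}
trans_w = {("B-LOC", "O"): 0.5}

xseq = [["postag=NP"], ["postag=Fpa"]]
score = sequence_score(xseq, ["B-LOC", "O"], state_w, trans_w)
```

Note that the caller only supplies `xseq`; every label-dependent indicator is generated from the label set, which is why nothing label-related has to be passed in at tagging time.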

It is not possible to condition transition probabilities on the input in CRFsuite, i.e. I(y_i=a)*I(y_{i-1}=b)*f(xseq) features are not implemented. If you need them, you'll have to use another CRF package, e.g. wapiti (and python-wapiti).
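For comparison, this is roughly what such an observation-conditioned transition term would compute. The sketch is purely illustrative and is neither CRFsuite nor wapiti API: the function `conditioned_transition_score`, the attribute strings and the single weight are assumptions.

```python
def conditioned_transition_score(xseq, yseq, cond_w):
    """Score from hypothetical I(y_i=a) * I(y_{i-1}=b) * f(xseq) features.

    cond_w maps (previous label, current label, attribute) to a weight,
    so the same label transition can score differently per observation.
    """
    score = 0.0
    for i in range(1, len(yseq)):
        for attr in xseq[i]:
            score += cond_w.get((yseq[i - 1], yseq[i], attr), 0.0)
    return score


# One hypothetical weight: O -> B-LOC is favoured after the word "en".
cond_w = {("O", "B-LOC", "-1:w=en"): 2.0}
yseq = ["O", "B-LOC"]

score_after_en = conditioned_transition_score(
    [["w=en"], ["-1:w=en"]], yseq, cond_w)
score_after_dijo = conditioned_transition_score(
    [["w=dijo"], ["-1:w=dijo"]], yseq, cond_w)
```

Here the same O to B-LOC transition gets a different score depending on the observation, which a single per-label-pair transition weight cannot express.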

jbkoh commented 7 years ago

I got it. I will compare the results with and without conditional transitions and choose a library. Thank you for the kind explanation!