scrapinghub / python-crfsuite

A python binding for crfsuite
MIT License

add another format in ItemSequence #24

Closed. Franck-Dernoncourt closed this 9 years ago

Franck-Dernoncourt commented 9 years ago

A ["string_key1=string_value1", "string_key2=string_value2", ...] list is actually the format used in the CoNLL 2002 example: http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb

tpeng commented 9 years ago

thanks!

kmike commented 9 years ago

Isn't it the same as the previous format?

["string_key1", "string_key2", ...] list; that's the same as {"string_key1": 1.0, "string_key2": 1.0, ...}

string_key1=string_value1 is not a special format, it is just a convention on how to create strings.
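The equivalence described here can be sketched in plain Python. `to_weighted_dict` is a hypothetical helper written for illustration, not part of python-crfsuite; it only shows that a list of feature strings behaves like a dict with weight 1.0 for each string, and that `=` is an ordinary character inside the feature name.

```python
def to_weighted_dict(features):
    """Hypothetical helper: map a list of feature-name strings to {name: 1.0}."""
    return {name: 1.0 for name in features}


# Feature strings built with the "key=value" naming convention.
as_list = ["word.lower=madrid", "word.istitle=True"]
as_dict = to_weighted_dict(as_list)

# The "=" is never split: the whole string is the feature name.
print(as_dict)
```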

Franck-Dernoncourt commented 9 years ago

Hmm good point, I think you're right, sorry about that.

In that case though, in the CoNLL 2002 example, "word.isupper=True" is one binary feature and "word.isupper=False" is another binary feature. Shouldn't they be merged into a single feature? Having both looks a bit inefficient and, more importantly, is potentially misleading for readers (it led me to believe = would be parsed).
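To make the two options concrete, here is a minimal sketch of both encodings. The function names are hypothetical and chosen for illustration; only the "word.isupper=True" / "word.isupper=False" strings come from the example.

```python
def two_feature_encoding(word):
    # Always emits exactly one string, so True and False are separate features.
    return ["word.isupper=%s" % word.isupper()]


def single_feature_encoding(word):
    # Emits the feature only when the property holds; absence stands for False.
    return ["word.isupper"] if word.isupper() else []


for token in ["EU", "rejects"]:
    print(token, two_feature_encoding(token), single_feature_encoding(token))
```

With the merged encoding, a lowercase token contributes no feature at all for this property.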

kmike commented 9 years ago

Hm, maybe you're right, but this is tricky. As you can see in the example, the positive and negative features didn't get equal weights, e.g.

 3.942852 O      word.istitle=False
-2.913103 O      word.istitle=True

Without word.istitle=False it won't be possible to assign a negative weight to tokens which are not title-cased (anything multiplied by 0 is 0), so this weight will be 'spread' over all other features. Including both features also affects regularization: if I'm not mistaken, the word.istitle feature will be under-regularized with an L2 penalty when both word.istitle=False and word.istitle=True are present. So a model with a single feature is different from a model with two features. I don't know which is better, though.
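A toy calculation of the two points above, using the weights for label O from the example. This is only a sketch under simplifying assumptions: it looks at one feature in isolation, ignores how the other features and the optimizer interact, and the function names are hypothetical. The "split" penalty is the smallest L2 cost that two weights need to express a given gap, not the penalty of the weights actually learned.

```python
# Weights for label O, copied from the example output above.
w_false = 3.942852   # word.istitle=False
w_true = -2.913103   # word.istitle=True


def contribution_two_features(is_title):
    # Exactly one of the two indicator features fires for every token.
    return w_true if is_title else w_false


def contribution_single_feature(is_title, w):
    # One indicator: its value is 1 for title-cased tokens and 0 otherwise,
    # so tokens that are not title-cased always contribute w * 0 = 0.
    return w * (1.0 if is_title else 0.0)


# Score difference between title-cased and other tokens for label O.
gap = contribution_two_features(True) - contribution_two_features(False)

# L2 cost of expressing that gap: two weights can split it (+gap/2, -gap/2),
# while a single weight has to carry the whole gap by itself.
l2_split = 2 * (gap / 2) ** 2
l2_single = gap ** 2

print(contribution_single_feature(False, gap))
print(l2_split, l2_single)
```

In this simplified view the two-feature encoding pays half the L2 penalty for the same gap, which is one way to read the "under-regularized" remark.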

kmike commented 9 years ago

@tpeng is it OK to revert this change?

tpeng commented 9 years ago

sure! Go ahead
