nltk / nltk_data

NLTK Data

Re-train maxent_treebank_pos_tagger #3

Closed kmike closed 7 years ago

kmike commented 11 years ago

The tagger currently doesn't unpickle under Python 3.x. I guess this is because of http://bugs.python.org/issue6784: the Treebank corpus reader returned bytestrings under Python 2.x, and the pickled classifier was trained on them; Python 3.x tries to decode them to unicode, which fails because the encoding is unknown. I think the way to fix this is to re-train the classifier on Python 2.x, but with unicode strings as features; this should be backwards-compatible, if I'm not mistaken.
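
For anyone who wants to see the failure mode in isolation, here is a minimal sketch that runs under Python 3. It does not touch the real tagger pickle: `PY2_PICKLE` is a hand-written stand-in for a Python 2 bytestring containing a non-ASCII byte, and `try_load` is a hypothetical helper. The only real API used is `pickle.loads` with its `encoding` argument.

```python
import pickle

# Hypothetical stand-in for the tagger pickle: a hand-written protocol-2
# pickle of the Python 2 bytestring 'caf\xe9' (SHORT_BINSTRING opcode "U").
PY2_PICKLE = b"\x80\x02U\x04caf\xe9q\x00."


def try_load(data, encoding="ASCII"):
    """Unpickle `data`; return the value, or the exception name on failure."""
    try:
        return pickle.loads(data, encoding=encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


default_result = try_load(PY2_PICKLE)            # default ASCII decoding
latin1_result = try_load(PY2_PICKLE, "latin-1")  # caller supplies an encoding
bytes_result = try_load(PY2_PICKLE, "bytes")     # keep the raw bytestring

print(default_result, latin1_result, bytes_result)
```

Passing `encoding="latin-1"` or `encoding="bytes"` only works around the problem on the loading side, and the features would then be bytes or guessed text rather than what the classifier was trained on. That is why re-training with unicode features looks like the cleaner fix.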

alvations commented 7 years ago

Superseded by the new averaged_perceptron_tagger.