scrapinghub / python-crfsuite

A python binding for crfsuite
MIT License
770 stars, 222 forks

UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128) #96

Open luvensaitory opened 6 years ago

luvensaitory commented 6 years ago

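My data consists of Chinese math questions, for example: 小蓉吃了8顆水餃,小宇吃了10顆水餃,誰吃的水餃比較多? ( )吃的多 (In English: "Xiao Rong ate 8 dumplings and Xiao Yu ate 10 dumplings. Who ate more dumplings? ( ) ate more.")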

And the training data is:

小 小蓉 人名 B-人名
蓉 小蓉 人名 E-人名
吃 吃 VC S
了 了 Di S
8 8 Neu S
顆 顆 Nf S
水 水餃 Na S
餃 水餃 Na S
, , COMMACATEGORY S
小 小宇 人名 B-人名
宇 小宇 人名 E-人名
吃 吃 VC S
了 了 Di S
1 10 Neu S
0 10 Neu S
顆 顆 Nf S
水 水餃 Na S
餃 水餃 Na S
, , COMMACATEGORY S
誰 誰 Nh S
吃 吃 VC S
的 的 DE S
水 水餃 Na S
餃 水餃 Na S
比 比較 Dfa S
較 比較 Dfa S
多 多 VH S
? ? QUESTIONCATEGORY S
( ( PARENTHESISCATEGORY S
) ) PARENTHESISCATEGORY S
吃 吃 VC S
的 的 DE S
多 多 VH S
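A possible lead, not confirmed anywhere in this thread: the last field of the first training row, B-人名, has its two non-ASCII characters at positions 2 and 3, which matches the "position 2-3" in the issue title. The sketch below uses plain Python only (no python-crfsuite calls) and assumes the last field is the label. It shows that ASCII-encoding that one string produces the same message, which suggests the labels may be what is being ASCII-encoded.

```python
# Assumption: the last field of each training row is the label.
label = "B-人名"

try:
    label.encode("ascii")
    message = None
except UnicodeEncodeError as exc:
    # Keep the text of the error so it can be compared with the issue title.
    message = str(exc)

print(message)
```

If the printed message matches the one in your traceback, the labels are worth checking before the features.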

umoqnier commented 5 years ago

I have the same problem with Otomí (a Mexican language). My traceback looks like this:

'ascii' codec can't encode character '\xe9' in position 8: ordinal not in range(128)

And the first three elements of the xseq list look like this:

[[b'bias', b'letterLowercase=d', b'postag=unkwn', b'BOS', b'nxtpostag=cnj', b'BOW', b'nxtletter=<i', b'nxt2letters=<ig', b'nxt3letters=<ige', b'nxt4letters=<igeh'], [b'bias', b'letterLowercase=i', b'postag=unkwn', b'BOS', b'nxtpostag=cnj', b'letterposition=-7', b'prevletter=d>', b'nxtletter=<g', b'nxt2letters=<ge', b'nxt3letters=<geh', b'nxt4letters=<geh\xc3\xb1'], [b'bias', b'letterLowercase=g', b'postag=unkwn', b'BOS', b'nxtpostag=cnj', b'letterposition=-6', b'prev2letters=di>', b'prevletter=i>', b'nxtletter=<e', b'nxt2letters=<eh', b'nxt3letters=<eh\xc3\xb1', b'nxt4letters=<eh\xc3\xb1a']]

In a previous step I tried to encode the features like this, but it does not seem to work properly:

featurelist.append([f.encode('utf-8') for f in features])
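Since the error message mentions a character ('\xe9') rather than a byte, some value reaching the library is presumably still a str even though the features above are bytes. A minimal diagnostic sketch follows, assuming the data is a list of feature lists (xseq) plus a parallel list of labels (yseq). The function name find_non_ascii_text and the sample values are hypothetical and do not come from the thread.

```python
def find_non_ascii_text(xseq, yseq):
    """Return (kind, index, value) for every str that ASCII cannot encode.

    bytes values are skipped because they are already encoded.
    """
    problems = []
    for i, features in enumerate(xseq):
        for feature in features:
            if isinstance(feature, str) and not feature.isascii():
                problems.append(("feature", i, feature))
    for i, label in enumerate(yseq):
        if isinstance(label, str) and not label.isascii():
            problems.append(("label", i, label))
    return problems


# Hypothetical sample: one feature and one label were left as non-ASCII str.
xseq = [
    [b"bias", b"nxt4letters=<geh\xc3\xb1"],
    [b"bias", "prevletter=\u00f1>"],
]
yseq = ["B-stem", "I-ra\u00edz"]

problems = find_non_ascii_text(xseq, yseq)
for kind, index, value in problems:
    print(kind, index, repr(value))
```

Running this over the real xseq and yseq before training should show whether a label or a leftover str feature is the value that fails to encode.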

Weber12321 commented 1 year ago

Has this problem been solved? I get the same error when trying to train an NER model on Chinese text.
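No fix is confirmed in this thread. If the non-ASCII labels turn out to be the cause, which is an assumption, one workaround to try is to replace each label with an ASCII alias before training and map the predictions back afterwards. The sketch below is plain Python and does not call python-crfsuite. The helper build_label_aliases and the alias format (L0, L1, ...) are hypothetical.

```python
def build_label_aliases(label_sequences):
    """Give each distinct label an ASCII alias ('L0', 'L1', ...)."""
    to_alias = {}
    for seq in label_sequences:
        for label in seq:
            if label not in to_alias:
                to_alias[label] = f"L{len(to_alias)}"
    # Reverse mapping, used to restore the original labels after tagging.
    from_alias = {alias: label for label, alias in to_alias.items()}
    return to_alias, from_alias


# Labels taken from the first rows of the training data in this issue.
yseqs = [["B-人名", "E-人名", "S", "S"]]

to_alias, from_alias = build_label_aliases(yseqs)

# ASCII-only labels that would be handed to the trainer instead of yseqs.
ascii_yseqs = [[to_alias[y] for y in seq] for seq in yseqs]

# After tagging, predicted aliases are mapped back to the original labels.
restored = [[from_alias[a] for a in seq] for seq in ascii_yseqs]

print(ascii_yseqs)
print(restored)
```

The alias table has to be saved alongside the model so that predictions can be decoded later.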