UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

luvensaitory commented 6 years ago

Details

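My data consists of Chinese math questions, for example: 小蓉吃了8顆水餃，小宇吃了10顆水餃，誰吃的水餃比較多？ ( )吃的多 (in English: "Xiao Rong ate 8 dumplings, Xiao Yu ate 10 dumplings; who ate more dumplings? ( ) ate more")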

And the training data is:

小 小蓉 人名 B-人名
蓉 小蓉 人名 E-人名
吃 吃 VC S
了 了 Di S
8 8 Neu S
顆 顆 Nf S
水 水餃 Na S
餃 水餃 Na S
， ， COMMACATEGORY S
小 小宇 人名 B-人名
宇 小宇 人名 E-人名
吃 吃 VC S
了 了 Di S
1 10 Neu S
0 10 Neu S
顆 顆 Nf S
水 水餃 Na S
餃 水餃 Na S
， ， COMMACATEGORY S
誰 誰 Nh S
吃 吃 VC S
的 的 DE S
水 水餃 Na S
餃 水餃 Na S
比 比較 Dfa S
較 比較 Dfa S
多 多 VH S
？ ？ QUESTIONCATEGORY S
( ( PARENTHESISCATEGORY S
) ) PARENTHESISCATEGORY S
吃 吃 VC S
的 的 DE S
多 多 VH S
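The positions reported in the error (2-3) line up with the two non-ASCII characters in the label B-人名, so the labels may be worth checking as well as the features. Below is a minimal sketch for locating such strings. The helper names `non_ascii_positions` and `report_non_ascii` are hypothetical and not part of python-crfsuite, and the label list is copied by hand from the training data above.

```python
def non_ascii_positions(text):
    """Return the indices of characters in `text` that fall outside ASCII."""
    return [i for i, ch in enumerate(text) if ord(ch) > 127]


def report_non_ascii(strings):
    """Map each string that is not pure ASCII to its offending positions."""
    report = {}
    for s in strings:
        positions = non_ascii_positions(s)
        if positions:
            report[s] = positions
    return report


# Distinct labels from the training data in this comment (assumed complete).
labels = ["B-人名", "E-人名", "S"]
print(report_non_ascii(labels))
# prints {'B-人名': [2, 3], 'E-人名': [2, 3]}
```

Running the same check over every string handed to the trainer (features and labels alike) should show which one the 'ascii' codec is rejecting.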

umoqnier commented 5 years ago

I have the same problem with Otomí (a Mexican language). My traceback looks like this:

'ascii' codec can't encode character '\xe9' in position 8: ordinal not in range(128)

And the first three elements of the xseq list look like this:

[[b'bias', b'letterLowercase=d', b'postag=unkwn', b'BOS', b'nxtpostag=cnj', b'BOW', b'nxtletter=<i', b'nxt2letters=<ig', b'nxt3letters=<ige', b'nxt4letters=<igeh'], [b'bias', b'letterLowercase=i', b'postag=unkwn', b'BOS', b'nxtpostag=cnj', b'letterposition=-7', b'prevletter=d>', b'nxtletter=<g', b'nxt2letters=<ge', b'nxt3letters=<geh', b'nxt4letters=<geh\xc3\xb1'], [b'bias', b'letterLowercase=g', b'postag=unkwn', b'BOS', b'nxtpostag=cnj', b'letterposition=-6', b'prev2letters=di>', b'prevletter=i>', b'nxtletter=<e', b'nxt2letters=<eh', b'nxt3letters=<eh\xc3\xb1', b'nxt4letters=<eh\xc3\xb1a']]

In a previous step I tried encoding the features like this, but it does not seem to work properly:

featurelist.append([f.encode('utf-8') for f in features])
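Since the features above are already UTF-8 bytes, the failing string may be something else passed to the trainer, possibly a label. This is an assumption, since the thread does not confirm the cause. Under that assumption, one possible workaround is to replace each label with an ASCII-only alias before training and map predictions back afterwards. The sketch below uses only the standard library. `build_label_aliases` is a hypothetical helper, and the example labels are borrowed from the Chinese data earlier in the thread because the Otomí labels are not shown.

```python
def build_label_aliases(labels):
    """Assign each distinct label an ASCII-only alias such as 'L0', 'L1', ..."""
    to_ascii = {}
    for label in labels:
        if label not in to_ascii:
            to_ascii[label] = "L%d" % len(to_ascii)
    # Reverse mapping, used to restore the original labels after prediction.
    to_original = {alias: label for label, alias in to_ascii.items()}
    return to_ascii, to_original


# Example label sequence (illustrative; not taken from the Otomí data).
yseq = ["B-人名", "E-人名", "S", "S"]
to_ascii, to_original = build_label_aliases(yseq)

ascii_yseq = [to_ascii[y] for y in yseq]         # ASCII-safe labels for training
restored = [to_original[a] for a in ascii_yseq]  # original labels recovered

print(ascii_yseq)
# prints ['L0', 'L1', 'L2', 'L2']
```

The alias sequence would be passed to the trainer in place of the original labels, and `to_original` applied to whatever the tagger predicts.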

Weber12321 commented 1 year ago

Has this problem been solved? I get the same error when trying to train an NER model on Chinese text too...

scrapinghub / python-crfsuite

UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128) #96