twairball / fairseq-zh-en

NMT for Chinese-English using fairseq

Error during tokenization #6

Closed shuangzixing closed 5 years ago

shuangzixing commented 5 years ago

In news-commentary-v12.zh-en.en, around line 98000, there is a passage of other text with a different encoding, and it raises: UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 21: ordinal not in range(128). How can this be resolved?

shuangzixing commented 5 years ago

Processing the Chinese text fails immediately with UnicodeEncodeError: 'ascii' codec can't encode character '\u5e74' in position 5: ordinal not in range(128)

shuangzixing commented 5 years ago

Should the code be changed to f = open(data_filepath, 'w', encoding='utf-8')?

twairball commented 5 years ago

yes
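
For anyone hitting the same error, here is a minimal sketch of the fix confirmed above. It assumes the failure comes from the output file being opened with a locale default of ASCII. Only `data_filepath` and the `open(..., encoding='utf-8')` call come from this thread; the helper names `write_lines` and `read_lines`, the sample strings, and the temporary path are hypothetical and are not the repo's actual code.

```python
import os
import tempfile


def write_lines(data_filepath, lines):
    # Pass encoding explicitly so the write does not depend on the
    # locale default (which is ASCII in some environments).
    with open(data_filepath, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line + '\n')


def read_lines(data_filepath):
    # Read back with the same explicit encoding.
    with open(data_filepath, 'r', encoding='utf-8') as f:
        return [line.rstrip('\n') for line in f]


# Hypothetical samples containing the two characters from the tracebacks:
# '\u5e74' (a CJK character) and '\u2013' (en dash).
samples = ['2017\u5e74', 'pages 10\u201312']

tmp_path = os.path.join(tempfile.mkdtemp(), 'tokenized.txt')
write_lines(tmp_path, samples)
round_trip = read_lines(tmp_path)
```

The same explicit `encoding='utf-8'` argument probably belongs on the calls that read the corpus too, since a read under an ASCII default would likely fail on the same lines.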