rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License
2.18k stars 464 forks

solve unicode errors when reading the bpe codes file #92

Closed yimmon closed 4 years ago

yimmon commented 4 years ago

Python's codecs module treats some special Unicode characters, such as \u2028 and \u2029, as line separators. This can cause errors when loading the BPE codes file.

For example, in the following file, the entry . \u2028</w> is read as two lines, which raises the error "The line should exist of exactly two subword units, separated by whitespace".

se u
. \u2028</w>
p as
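
The behaviour can be reproduced without subword-nmt. The snippet below is only a minimal sketch, not the code in this PR: it feeds the three-line example above through a codecs stream reader and through io.TextIOWrapper (the text layer behind io.open) and compares how many lines each one yields. The in-memory buffer and the variable names are assumptions made for illustration.

```python
import codecs
import io

# The three-line example file from above, held in memory as UTF-8 bytes.
# The second entry contains U+2028 (LINE SEPARATOR).
data = u"se u\n. \u2028</w>\np as\n".encode("utf-8")

# codecs stream readers split lines with str.splitlines(), which also
# treats U+2028 and U+2029 as line boundaries.
codecs_lines = list(codecs.getreader("utf-8")(io.BytesIO(data)))

# io text streams only split on "\n", "\r" and "\r\n".
io_lines = list(io.TextIOWrapper(io.BytesIO(data), encoding="utf-8"))

print(len(codecs_lines), [ascii(line) for line in codecs_lines])  # 4 lines
print(len(io_lines), [ascii(line) for line in io_lines])          # 3 lines
```

With the codecs reader the second entry is split after \u2028, so the loader sees four lines instead of three, and the fragment </w> on its own no longer consists of two subword units.
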
rsennrich commented 4 years ago

thanks!