Python's codecs module treats some special Unicode characters, such as \u2028 and \u2029, as line separators.
This can cause errors when loading the BPE codes file.
For example, if the file contains the entry . \u2028</w>, that entry is read as two lines, raising the error "The line should exist of exactly two subword units, separated by whitespace".
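
The behaviour can be reproduced with a minimal sketch. It reads the same bytes once through a codecs stream reader and once through an io text stream. The in-memory buffer and the variable names are illustrative assumptions that stand in for the real codes file, and this is not the project's actual loading code.

```python
import codecs
import io

# One BPE codes entry: the units "." and "\u2028</w>", separated by a space.
raw = ". \u2028</w>\n".encode("utf-8")

# codecs stream readers split lines with str.splitlines(),
# which treats U+2028 as a line boundary.
codecs_lines = list(codecs.getreader("utf-8")(io.BytesIO(raw)))

# io text streams only recognise \n, \r and \r\n as line endings.
io_lines = list(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8"))

print(len(codecs_lines), len(io_lines))
```

Under these assumptions, the codecs reader yields two lines for the single entry, while the io reader keeps it intact. This suggests that reading the codes file through io is one way to avoid the error.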