These tokenizers sometimes output characters that cause trouble when printing with the default encoding on Windows, so explicitly encode as utf8. Not all tests currently generate such characters, but the extra safety seems helpful.
Sample response from whisper-small:

```
Response: b'<|startoftranscript|><|notimestamps|>What is nature of our existence?<|endoftext|>\xef\xbf\xbd\xef\xbf\xbd<|zh|>\xd0\xb7\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xe5\x84\x89\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd<|endoftext|>'
```
Fixes https://github.com/nod-ai/SHARK-TestSuite/issues/105
UnicodeEncodeError: 'charmap' codec can't encode characters in position 82-83: character maps to <undefined>
(Those are the `\xef\xbf\xbd` bytes, the UTF-8 encoding of the Unicode "replacement character": https://stackoverflow.com/a/11162470, https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29.)

An alternate approach is to set the environment variable `PYTHONIOENCODING=utf-8` or `PYTHONUTF8=1`.
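The explicit-encode fix can be sketched as follows; the decoded string below is illustrative (modeled on the whisper-small sample above), not taken from an actual test run:

```python
# On Windows, a console using a legacy codepage makes print() go through the
# 'charmap' codec, which raises UnicodeEncodeError for characters like U+FFFD.
# Encoding the string to utf-8 bytes first sidesteps the console codec.

# Illustrative decoded tokenizer output containing U+FFFD replacement characters.
decoded = (
    "<|startoftranscript|><|notimestamps|>"
    "What is nature of our existence?<|endoftext|>\ufffd\ufffd"
)

# print(decoded) may fail on a cp1252 console; printing the utf-8 bytes is safe
# regardless of the console's default encoding.
print(decoded.encode("utf8"))
```

The trade-off is that the printed value is a `bytes` repr (as in the sample response above) rather than readable text, but the test harness no longer crashes mid-run.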