nod-ai / SHARK-TestSuite

Temporary home of a test suite we are evaluating
Apache License 2.0
2 stars 29 forks source link

Encode tokenizer outputs as utf8 before printing them. #247

Closed ScottTodd closed 4 months ago

ScottTodd commented 4 months ago

Fixes https://github.com/nod-ai/SHARK-TestSuite/issues/105

These tokenizers sometimes output characters that cause trouble when printing with the default encoding on Windows, so explicitly encode as utf8. Not all tests currently generate such characters, but the extra safety seems helpful.

Sample response from whisper-small: Response: b'<|startoftranscript|><|notimestamps|>What is nature of our existence?<|endoftext|>\xef\xbf\xbd\xef\xbf\xbd<|zh|>\xd0\xb7\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xe5\x84\x89\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd<|endoftext|>'

An alternate approach is to set the environment variable PYTHONIOENCODING=utf-8 or PYTHONUTF8=1.