patrickschu / textgrid-convert

textgrid-convert converts audio transcripts such as sbv or srt files to Praat and DARLA compatible TextGrids.

UnicodeDecodeError: 'cp950' codec #34

Closed hey0wing closed 3 years ago

hey0wing commented 3 years ago
UnicodeDecodeError: 'cp950' codec can't decode byte 0xe8 in position 32: illegal multibyte sequence

Please specify encoding="utf-8" in the convert_to_txtgrid function.

patrickschu commented 3 years ago

Hi @hey0wing, thanks for filing. I'm trying to replicate this; can you specify the following:

If so, we might be able to address. Thanks

hey0wing commented 3 years ago

OS: Windows, Usage: python -m textgrid_convert -i transcription.srt

The problem: I was trying to convert an srt file with Chinese text (UTF-8 encoded) to a TextGrid, but the convert_to_txtgrid function did not specify encoding="utf-8" in with open(file) as sourceFile:. Therefore, the default open treated the Chinese characters as cp950 instead of utf-8.
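For illustration, here is a minimal sketch of the kind of change being requested. It is not the package's actual code: read_transcript is a hypothetical helper and the sample file is made up. It only shows that passing an explicit encoding to open makes the read independent of the system locale, which is what a bare open(file) falls back to.

```python
import locale
import os
import tempfile


def read_transcript(path, encoding="utf-8"):
    """Hypothetical helper: read a transcript with an explicit encoding."""
    # Without encoding=..., open() uses the locale's preferred encoding,
    # which can be cp950 on a Traditional Chinese Windows install.
    with open(path, encoding=encoding) as source_file:
        return source_file.read()


# Made-up srt-style sample containing Chinese text.
sample = "1\n00:00:00,000 --> 00:00:01,000\n這是測試\n"

# Write the sample as UTF-8 to a temporary file, then read it back.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".srt", delete=False
) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

text = read_transcript(tmp_path)
os.remove(tmp_path)

# The encoding a bare open() would have used on this machine.
default_encoding = locale.getpreferredencoding(False)
print("locale default:", default_encoding)
```

Printing the locale default is a quick way to check whether a given machine would hit the mismatch described here.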

After I failed to convert the file, I uninstalled the package, so sorry, I can no longer reproduce the problem. Thanks!

patrickschu commented 3 years ago

Could not replicate -- all file operations I could find (grep) use an explicit utf-8 encoding