Character encoding issue with Vosk Transcription output using other languages

wladimirleite commented 4 months ago

A user asked me for help setting up Vosk to process audio files from Russian speakers. Setting the language in AudioTranscriptionConfig.txt to "ru" and downloading the model from https://alphacephei.com/vosk/models, as indicated in the configuration file, worked out of the box.

I downloaded a few public Russian audio files to test, and everything worked fine, except for a small detail: some characters were garbled (something related to character encoding). For example, "скорее исключение чем правило" was shown as "�?корее и�?ключение чем правило".

Tracking down the issue, I found the following line, which "re-encodes" the Vosk output string (the json variable is a String that contains the Vosk output): https://github.com/sepinf-inc/IPED/blob/0b65e3bdbe981628885d746b4b49d6a2149624c3/iped-engine/src/main/java/iped/engine/task/transcript/VoskTranscriptTask.java#L142 @lfcnassif, I guess this was added to work around the "incorrect" way Vosk outputs non-ASCII characters on Windows, right?

Although it works in most cases, it fails for some non-Latin characters (like the Cyrillic "с"). Setting the default encoding to UTF-8 (there are a few ways to do that, like the command line parameter "-Dfile.encoding=UTF8") solved the issue, and there would be no need to re-encode the output JSON. @lfcnassif, do you see any problem with setting this as the default?
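To illustrate why a re-encoding step like this can break only some characters, here is a minimal sketch. It assumes the workaround amounts to encoding the string back to windows-1252 bytes and decoding those bytes as UTF-8; the class and method names (`ReencodeDemo`, `misdecode`, `reencode`, `roundTrip`) are hypothetical and not taken from IPED. The Cyrillic "с" is U+0441, which is the byte pair D1 81 in UTF-8. Byte 0x81 has no mapping in windows-1252, so it is already lost when the native output is decoded with the wrong charset, and no later re-encoding can bring it back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReencodeDemo {
    // Assumption: the platform default charset on the affected Windows machine.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates UTF-8 bytes from a native library being decoded with the wrong charset.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // Hypothetical form of the workaround: turn the characters back into
    // windows-1252 bytes and decode those bytes as UTF-8.
    static String reencode(String misdecoded) {
        return new String(misdecoded.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    static String roundTrip(String original) {
        return reencode(misdecode(original));
    }

    public static void main(String[] args) {
        String ka = "\u043A"; // CYRILLIC SMALL LETTER KA, UTF-8 bytes D0 BA
        String es = "\u0441"; // CYRILLIC SMALL LETTER ES, UTF-8 bytes D1 81
        System.out.println("U+043A survives: " + ka.equals(roundTrip(ka)));
        System.out.println("U+0441 survives: " + es.equals(roundTrip(es)));
    }
}
```

In this sketch only the letter whose UTF-8 encoding contains a byte that windows-1252 leaves undefined is corrupted, which is consistent with "с" being the character that breaks in the example above.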

lfcnassif commented 4 months ago

Hi @wladimirleite!

> @lfcnassif, I guess this was added to work around the "incorrect" way Vosk outputs non-ASCII characters on Windows, right?

I don't remember for sure, but I think so.

> @lfcnassif, do you see any problem with setting this as the default?

Well, since it's a global setting, I think it would be good to review the new String(), new FileReader() and other similar calls throughout the code that use the system default charset under the hood. Actually, we should always avoid those calls when possible...
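As a rough illustration of the kind of replacement meant here, the sketch below passes the charset explicitly instead of relying on the system default. The class and method names (`ExplicitCharsetDemo`, `decode`, `readFirstLine`) are hypothetical and only show the pattern; they are not code from IPED.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetDemo {
    // Instead of new String(bytes), which decodes with the default charset.
    static String decode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Instead of new FileReader(file), which also depends on the default charset.
    static String readFirstLine(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Cyrillic sample text written with escapes so the source stays ASCII-only.
        String text = "\u0441\u043A\u043E\u0440\u0435\u0435";
        Path tmp = Files.createTempFile("charset-demo", ".txt");
        try {
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            System.out.println("reader matches: " + text.equals(readFirstLine(tmp)));
            System.out.println("decode matches: " + text.equals(decode(Files.readAllBytes(tmp))));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With the charset spelled out, the result no longer depends on file.encoding or on the operating system. Since Java 11, FileReader also has constructors that take a Charset, which is another option where the reader type has to stay.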

wladimirleite commented 4 months ago

> Actually, we should always avoid those calls when possible...

Indeed. And if there are such calls, in theory they would be better using UTF-8 than CP-1252.

wladimirleite commented 3 months ago

As far as I understand, the default charset will be UTF-8 in newer JDKs. https://openjdk.org/jeps/400
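JEP 400 makes UTF-8 the default charset starting with JDK 18. A quick way to see what a given installation actually uses is to print the effective default; the sketch below does that, with `DefaultCharsetCheck` and `describe` being hypothetical names chosen for illustration.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Reports the charset used by APIs that do not take an explicit Charset,
    // together with the raw file.encoding system property.
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it with and without "-Dfile.encoding=UTF8" on the same machine shows whether the command line parameter takes effect for that JDK.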

Anyway, I found a dozen or so places in the code that rely on the default charset. Most of them can be changed without any impact. However, there are a few points that would need more changes (or at least more careful tests).

I think it is better to leave this as it is for now.

sepinf-inc / IPED

Character encoding issue with Vosk Transcription output using other languages #2115