Closed: wladimirleite closed this issue 3 months ago
Hi @wladimirleite!
@lfcnassif, I guess this was added to work around the "incorrect" way non-ASCII characters are output by Vosk on Windows, right?
I don't remember for sure, but I think so.
@lfcnassif, do you see any problem setting this as default?
Well, since it's a global setting, I think it would be good to review the new String(), new FileReader() and similar calls throughout the code that use the system default charset under the hood. Actually we should always avoid those calls when possible...
Indeed. And if there are such calls, in theory they would be better using UTF-8 than CP-1252.
As far as I understand, the default charset is UTF-8 in newer JDKs (JDK 18 and later, per JEP 400): https://openjdk.org/jeps/400
Anyway, I found a dozen or so places in the code that rely on the default charset. Most of them can be changed without any impact. However, there are a few points that would need more changes (or at least more careful tests).
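To illustrate the kind of change discussed above, here is a minimal sketch of replacing default-charset calls with explicit UTF-8 ones. The class and method names (ExplicitCharsetExample, decode, encode, readFirstLine) are hypothetical and are not taken from the IPED code base; only standard JDK APIs are used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
public class ExplicitCharsetExample {

    // Instead of new String(bytes), which uses the platform default charset.
    static String decode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Instead of text.getBytes(), which uses the platform default charset.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Instead of new FileReader(file), which uses the platform default charset
    // on JDKs older than 18.
    static String readFirstLine(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }
}
```

With explicit charsets, the result no longer depends on -Dfile.encoding or on the operating system locale, which is why such call sites should be safe to change regardless of the global setting.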
I think it is better to leave this as it is for now.
A user asked me for help setting up Vosk to process audio files from Russian speakers. Setting the language in AudioTranscriptionConfig.txt to "ru" and downloading the model from https://alphacephei.com/vosk/models, as indicated in the configuration file, worked out of the box.
I downloaded a few public Russian audio files to test, and everything worked fine, except for a small detail: some characters were messed up (something related to character encoding). For example: "скорее исключение чем правило" was shown as "�?корее и�?ключение чем правило".
Tracing down the issue, I found the following line, which "re-encodes" the Vosk output string (the json variable is a String, which contains the Vosk output): https://github.com/sepinf-inc/IPED/blob/0b65e3bdbe981628885d746b4b49d6a2149624c3/iped-engine/src/main/java/iped/engine/task/transcript/VoskTranscriptTask.java#L142
@lfcnassif, I guess this was added to work around the "incorrect" way non-ASCII characters are output by Vosk on Windows, right? Although it works in most cases, it fails for some non-Latin characters (like the Cyrillic "с"). Setting the default encoding to UTF-8 (there are a few ways to do that, like the command-line parameter "-Dfile.encoding=UTF8") solved the issue (and there would be no need to re-encode the output JSON). @lfcnassif, do you see any problem setting this as default?
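A minimal sketch of why a re-encoding workaround can fail for some characters. It assumes the Vosk output arrives as UTF-8 bytes that were decoded with the Windows default charset (windows-1252), and that the workaround re-encodes with that same charset before decoding as UTF-8. The class and method names are hypothetical; this is a model of the behavior, not the actual line in VoskTranscriptTask.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class Cp1252RoundTripDemo {

    // Assumption: the platform default charset on the affected Windows machines.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Models UTF-8 bytes being wrongly decoded as windows-1252.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // Models the workaround: encode back with windows-1252, then decode as UTF-8.
    static String reencode(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    // True when the workaround recovers the original text.
    static boolean survives(String original) {
        return reencode(misdecode(original)).equals(original);
    }

    public static void main(String[] args) {
        // Cyrillic "к" (U+043A) is D0 BA in UTF-8; both bytes are defined in windows-1252.
        System.out.println(survives("\u043A"));
        // Cyrillic "с" (U+0441) is D1 81 in UTF-8; byte 0x81 is unassigned in windows-1252.
        System.out.println(survives("\u0441"));
    }
}
```

Since byte 0x81 has no windows-1252 mapping, the misdecoding step loses information that the re-encoding step cannot restore, which is consistent with "с" being the character that shows up garbled in the example above. Decoding the bytes as UTF-8 in the first place avoids the lossy step entirely.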