Open jmtatsch opened 1 year ago
On Windows, I can confirm the UTF-8 output is displayed in the DOS 850 code page. You get:
"se├▒oras y se├▒ores"
instead of:
"señoras y señores"
"├▒" (in DOS 850) === "ñ" (in UTF-8)
Thanks for your work.
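To illustrate the mismatch: "ñ" is the two bytes 0xC3 0xB1 in UTF-8, and a console set to code page 850 draws those bytes as "├" and "▒". Below is a minimal sketch for checking this. The helper names `hex_bytes` and `enable_utf8_console` are made up for illustration and are not part of talk-llama. `SetConsoleOutputCP` is the Win32 call for changing the console output code page; whether switching it is enough depends on the terminal and font in use.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: render the raw bytes of a string as lowercase hex,
// separated by spaces, so the UTF-8 encoding can be inspected.
std::string hex_bytes(const std::string & s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (i > 0) {
            out.push_back(' ');
        }
        out.push_back(digits[c >> 4]);
        out.push_back(digits[c & 0x0F]);
    }
    return out;
}

// Hypothetical helper: ask the Windows console to interpret output as UTF-8.
// On other platforms this is a no-op that reports success.
bool enable_utf8_console() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}
```

Calling `hex_bytes` on the UTF-8 string "ñ" gives `c3 b1`, which is the byte pair that code page 850 shows as "├▒". Running `chcp 65001` in the console before starting the program is the command-line counterpart of `enable_utf8_console()`.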
I believe this is caused by this line of code in talk-llama.cpp:
// remove all characters, except for letters, numbers, punctuation and ':', '\'', '-', ' '
text_heard = std::regex_replace(text_heard, std::regex("[^a-zA-Z0-9\\.,\\?!\\s\\:\\'\\-]"), "");
I suspect it would need to be extended to allow non-ASCII letters to solve this issue.
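One possible direction, as a rough sketch rather than the actual fix: `std::regex` over a `std::string` works byte by byte, so the negated class above also matches each byte of a multi-byte UTF-8 character and removes it. A hand-written filter can keep the same ASCII whitelist and pass every byte >= 0x80 through untouched. The function name `filter_heard_text` is hypothetical, and the sketch assumes the input is valid UTF-8.

```cpp
#include <string>

// Hypothetical replacement for the regex filter: keeps ASCII letters, digits,
// whitespace and the punctuation . , ? ! : ' - exactly like the original
// character class, but also keeps all bytes >= 0x80 so that multi-byte UTF-8
// characters such as "ñ" or "ü" survive. Assumes valid UTF-8 input.
std::string filter_heard_text(const std::string & text) {
    static const std::string allowed_punct = ".,?!:'-";
    std::string out;
    out.reserve(text.size());
    for (const char ch : text) {
        const unsigned char c = static_cast<unsigned char>(ch);
        const bool is_utf8_multibyte = c >= 0x80;
        const bool is_letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        const bool is_digit  = c >= '0' && c <= '9';
        const bool is_space  = c == ' ' || (c >= '\t' && c <= '\r'); // same set as \s
        const bool is_punct  = allowed_punct.find(ch) != std::string::npos;
        if (is_utf8_multibyte || is_letter || is_digit || is_space || is_punct) {
            out.push_back(ch);
        }
    }
    return out;
}
```

With this filter, "Me gustan [los] jalapeños*" becomes "Me gustan los jalapeños": the brackets and the asterisk are dropped, while the ñ is kept. The trade-off is that non-ASCII punctuation and symbols also pass through unfiltered.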
Yes, that is indeed the culprit. After commenting it out it works as expected.
If I understand correctly, that regex is there to filter out unprintable characters.
Maybe that could be done more elegantly with something like \p{C}. I couldn't get it to work with C++ though.
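For context, the ECMAScript grammar used by `std::regex` has no Unicode property classes, which would explain why `\p{C}` does not work there. A partial approximation can be written by hand. The sketch below only covers the control characters (category Cc: U+0000 to U+001F, U+007F, and U+0080 to U+009F), not the rest of `\p{C}` such as format or unassigned code points. The name `strip_control_chars` is hypothetical, and valid UTF-8 input is assumed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: removes Unicode control characters (category Cc) from
// a UTF-8 string. ASCII controls are single bytes below 0x20 plus 0x7F; the
// C1 controls U+0080..U+009F are encoded in UTF-8 as 0xC2 followed by
// 0x80..0x9F. Everything else is copied unchanged. Assumes valid UTF-8 input.
std::string strip_control_chars(const std::string & text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (c < 0x20 || c == 0x7F) {
            continue; // ASCII control character
        }
        if (c == 0xC2 && i + 1 < text.size()) {
            const unsigned char next = static_cast<unsigned char>(text[i + 1]);
            if (next >= 0x80 && next <= 0x9F) {
                ++i;      // skip the continuation byte as well
                continue; // C1 control character
            }
        }
        out.push_back(text[i]);
    }
    return out;
}
```

Note that this also drops tabs and newlines, since they are control characters too. If those should be kept, they would need an explicit exception, which the original regex gets from `\s`.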
When using talk-llama with -l de, I noticed that the German umlauts and eszett (ÄÖÜß) are not transcribed properly.
This should have been transcribed as:
Human: Der Führer führt überhaupt nicht gut.
This probably affects most Romance languages as well, since they use some special characters.
For Spanish, the ñ is also missing:
Human: Me gustan las jalapeos.
Maybe UTF-8 support is missing somewhere?