ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Special characters ÄÖÜßñ are not transcribed properly #923

Open jmtatsch opened 1 year ago

jmtatsch commented 1 year ago

When using talk-llama with -l de, I noticed that the German umlauts and ß (ÄÖÜß) are not transcribed properly:

main : done! start speaking in the microphone

Human: Der Fhrer fhrt berhaupt nicht gut.
LLaMA:

This should have been transcribed as: Human: Der Führer führt überhaupt nicht gut.

This probably matters for most languages with non-ASCII letters, including the Romance languages.

For Spanish, the ñ is also missing: Human: Me gustan las jalapeos.

Maybe UTF-8 support is missing somewhere?

j0t4 commented 1 year ago

On Windows, I can confirm that the output is displayed using the DOS 850 code page. You get:

"se├▒oras y se├▒ores"

instead of:

"señoras y señores"

"├▒" (the bytes 0xC3 0xB1 shown in DOS 850) === "ñ" (the same bytes read as UTF-8)
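If the bytes are already valid UTF-8 and only the console decodes them with the wrong code page, one possible workaround is to switch the console output code page to UTF-8 before printing. The following is a minimal sketch, not something taken from whisper.cpp: `enable_utf8_console_output` is a hypothetical helper name, and the sketch assumes the terminal font can display the characters. `SetConsoleOutputCP` and `CP_UTF8` come from the Windows API.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not part of whisper.cpp): ask the Windows console to
// decode program output as UTF-8 instead of the active OEM code page (e.g. 850).
// On non-Windows platforms this is a no-op that reports success.
bool enable_utf8_console_output() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}
```

Calling it once at the start of `main`, before any text is printed, should be enough. Running `chcp 65001` in the console before starting the program has a similar effect without code changes.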

Thanks for your work.

petterreinholdtsen commented 7 months ago

I believe this is caused by this code line in talk-llama.cpp:

// remove all characters, except for letters, numbers, punctuation and ':', '\'', '-', ' '
text_heard = std::regex_replace(text_heard, std::regex("[^a-zA-Z0-9\\.,\\?!\\s\\:\\'\\-]"), "");

I suspect it would need to be extended with non-ASCII letters to solve this issue.
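To illustrate why that line drops umlauts: `std::regex` on a `std::string` works byte by byte, and in UTF-8 a letter such as "ü" is the two bytes 0xC3 0xBC, neither of which is in the allowed ASCII class. The sketch below reproduces the quoted filter and compares it with one possible alternative that keeps every byte at or above 0x80. Both function names are hypothetical, and the alternative is an assumption about how the filter could be extended, not the fix adopted by the project.

```cpp
#include <cctype>
#include <regex>
#include <string>

// The filter quoted above, unchanged: every byte outside the ASCII
// allow-list is removed, including both bytes of a UTF-8 encoded "ü".
std::string strip_ascii_only(const std::string & text) {
    static const std::regex re("[^a-zA-Z0-9\\.,\\?!\\s\\:\\'\\-]");
    return std::regex_replace(text, re, "");
}

// Hypothetical alternative: same ASCII allow-list, but every byte >= 0x80
// is kept, so multi-byte UTF-8 sequences pass through untouched.
std::string strip_keep_utf8(const std::string & text) {
    static const std::string allowed_punct = ".,?!:'-";
    std::string out;
    out.reserve(text.size());
    for (const char ch : text) {
        const unsigned char c = static_cast<unsigned char>(ch);
        const bool keep =
            c >= 0x80 ||                         // part of a UTF-8 multi-byte sequence
            std::isalnum(c) || std::isspace(c) || // ASCII letters, digits, whitespace
            allowed_punct.find(ch) != std::string::npos;
        if (keep) {
            out.push_back(ch);
        }
    }
    return out;
}
```

With the input "Der Führer", the first function returns "Der Fhrer" and the second returns the input unchanged. The trade-off is that the second version also lets through non-ASCII characters that are not letters, such as emoji or typographic symbols.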

jmtatsch commented 7 months ago

Yes, that is indeed the culprit. After commenting it out, it works as expected.

If I understand correctly, that regex is there to filter out unprintable characters. Maybe that could be done more elegantly with something like \p{C}. I couldn't get it to work in C++, though.
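A likely reason \p{C} does not work is that `std::regex` has no Unicode property escapes. A rough approximation without regex is to walk the bytes and drop only control characters. The sketch below is an assumption about what such a filter could look like: `remove_control_chars` is a hypothetical name, and it covers only the C0 and C1 control ranges plus DEL, not the full \p{C} category (format characters, unassigned code points, and so on are left alone).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of whisper.cpp): drop control characters from
// UTF-8 text while leaving letters such as ä, ö, ü, ß and ñ untouched.
std::string remove_control_chars(const std::string & in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        // C0 controls (U+0000..U+001F) and DEL (U+007F) are single bytes;
        // tab, newline and carriage return are kept as ordinary whitespace.
        if (c < 0x20 || c == 0x7F) {
            if (c == '\t' || c == '\n' || c == '\r') {
                out.push_back(in[i]);
            }
            continue;
        }
        // C1 controls (U+0080..U+009F) are encoded as 0xC2 0x80..0x9F.
        if (c == 0xC2 && i + 1 < in.size()) {
            const unsigned char next = static_cast<unsigned char>(in[i + 1]);
            if (next >= 0x80 && next <= 0x9F) {
                ++i; // skip the continuation byte as well
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}
```

Unlike the original regex, this keeps ASCII symbols such as "#" or "@", so it is a narrower filter; whether that is acceptable depends on what the text is passed to afterwards.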