tazz4843 / whisper-rs

Rust bindings to https://github.com/ggerganov/whisper.cpp
The Unlicense

Invalid utf-8 #115

Closed thewh1teagle closed 2 months ago

thewh1teagle commented 5 months ago

The song in Hebrew: muminim hebrew.zip

tazz4843 commented 5 months ago

This is an upstream issue, not something we can control. I run into this myself with my own services, and I just log it and ignore the output.

Doing some digging I found the following: https://github.com/ggerganov/whisper.cpp/issues/1098 https://github.com/ggerganov/whisper.cpp/pull/1118

thewh1teagle commented 5 months ago

@tazz4843 How can I ignore the errors and keep whatever was transcribed successfully? Or does it simply not work at all for some languages? I can't transcribe some languages at all.

thewh1teagle commented 5 months ago

I checked whisper.cpp with its CLI example. The issue shows up there too, but only in the terminal. If I write the output of whisper.cpp to a file, it works well, so I think it's still an encoding issue in whisper-rs. It happens here: whisper_state.rs#L481

tazz4843 commented 5 months ago

We don't do anything with the string, so this would have to be a bug in Rust's std string library, and there's essentially no chance of that. This means whisper.cpp must be returning an invalid UTF-8 string. We could return the raw bytes on error, but those are of little use if they can't be parsed, unless you want to parse only up to the index where validation fails (which would be a valid use case, and if you want this added I can do so).
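For reference, "parse only up to the index where it fails" can be sketched with the standard library alone. This is a minimal illustration, not code from whisper-rs: `valid_prefix` is a hypothetical helper name, and it assumes the caller already has the raw bytes as a `&[u8]`.

```rust
use std::str;

/// Returns the longest prefix of `bytes` that is valid UTF-8.
/// `valid_prefix` is a hypothetical helper, not part of whisper-rs.
fn valid_prefix(bytes: &[u8]) -> &str {
    match str::from_utf8(bytes) {
        // The whole buffer is valid, so return it unchanged.
        Ok(s) => s,
        // `valid_up_to` is the index of the first byte that failed validation,
        // so everything before that index is known to be valid UTF-8.
        Err(e) => str::from_utf8(&bytes[..e.valid_up_to()]).unwrap_or(""),
    }
}
```

Everything after the first invalid byte is dropped with this approach, so it suits cases where a truncated transcript is acceptable.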

magnus-ISU commented 3 months ago

UTF-8 is designed specifically to be able to recover from invalid strings, right?

You could discard whatever is invalid (seems best to me); or, as this crate does (I think; it is dense and I didn't care to verify after glancing at the code), return invalid codepoints as the valid UTF-8 they would have been had their prefixes been right.

0xxxxxxx -> great, we're back to ASCII, continue
10xxxxxx -> crap, invalid
110xxxxx -> great, back to valid input
10xxxxxx -> end of the last char
10xxxxxx -> invalid
11110xxx -> start of 4 byte char
11xxxxxx -> invalid
11110xxx -> start of 4 byte char
10xxxxxx
10xxxxxx
10xxxxxx -> end of valid 4 byte char

You could still parse 0xxxxxxx 110xxxxx 10xxxxxx 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx out of there, and assuming what you had was one invalid codepoint and a ton of crap, it will probably be fine.
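The "discard whatever is invalid" option described above can be sketched in Rust roughly as follows. This is an illustration under assumptions, not what whisper-rs does: `decode_dropping_invalid` is a hypothetical name, and the loop relies only on `std::str::from_utf8` together with `Utf8Error::valid_up_to` and `Utf8Error::error_len`.

```rust
use std::str;

/// Decodes `bytes`, silently dropping every byte sequence that is not valid UTF-8.
/// `decode_dropping_invalid` is a hypothetical helper, not part of whisper-rs.
fn decode_dropping_invalid(mut bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    loop {
        match str::from_utf8(bytes) {
            // The remainder is valid: append it and finish.
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(e) => {
                // Keep the part that validated before the error.
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                out.push_str(str::from_utf8(valid).unwrap_or(""));
                match e.error_len() {
                    // Skip the invalid sequence and resume right after it.
                    Some(len) => bytes = &rest[len..],
                    // `None` means the input ended in the middle of a character.
                    None => return out,
                }
            }
        }
    }
}
```

For a byte sequence shaped like the one above, for example `[0x41, 0x80, 0xC3, 0xA9, 0x80, 0xF0, 0xF0, 0x9F, 0x98, 0x80]`, this keeps the ASCII byte, the 2-byte character, and the final 4-byte character, and drops the stray continuation and lead bytes.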

tazz4843 commented 3 months ago

There is String::from_utf8_lossy for that, which does throw away information to get a valid UTF-8 string: each invalid sequence is replaced with U+FFFD (the replacement character).
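A short sketch of the lossy conversion, for anyone who wants to apply it to raw bytes themselves. The helper names `bytes_to_string_lossy` and `c_str_to_string_lossy` are hypothetical; the `CStr` variant assumes the text arrives as a NUL-terminated C string, which may or may not match how a given binding exposes it.

```rust
use std::ffi::CStr;

/// Converts possibly-invalid UTF-8 bytes into an owned `String`.
/// Each invalid sequence becomes U+FFFD. Hypothetical helper name.
fn bytes_to_string_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Same idea for a NUL-terminated C string (assumed input shape).
fn c_str_to_string_lossy(c: &CStr) -> String {
    c.to_string_lossy().into_owned()
}
```

Both calls never fail, so the caller always gets a transcript, at the cost of replacement characters wherever the bytes were invalid.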

thewh1teagle commented 2 months ago

I still experience this issue, and I'm not sure whether it's in my control or whether whisper-rs needs to be changed: https://github.com/thewh1teagle/vibe/issues/34 Can I ignore these UTF-8 errors?

tazz4843 commented 2 months ago

Remind me in a few days and I can add a function to infallibly convert.

thewh1teagle commented 2 months ago

Hey, just a reminder. Many people have opened issues related to this in vibe/issues, so I hope we can solve it. I think it's better to receive some invalid characters than to fail the whole transcription.

tazz4843 commented 2 months ago

Should be solved in f4ea0d97e48fb12f97755b4fbc2813670e177afc

thewh1teagle commented 2 months ago

> Should be solved in f4ea0d9

Thanks. I wasn't able to use it, but it helped me understand where the problem is, so I added github.com/thewh1teagle/whisper-rs/ee93930 and it looks like that fixed the issue (I don't even see invalid characters). I can create a PR from that if you want :)