Hi :)
Thank you.
Could you try installing this fork of transformers:

`pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper`
and run it with that? That works at least for me.
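In case it helps, running the example against the fork looks roughly like this (a minimal sketch of the README-style `transformers` usage; the model id `nyrahealth/CrisperWhisper` and the audio path are placeholders to adapt to your setup):

```python
# Minimal sketch: CrisperWhisper via the transformers ASR pipeline.
# Assumes the forked transformers from above is installed and that
# "nyrahealth/CrisperWhisper" / "audio.wav" match your setup.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

model_id = "nyrahealth/CrisperWhisper"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,          # long audio is processed in 30 s chunks
    return_timestamps="word",   # word-level timestamps
    torch_dtype=torch_dtype,
    device=device,
)

result = asr("audio.wav")
print(result["text"])
```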
For your second question: the model is generally trained to follow the primary speaker, meaning the training data contained many audio clips with overlapping speech where only the main speaker (usually the louder one) is transcribed, because that is what we want in our production setting. Your audio, however, starts out with two different speakers talking. My guess is that the model gets confused about who the main speaker is within the first 30-second chunk: it starts transcribing the woman, assumes she is the main speaker, leaves out the man, and only starts transcribing him again later. I am currently upgrading CrisperWhisper in general, making it more verbatim with even better timestamps, and training a version that is supposed to transcribe "everything" instead of focusing on the main speaker.
Hi,
thanks, I installed the fork and now everything works fine! 🙂 Regarding the second question: very interesting! I stumbled upon CrisperWhisper while looking for a way to mitigate the normalizations introduced by Whisper. A CrisperWhisper that works for multiple overlapping speakers would be very useful indeed. Can you already say when the new version will be released?
Glad this fixed your issue. I would guess in around 6-8 weeks.
Hi!
Thank you for your efforts; your model looks very promising! I ran into some problems using your `transformers` example as well as the `faster-whisper` implementation and would be happy if you could help me out.

**transformers**

Whenever I run your example code in Google Colab on an A100 with my test files, the result is an IndexError; I tested 3 files that run with `faster-whisper` on "large-v2" with no problem. If you want to reproduce the error, find the files here: https://file.io/JqOvjkfeADNK

**CrisperWhisper + faster-whisper**

I could get this implementation running for my test files, but the transcription omits some text chunks that `faster-whisper` with "large-v2" does not. After the initial question "Uni Mannheim, sehr schwer gute Noten zu erreichen?" a whole chunk is missing that is present if I transcribe the file with `faster-whisper` and "large-v2".

Do you have any ideas why the `faster-whisper` implementation is losing those chunks? Any hints on resolving the IndexError for the `transformers` implementation? Thanks! 🙂
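For reference, the `faster-whisper` baseline I compare against looks roughly like this (a minimal sketch; `audio.wav` stands in for one of the test files):

```python
# Minimal sketch of the faster-whisper baseline used for comparison;
# "audio.wav" is a placeholder for one of the test files.
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", word_timestamps=True)

# transcribe() returns a lazy generator; iterating runs the transcription
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```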