jasonppy / PromptingWhisper

Prompting Whisper for Audio-Visual Speech Recognition, Code-Switched Speech Recognition, and Zero-Shot Speech Translation

Prompt makes more segments silence #4

Closed conan1024hao closed 10 months ago

conan1024hao commented 10 months ago

Hi, thank you for your interesting work.

I am trying to apply your method to a custom AVSR dataset.

After applying a prompt (about ten words, of the form word1, word2, ..., word10), Whisper did recognize some words more accurately; however, the WER roughly doubled. I found that some segments, which Whisper recognized correctly before the prompt was applied, are now transcribed as complete silence, which is what drove the WER up. Possibly relevant discussion: https://github.com/openai/whisper/discussions/1594
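For context, my prompting setup looks roughly like this (a minimal sketch; `build_prompt` is a hypothetical helper, and `initial_prompt` is the standard openai-whisper `transcribe` argument):

```python
def build_prompt(words):
    """Join context words into a comma-separated prompt string,
    i.e. "word1, word2, ..., word10"."""
    return ", ".join(words)


prompt = build_prompt(["word1", "word2", "word3"])

# Actual transcription (requires the openai-whisper package and an audio file):
# import whisper
# model = whisper.load_model("base")
# result = model.transcribe("utt.wav", initial_prompt=prompt)
```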

Do you have any insights into this? Thank you.

jasonppy commented 10 months ago

I haven't seen this in my experiments. Have you tried using a varying number of words (for example, more than 10 words in the prompt) to see if the issue persists?

conan1024hao commented 10 months ago

> I haven't seen this in my experiment. Have you tried using varying amount of words (for example using more than 10 words in the prompt), and see if the issue still exists.

I have tried varying the number of words from 5 to 30; in general, more words make the results worse. In the paper, you said that even 90 words will not hurt results, which is really surprising. There may be two reasons for this.

Anyway, thank you for your reply. I'll update here if I make any new developments.

jasonppy commented 10 months ago

Very interesting!

  1. Since you mentioned that you are not using OCR, one way to isolate the issue is to use the ground-truth text as the prompt.
  2. I'm not sure I fully understand what "separately audio" means. Do you mean there are a lot of silences within one utterance? If so, Whisper might indeed have trouble with that. One thing you could try is to connect the words in your original prompt into sentences using a template, like the approach used in Socratic Models (see Fig. 3, left, for an example); Whisper may do better if the prompt is a complete sentence.
  3. Another thing you could try is to fine-tune Whisper for ASR with prompts (similar to instruction fine-tuning in NLP).

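Suggestion 2 might look something like this (a sketch only; the template wording is hypothetical, in the spirit of Socratic Models-style prompting):

```python
def template_prompt(words):
    """Turn isolated prompt words into one fluent sentence, so the prompt
    reads as natural text rather than a bare comma-separated list."""
    if len(words) == 1:
        return f"This talk mentions {words[0]}."
    return "This talk mentions " + ", ".join(words[:-1]) + " and " + words[-1] + "."
```

The resulting sentence would then be passed to Whisper in place of the raw word list.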
conan1024hao commented 10 months ago

@jasonppy

My "separately audio" refers to cases like:

utt1 Hi my name is David, I like
utt2 eating apple.

The silence issue was solved after merging these separate utterances into one utterance!
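The merging step could be sketched as follows (a minimal illustration; the `(start, end, text)` utterance tuples and the `max_gap` threshold are hypothetical, not from the thread):

```python
def merge_utterances(utts, max_gap=0.5):
    """Merge consecutive utterances whose inter-utterance gap is at most
    max_gap seconds. Each utterance is a (start, end, text) tuple."""
    merged = []
    for start, end, text in utts:
        if merged and start - merged[-1][1] <= max_gap:
            # Gap is small: extend the previous utterance with this one.
            prev_start, _, prev_text = merged[-1]
            merged[-1] = (prev_start, end, prev_text + " " + text)
        else:
            merged.append((start, end, text))
    return merged


# The example from this thread: utt1 and utt2 become a single utterance.
utts = [
    (0.0, 2.0, "Hi my name is David, I like"),
    (2.1, 3.0, "eating apple."),
]
merged = merge_utterances(utts)
# merged is [(0.0, 3.0, "Hi my name is David, I like eating apple.")]
```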