conan1024hao closed this issue 10 months ago
I haven't seen this in my experiments. Have you tried varying the number of words (for example, using more than 10 words in the prompt) to see if the issue still exists?
I have tried varying the number of words from 5 to 30; basically, more words make the results worse. In the paper you report that even 90 words will not hurt the results, which is really surprising. There may be two reasons.
Anyway, thank you for your reply. I'll post an update here if I make any progress.
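For context, here is a minimal sketch of the kind of prompt-length sweep described above, assuming the openai-whisper package and jiwer for scoring; the audio path, reference transcript, and word list are hypothetical placeholders.

```python
import whisper
from jiwer import wer

model = whisper.load_model("base")

reference = "the full reference transcript of this utterance"   # placeholder
prompt_words = [f"word{i}" for i in range(1, 31)]                # hypothetical prompt vocabulary

# Transcribe the same utterance with prompts of increasing length and
# compare the resulting word error rates.
for n in (5, 10, 20, 30):
    prompt = ", ".join(prompt_words[:n])
    result = model.transcribe("utterance.wav", initial_prompt=prompt)
    print(n, wer(reference, result["text"]))
```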
Very interesting!
@jasonppy
My "separately audio" indicates cases like:
utt1 Hi my name is David, I like
utt2 eating apple.
The silence issue was solved after merging these separated utterances into one utterance!
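For anyone hitting the same problem, a rough sketch of the merging step I mean, assuming soundfile/numpy; the file names are placeholders.

```python
import numpy as np
import soundfile as sf

parts = []
sample_rate = None
for path in ["utt1.wav", "utt2.wav"]:   # the two halves of the split sentence
    audio, sr = sf.read(path)
    assert sample_rate is None or sr == sample_rate, "sample rates must match"
    sample_rate = sr
    parts.append(audio)

# Write the merged utterance and feed this single file to Whisper instead of
# the two separate pieces.
sf.write("utt_merged.wav", np.concatenate(parts), sample_rate)
```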
Hi, thank you for your interesting work.
I am trying to apply your method to a custom AVSR dataset.
After applying a prompt (about ten words, of the form "word1, word2, ..., word10"), Whisper did recognize some words more correctly; however, the overall WER roughly doubled. I found that some segments, which Whisper recognized correctly before applying the prompt, now come out as complete silence (empty transcripts), which is what drives the WER up so much. Possibly relevant discussion: https://github.com/openai/whisper/discussions/1594

Do you have any insights on this? Thank you.
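For reference, a minimal sketch of the setup described above, assuming the openai-whisper package; the audio path and prompt words are placeholders.

```python
import whisper

model = whisper.load_model("base")
prompt = "word1, word2, word3, word4, word5, word6, word7, word8, word9, word10"

baseline = model.transcribe("segment.wav")                          # no prompt
prompted = model.transcribe("segment.wav", initial_prompt=prompt)   # with prompt

# In the failing cases the prompted run returns an empty transcript for
# segments the baseline run recognized correctly.
print("baseline:", baseline["text"])
print("prompted:", prompted["text"])
```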