cc @kamilakesbi as well! @jaggzh we are going to need an audio file we can work with together, and if you could reduce the reproducer to a minimal amount of custom code, that would be great!
Hey @jaggzh - thanks for reporting. This is actually the intended behaviour with Whisper. To understand why, recall that Whisper predicts the distribution over the next token $y_{i}$ conditioned on all previous tokens $\boldsymbol{y}_{0:i-1}$:

$$ y_{i} \sim P\left(y \mid \boldsymbol{y}_{0:i-1}\right) $$
When we decode without timestamps, we generate sequences with the following format:
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> The cat sat on the mat.<|endoftranscript|>
Note the task token at index 4: the `<|notimestamps|>` token indicates to the model that it should not predict timestamps.

To decode with timestamps, we ensure that `<|notimestamps|>` is not generated at position 4, which triggers the model to predict with timestamp tokens:
<|startoftranscript|> <|en|> <|transcribe|> <|0.00|> The cat sat on the mat.<|4.22|><|endoftranscript|>
=> we can see here that the sequence of token ids changes in two ways:

1. We remove the `<|notimestamps|>` token from position 4.
2. We predict timestamp tokens as part of the generated sequence (in this example, `<|0.00|>` and `<|4.22|>`).

The key here is understanding that these timestamp tokens are predicted in the same way as the text tokens: auto-regressively, based on the conditional probability distribution over previous tokens. Since the sequence of token ids $\boldsymbol{y}_{0:i-1}$ changes, the predictions for token $y_{i}$ also change (by nature of the conditional probability distribution that we predict). Therefore, it's possible that the generations with timestamps differ from those without timestamps.
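To make this concrete, here's a minimal sketch that compares the two generations side by side; the `openai/whisper-tiny` checkpoint and the dummy LibriSpeech sample are placeholders for illustration, not from this issue:

```python
# Minimal sketch: compare Whisper generations with and without timestamp decoding.
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")

with torch.no_grad():
    ids_plain = model.generate(inputs.input_features)                       # prompt keeps <|notimestamps|>
    ids_ts = model.generate(inputs.input_features, return_timestamps=True)  # timestamp tokens generated instead

# decode_with_timestamps=True renders the <|x.xx|> tokens so the two sequences can be compared
print(processor.tokenizer.decode(ids_plain[0], decode_with_timestamps=True))
print(processor.tokenizer.decode(ids_ts[0], decode_with_timestamps=True))
```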
Generally, what we observe is that enabling timestamps gives less accurate transcriptions for short-form audio, and more accurate transcriptions for long-form audio (whether you're using the chunked or sequential decoding algorithms).

Closing the issue since it is in fact the intended behaviour from Whisper, but happy to answer any follow-up questions you have! Feel free to post on this comment thread 🤗
> 2. We predict timestamp tokens as part of the generated sequence (in this example, `<|0.00|>` and `<|4.22|>`). The key here is understanding that these timestamp tokens are predicted in the same way as the text tokens: auto-regressively based on the conditional probability distribution over previous tokens.
Thank you so much for the extremely helpful and detailed explanation! Are the timestamp tokens then generated during training as well, and are they actually at `{.02f}` precision? (I'm dealing with short-form audio, from maybe 0.3 to 6 s max, and very few samples reach above 1 s; it's for someone with speech issues.) If I were to give up precision in the timestamps, like `.1f`, it might give the model less variation in the timestamp tokens and an easier time learning them accurately [I'm thinking]. My main goal isn't the accuracy of the timestamps (they can be rough) but rather not damaging the transcription accuracy [much] in the process.
Nevertheless, since it's short-form disjoint speech, I began working on a project that does some nice automatic breaking up of audio with auto-calibrated silence detection. That module operates as a generator function, returning each clip and its time offset, so I can use it in different projects (including my data prep or prediction code). Thus, with such short utterances, I'm able to get the timestamp of each clip, and that'll be sufficient for my needs.
(It's not on topic, but in case anyone's interested (not that they'll see this closed issue), you can find it here: https://gist.github.com/jaggzh/e9a5b31afc218b8d44fd5ddb976c8c96. If run directly, it accepts an audio file for testing one's settings, but I didn't incorporate arg parsing, so one has to modify the code to evaluate them.)
It handles evaluating a provided audio file (file only right now; it can't yet be used on a live audio stream). It examines a requested number of seconds of audio (the chunk) and, within that, small examination windows, taking each window's max amplitude. It considers the lowest of those as the noise floor. It then takes the max it heard (discarding some, per maxamp_discard_frac) and uses a fraction of the range between the floor and that max as the acceptable signal (voice) level.
The purpose was to adjust automatically, instead of using the fixed dB thresholds of many solutions I found.
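For anyone curious, the core of that auto-calibrated threshold idea can be sketched in a few lines of numpy; the parameter names here (`win_s`, `signal_frac`, `maxamp_discard_frac`) follow the description above rather than the gist itself, so treat it as an approximation:

```python
# Rough sketch of the auto-calibrated silence threshold described above (not the gist's code).
import numpy as np

def auto_threshold(audio, sr, win_s=0.05, signal_frac=0.35, maxamp_discard_frac=0.02):
    """audio: 1-D float array; returns an amplitude above which we treat the signal as voice."""
    win = max(1, int(win_s * sr))
    n_wins = len(audio) // win
    # max absolute amplitude within each small examination window
    maxamps = np.abs(audio[: n_wins * win]).reshape(n_wins, win).max(axis=1)
    floor = maxamps.min()  # quietest window ~ noise floor
    # discard a fraction of the loudest windows before taking the reference max
    kept = np.sort(maxamps)[: max(1, int(round(n_wins * (1 - maxamp_discard_frac))))]
    peak = kept[-1]
    # accept a fraction of the floor-to-peak range as the signal (voice) level
    return floor + signal_frac * (peak - floor)
```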
If plotting, it ends up using my non-breaking key module (kbnb) -- that import can just be left out if not using it. Otherwise that's included in the gist, along with bansi.py for some perdy colors also used in the plotting.
In any case, it's also a good example of matplotlib running and updating its window in the bg, non-blocking. :)
I have a new idea, since timestamps are useful, and accuracy is useful. If we remove [some] timestamps from the recurrent context, while maintaining them in the output to the user, we might be able to maintain accuracy. I'm not very well aware of the caching mechanisms involved, but the idea would be something like this (a rough code sketch follows the two variations below):

Input audio: "Hello world"

Starting context:
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>

[Model predicts "Hello" token with high accuracy (and is therefore at <|0.00|>)]
-> <|startoftranscript|> <|en|> <|transcribe|> Hello
[We insert <|0.00|>]

Recurrent context:
<|startoftranscript|> <|en|> <|transcribe|> <|0.00|>Hello

[Model predicts "<|0.04|>"]
-> <|startoftranscript|> <|en|> <|transcribe|> Hello<|0.04|>
[If it was a timestamp token, we keep it for the 'context-to-user', but strip it from the recurrent context:]

Recurrent context:
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> Hello

[Model predicts " world" (high accuracy)]
-> <|startoftranscript|> <|en|> <|transcribe|> Hello world

Recurrent context:
<|startoftranscript|> <|en|> <|transcribe|> <|0.00|>Hello
OR:
Recurrent context:
<|startoftranscript|> <|en|> <|transcribe|> Hello<|0.04|> world

(I'm not sure how happy the model will be without the initial <|0.00|>.)
Two possible variations:
1. By using dynamic stripping, we can choose, on each pass, which timestamp tokens we keep, the idea being that the attention head(s) can match up enough of the audio features to transcription tokens to maintain next-token accuracy. When we expect a timestamp token, we can include a prior timestamp closer to the last token.
2. We could also attempt to force a timestamp or text-token prediction, as needed, with prefix_allowed_tokens_fn, for example. But this could either be optional or an experimental part of the algorithm, or come with an adjustable token spacing.
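Here's a very rough, hand-rolled greedy loop sketching the stripping idea. This is not an existing transformers feature; the checkpoint, prompt tokens, and helper name are illustrative only, and (as the next reply notes) feeding back ids without timestamps may well hurt the predictions:

```python
# Hypothetical sketch of the "strip timestamps from the recurrent context" idea.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
tok = processor.tokenizer

def greedy_strip_timestamps(input_features, max_new_tokens=64):
    # timestamp token ids start immediately after <|notimestamps|> (standard convention)
    timestamp_begin = tok.convert_tokens_to_ids("<|notimestamps|>") + 1
    prompt = tok.convert_tokens_to_ids(["<|startoftranscript|>", "<|en|>", "<|transcribe|>"])
    display = list(prompt)    # 'context-to-user': timestamps kept
    recurrent = list(prompt)  # fed back to the model: timestamps stripped
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_features=input_features,
                           decoder_input_ids=torch.tensor([recurrent])).logits
        next_id = int(logits[0, -1].argmax())
        display.append(next_id)
        if next_id < timestamp_begin:  # only text/special tokens re-enter the context
            recurrent.append(next_id)
        if next_id == tok.eos_token_id:
            break
    return display
```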
@sanchit-gandhi @ArthurZucker @younesbelkada
That's correct @jaggzh - the model is trained to predict timestamps at 0.02 s precision (quantised to the nearest 20 ms). See page 3 of the Whisper paper for details.
Changing the precision of the timestamps is unlikely to get you any improvements in transcription accuracy. In fact, you risk potentially lower timestamp accuracy as you generate, since you deviate away from the most probable predictions.
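As a quick check of that granularity, the timestamp vocabulary can be enumerated directly from the tokenizer; the id arithmetic below assumes the standard convention that timestamp ids start right after `<|notimestamps|>`:

```python
# Sketch: Whisper's timestamp tokens sit on a 20 ms (0.02 s) grid from <|0.00|> to <|30.00|>.
from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
timestamp_begin = tok.convert_tokens_to_ids("<|notimestamps|>") + 1  # first timestamp id

for offset in range(5):
    print(timestamp_begin + offset, "->", f"<|{offset * 0.02:.2f}|>")  # <|0.00|>, <|0.02|>, <|0.04|>, ...
```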
Regarding modifying the decoding algorithm: if you want to be able to predict timestamps at index `i`, then you need to have predicted timestamps for indices `0:i-1`. If you pass in previous ids that don't have timestamps, I'm pretty sure you'll mess up the predictions for index `i`.
One option to try and get the best of both worlds is what they do in Whisper-X - use Whisper for the transcriptions, but wav2vec2 for the timestamps.
I'm so sorry -- I'm referring to maintaining the accuracy of the transcription, not changing the timestamp accuracy. The idea is to use a pass with the `<|notimestamps|>` token, likely with a dynamic context (stripping prior timestamps so as not to confuse the model with timestamps in the context), and then passes WITH the request for timestamps, with timestamps included (or possibly partial timestamps included; this would need to be tested, and may vary based on whether the user is aiming for high-resolution, word-based timestamps or not).
System Info
`transformers` version: 4.40.2

Who can help?
@sanchit-gandhi @ArthurZucker @younesbelkada
Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
- Generate with `return_timestamps=True` in the `generate()` call
- Generate again without `return_timestamps=True` and compare the transcriptions
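A stand-in reproducer (not the author's script) can be reduced to a few lines with the ASR pipeline; the checkpoint and audio path below are placeholders for the fine-tuned model and clip from the issue:

```python
# Minimal stand-in reproducer: compare pipeline output with and without timestamps.
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small")  # placeholder checkpoint

without_ts = pipe("audio.wav")                           # placeholder path
with_ts = pipe("audio.wav", return_timestamps=True)

print(without_ts["text"])
print(with_ts["text"])    # text can differ from the run above
print(with_ts["chunks"])  # segment-level timestamps
```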
Expected behavior
The text with and without timestamps "should" match, no? But with timestamps it somehow interferes, changing the text and, in this case, decreasing its accuracy.
This is a fine-tuned model, with a complex voice (patient whispers, breathing on a ventilator), and so far with insufficient data for better training. My point here is that I believe the model will therefore be more susceptible to influences that can deteriorate its recognition. However, my main questions are:
How does `generation(..., return_timestamps=True)` end up affecting the whole process?

My code (it's a bit of a mess as I experiment):
With `generate()`'s `return_timestamps=True`:

Predicted id [0] text: <|startoftranscript|><|en|><|transcribe|> There is a time... ...of a subconscious development. It don't work. Bureau work. The branch, the branch.<|endoftext|>
Predicted id [0] offsets: [{'text': ' There is a time...', 'timestamp': (0.0, 2.6)}, {'text': ' ...of a subconscious development.', 'timestamp': (14.6, 17.6)}, {'text': " It don't work.", 'timestamp': (20.6, 22.6)}, {'text': ' Bureau work.', 'timestamp': (23.400000000000002, 24.400000000000002)}, {'text': ' The branch, the branch.', 'timestamp': (25.6, 27.6)}]
Without `generate()`'s `return_timestamps=True`:

Predicted id [0] text: <|startoftranscript|><|en|><|transcribe|><|notimestamps|> there is it time... what is that chin? round one? you know what? the brown strap is... the brown strap is...<|endoftext|>
Predicted id [0] offsets: []
Full code below. (Please don't look at it unless you have to!)