jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License
1.59k stars, 176 forks

Are `model.align`'s results guaranteed to contain the same number of words as splitting the text by space? #330

Closed jimydavis closed 7 months ago

jimydavis commented 7 months ago

Suppose I have audio that corresponds to `text = "The brown fox, leapt over the dog."`, and assume it's clean English speech. Is `model.align` guaranteed, or at least likely, to return the same number of words as `len(text.split())`? In this case it should be 7 words. Assume also that I am not using `transcribe` and that I have the original transcript.

Thank you!

jianfch commented 7 months ago

Yes, that will be the case for "The brown fox, leapt over the dog.". The audio does not affect how the text is split into words, but `language`, `prepend_punctuations`, and `append_punctuations` do.

For English, a simplified way of thinking about the splitting:

  1. The text is split into words by space (keeping the space at the beginning of each word).
  2. Any punctuation in `prepend_punctuations`/`append_punctuations` that is not already part of a word is prepended/appended to the adjacent word.
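
The two steps above can be sketched in plain Python. This is only a rough model of the simplified description, not the library's code: the `split_words` helper, the regex, and the `PREPEND`/`APPEND` sets are all assumptions made for illustration, and the real defaults for `prepend_punctuations`/`append_punctuations` may differ.

```python
import re

# Assumed punctuation sets for illustration only; check the defaults of
# prepend_punctuations / append_punctuations in your installed version.
PREPEND = frozenset("\"'([{-")
APPEND = frozenset("\"'.,!?:)]}")


def split_words(text, prepend=PREPEND, append=APPEND):
    """Hypothetical helper modelling the simplified two-step splitting."""
    # Step 1: split on spaces, keeping the leading space with each piece.
    # Punctuation characters come out as separate pieces so that step 2
    # can decide whether to attach them to a neighbouring word.
    pieces = re.findall(r"\s*(?:\w+(?:'\w+)*|[^\w\s])", text)

    # Step 2a: walk right to left; a prepend punctuation that starts a new
    # word attaches to the piece directly after it (no space in between).
    merged = []
    for i in range(len(pieces) - 1, -1, -1):
        piece = pieces[i]
        starts_word = i == 0 or piece[0].isspace()
        if (merged and starts_word and piece.strip() in prepend
                and not merged[-1][0].isspace()):
            merged[-1] = piece + merged[-1]
        else:
            merged.append(piece)
    merged.reverse()

    # Step 2b: walk left to right; an append punctuation with no leading
    # space attaches to the word before it.
    words = []
    for piece in merged:
        if words and piece in append:
            words[-1] += piece
        else:
            words.append(piece)
    return words


text = "The brown fox, leapt over the dog."
words = split_words(text)
print(words)
# ['The', ' brown', ' fox,', ' leapt', ' over', ' the', ' dog.']
print(len(words) == len(text.split()))  # True
```

Under these assumptions the count matches `len(text.split())` whenever every punctuation mark in the text is covered by the two sets. A mark outside both sets would remain a separate piece in this model, so the counts could then differ.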

To replicate the exact process before calling align(), you can run these lines in align(): https://github.com/jianfch/stable-ts/blob/ad013d7f80de2b090ccfe967eb7801c8094cdf8a/stable_whisper/alignment.py#L227-L230

jimydavis commented 7 months ago

If the text were "I love Ed's . cookies.", how does it choose whether to attach the punctuation to "Ed's" or to "cookies"?

Thank you.

jianfch commented 7 months ago

> If the text were "I love Ed's . cookies.", how does it choose whether to attach the punctuation to "Ed's" or to "cookies"?

It will attach to neither because of the spaces before and after it, so it will be treated as its own word.
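
A quick way to see this with plain Python, as a rough illustration only: the regex and the `APPEND` set below are assumptions, not the library's code. Splitting by space while keeping the leading space turns the lone period into `" ."`, which is not the same string as the bare `"."` that an append set would contain.

```python
import re

# Assumed subset of append_punctuations, for illustration only.
APPEND = frozenset(".,!?")

text = "I love Ed's . cookies."

# Split by space, keeping the leading space with each piece.
pieces = re.findall(r" ?[^ ]+", text)

# The lone period carries its own leading space, so it does not match the
# bare "." in the set and stays a separate element.
print(pieces[3] == " .")      # True
print(pieces[3] in APPEND)    # False
print(len(pieces) == len(text.split()))  # True
```

The element count still matches `len(text.split())` here, since `text.split()` also returns the lone period as its own item.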