Closed: robinderat closed this issue 2 months ago
That is expected IMO: how do you choose which language to return? Do you return a language for each chunk? For each span of text between timestamps? If you return only one language, do you return the most-used one? I think these were the considerations. Also, getting the language first should be pretty easy.
cc @ylacombe as well
I have implemented the 2nd approach I outlined in a fork of the project (which is not ideal), and it's working just fine. In my opinion, if you can predict the language for a chunk, there is no harm in saying all words in that chunk belong to that language.
If there is an easy way to get the language first, that would be fine too, but I don't see how besides running the model twice, which is slow and wasteful.
Hi @robinderat,
Thank you so much for this question!
This is indeed a bug, and fixing it would be a very nice contribution. The second approach you suggest looks good as it would keep the structure of the output and just add the missing attributes.
Would you like to open a PR for this fix, given that you have implemented it in a forked project?
This would be a really valuable fix for the Whisper community :)
System Info
transformers version: 4.38.2

Who can help?
@sanchit-gandhi @ArthurZucker
Information

Tasks

examples folder (such as GLUE/SQuAD, ...)

Reproduction
I am using the automatic speech recognition pipeline with Whisper large v3. I want to get both the language as detected by Whisper and the timestamps for the individual words. Both behaviors are supported individually (return_language=True, return_timestamps="word"); however, when combined, the language is no longer returned.
This is the code I am using:
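The original snippet was not preserved in this extract; a minimal sketch of what such a call might look like, assuming a standard `transformers` install (note that running it downloads the several-GB openai/whisper-large-v3 checkpoint):

```python
from transformers import pipeline


def transcribe(audio_path):
    # Build the ASR pipeline; the model is only downloaded when this runs.
    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
    )
    # Each option works on its own; combined, "language" is missing
    # from the output, which is the bug reported here.
    return asr(audio_path, return_timestamps="word", return_language=True)
```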
I have done some digging in the source code, and I believe I have found the problem in tokenization_whisper.py.
Here, when using word timestamps, the existing chunks containing the language are ignored and only the words in each chunk are returned.
Expected behavior
I expect to get both the language and the word timestamps when using return_language=True and return_timestamps="word".
Considering the source code mentioned above, I see 2 possible ways of achieving this.
1) Add an extra list to the chunks when using return_timestamps='word'
This would result in the following format:
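The example output was not preserved in this extract; a hypothetical sketch of one possible shape, assuming the detected language is surfaced as an extra top-level list next to the word chunks (the "languages" key name is my illustration, not from the issue):

```python
# Hypothetical sketch of option 1: word-level chunks stay as they are,
# and the detected language(s) are returned in an extra list.
output = {
    "text": "hello world",
    "chunks": [
        {"text": "hello", "timestamp": (0.0, 0.5)},
        {"text": "world", "timestamp": (0.5, 1.0)},
    ],
    "languages": ["english"],  # illustrative extra list
}
```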
2) Add the language information to the words
This would maintain the current structure and simply add the language of the chunk to its words, giving the following format:
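The example output was not preserved in this extract; a hypothetical sketch of what option 2 could look like, with the chunk-level language copied onto each word entry (values are illustrative):

```python
# Hypothetical sketch of option 2: every word chunk carries the
# language detected for the chunk it came from.
output = {
    "text": "hello world",
    "chunks": [
        {"text": "hello", "timestamp": (0.0, 0.5), "language": "english"},
        {"text": "world", "timestamp": (0.5, 1.0), "language": "english"},
    ],
}
```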
I believe solution 1 could be achieved simply by removing the
if return_timestamps == "word":
statement completely. I believe solution 2 could be achieved by replacing the new_chunks.extend(chunk["words"])
with
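The replacement snippet itself was not preserved in this extract. As a standalone illustration of the idea (a hypothetical helper, not the actual tokenization_whisper.py code, whose surrounding context differs):

```python
def merge_language_into_words(chunks):
    """Copy each chunk's detected language onto its word entries.

    Hypothetical helper sketching the idea behind replacing
    new_chunks.extend(chunk["words"]) in the decoding code.
    """
    new_chunks = []
    for chunk in chunks:
        language = chunk.get("language")
        for word in chunk["words"]:
            # Attach the chunk-level language to every word chunk,
            # leaving the existing word keys untouched.
            new_chunks.append({**word, "language": language})
    return new_chunks
```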