linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Improve Whisper transcription using transcript #81

Closed lenaten closed 1 year ago

lenaten commented 1 year ago

Thank you so much for sharing this amazing library with us.

Would it be possible to use existing transcript to improve Whisper transcription using forced alignment?

Jeronymous commented 1 year ago

Thanks :smile:

I am not sure I understand the suggestion.

What do you mean by "existing transcript"?

Do you suggest finetuning Whisper with more accurate timestamps produced by whisper-timestamped?

darnn commented 1 year ago

If I understand correctly (since this is something I would want myself as well): suppose you have an accurate transcript of some audio that you wish to turn into synced subtitles. Using Whisper (or whisper-timestamped, etc.) you can obtain a subtitle file with timestamps, but with a less accurate transcript. The question is: is there any way to use forced alignment to match the accurate transcript to the subtitles with the accurate timing? Or, really, it might not need to involve forced alignment at all; it could just be a matter of matching the text in one file to the text in the other. Sadly, while I can imagine it conceptually, I have no idea how to do any of this myself.

Jeronymous commented 1 year ago

@lenaten is that it?

If it is, there is this discussion: https://github.com/linto-ai/whisper-timestamped/discussions/49. There seem to be other repositories that do just that.

My main concern about aligning with Whisper is that Whisper is not the best option for alignment: wav2vec models are a better option, and they are also much cheaper to run. Also, without any hints about initial timestamps (which a Whisper transcription provides originally), it's not straightforward to adapt whisper-timestamped to do alignment (it's quite a different problem).
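For readers curious what the alignment step involves under the hood: given per-frame scores from an acoustic model (such as wav2vec2) and the known token sequence, forced alignment reduces to a monotonic Viterbi search. The following is only a toy sketch in plain Python, using hand-written emission scores instead of a real model and ignoring CTC blank tokens for brevity; it is not code from this library:

```python
def forced_align(emissions, tokens):
    """Monotonic Viterbi alignment: assign each frame to one target token,
    in order, maximizing the summed emission scores.
    emissions: list of per-frame score lists (e.g. log-probs over the vocab).
    tokens: target token ids, in order.
    Returns a list of (token, start_frame, end_frame_exclusive) spans."""
    T, N = len(emissions), len(tokens)
    NEG = float("-inf")
    dp = [[NEG] * N for _ in range(T)]    # dp[t][n]: best score ending at token n on frame t
    back = [[0] * N for _ in range(T)]    # back[t][n]: token index on the previous frame
    dp[0][0] = emissions[0][tokens[0]]
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1][n]                       # keep emitting the same token
            move = dp[t - 1][n - 1] if n > 0 else NEG  # advance to the next token
            if move > stay:
                dp[t][n], back[t][n] = move + emissions[t][tokens[n]], n - 1
            else:
                dp[t][n], back[t][n] = stay + emissions[t][tokens[n]], n
    # Backtrack from the last frame/token to recover the frame span of each token.
    spans, n, end = [], N - 1, T
    for t in range(T - 1, 0, -1):
        prev = back[t][n]
        if prev != n:
            spans.append((tokens[n], t, end))
            end, n = t, prev
    spans.append((tokens[0], 0, end))
    spans.reverse()
    return spans
```

Multiplying frame indices by the model's frame duration then yields word timestamps. Real implementations also handle CTC blanks and repeated tokens, which this sketch omits.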

Concerning the segmentation of the full text into chunks that make sense, I am wondering whether users would be interested in providing those chunks of text themselves, or just providing the full text and letting the model segment it. Those two use cases would involve different methods.
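As a rough illustration of the second use case (the user provides only the full text), a naive segmenter could split at sentence punctuation and fall back to word boundaries for overlong pieces. This is purely a sketch; the `max_chars` limit and the splitting heuristic are arbitrary choices, not anything the library implements:

```python
import re

def chunk_text(text, max_chars=80):
    """Greedily split a full transcript into subtitle-sized chunks:
    first at sentence-final punctuation, then at the last space that
    keeps a piece under max_chars."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for piece in pieces:
        while len(piece) > max_chars:
            cut = piece.rfind(" ", 0, max_chars)
            if cut <= 0:            # single overlong word: hard cut
                cut = max_chars
            chunks.append(piece[:cut].strip())
            piece = piece[cut:].strip()
        if piece:
            chunks.append(piece)
    return chunks
```

A model-driven segmenter would instead place boundaries where the audio has pauses, which is the harder (and more useful) variant.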

lenaten commented 1 year ago

Thanks @Jeronymous, that's what I was looking for.

This project provides the most accurate timestamps, but not the most accurate transcription, which is why I'm trying to combine the best of both worlds.

Jeronymous commented 1 year ago

@lenaten I am curious to know what you mean by "not the most accurate transcription". With the option --accurate (and maybe additional ones like --vad) it's supposed to reach the same accuracy as OpenAI's whisper implementation, which is, to my knowledge, the most accurate one for Whisper models (from what I see in my benchmarks). What do you use?

darnn commented 1 year ago

For the record, not having the most powerful CPU or GPU, I use this: https://github.com/Const-me/Whisper/, which can actually run the large model (as opposed to the standard Whisper, whisper-timestamped, etc.). I don't know how frequently it incorporates the more recent changes to Whisper's code, but whatever I lose there I more than make up for with the speed.

lenaten commented 1 year ago

@Jeronymous Compared to the original transcript, this one is less accurate. It would be ideal if the original could be used.

Jeronymous commented 1 year ago

@lenaten I don't understand what you refer to as the "original transcript". Is it a transcript made by somebody, or the one from openai-whisper?

(if it's the one of openai-whisper, you can get it with whisper-timestamped, just using the right options as I mentioned).

lenaten commented 1 year ago

@Jeronymous In the case of a song or movie, you may already know the right lyrics, but not the timestamps. To give good timestamps, wav2vec2 requires fine-tuning per language. However, Whisper works well for multilingual tasks right out of the box.

Jeronymous commented 1 year ago

> Whisper works well for multilingual tasks right out of the box

Good point. Thanks for the clarification.

youkaclub commented 1 year ago

@Jeronymous the issue is closed but the need is open :) How can this be implemented with your library? I would love to contribute some code to make it happen.

kitschpatrol commented 1 year ago

@youkaclub I had this same need, and managed to get per-word alignment from the perfect / hand-made transcription working with a different library, whisperX.

Basically I do it in a couple of passes.

  1. Get the automatic transcription (with errors) from the audio file via whisperX. This produces a JSON file with per-line chunked timing data, not per-word.

  2. Write a script to replace each line of the error-prone transcription from whisperX with the corresponding text from the error-free hand-made transcript, while retaining the timing data generated by whisperX in the previous step. This was a little tricky to automate, since there won't be a 1:1 correspondence in individual words and chunk lengths between the perfect transcript and the whisperX transcript. There are a bunch of viable strategies to tackle this, but I ended up using simple Levenshtein-distance minimization to align the two corpora. This yields a JSON transcript file with chunk-level timing data, but with the text of the perfect transcript. (It's important to retain the basic structure of the whisperX transcript file, since the alignment step can only work on relatively small chunks of text at a time.)

  3. Feed the modified transcript file into whisperX's per-word alignment function. This yields a JSON file with per-word timings, wherein each word is an exact match to the words in your perfect transcript.

And you're done!
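The replacement step (2) above could be sketched like this, using Python's `difflib.SequenceMatcher` as a simple stand-in for the Levenshtein-distance minimization kitschpatrol describes. The segment dicts only mimic whisperX's chunk structure; everything here is illustrative, not whisperX's actual API:

```python
import difflib

def patch_segments(segments, perfect_text):
    """Replace each segment's (error-prone) text with words from the perfect
    transcript, keeping the segment timings intact.
    segments: list of {"start", "end", "text"} dicts (whisperX-style chunks).
    perfect_text: the hand-made, error-free transcript as one string."""
    asr_words, owner = [], []          # flattened ASR words + owning segment index
    for i, seg in enumerate(segments):
        for w in seg["text"].split():
            asr_words.append(w.lower())
            owner.append(i)
    good_words = perfect_text.split()
    # Map each perfect word to a segment via its matching ASR word.
    seg_of = [None] * len(good_words)
    sm = difflib.SequenceMatcher(a=asr_words, b=[w.lower() for w in good_words])
    for a, b, size in sm.get_matching_blocks():
        for k in range(size):
            seg_of[b + k] = owner[a + k]
    # Unmatched words (ASR errors) inherit the previous word's segment.
    last = 0
    for j, s in enumerate(seg_of):
        if s is None:
            seg_of[j] = last
        else:
            last = s
    # Rebuild segments: original timings, perfect text.
    out = [dict(seg, text="") for seg in segments]
    for w, i in zip(good_words, seg_of):
        out[i]["text"] = (out[i]["text"] + " " + w).strip()
    return out
```

The output keeps whisperX's chunk boundaries and timings, so it can then be fed to the per-word alignment pass in step 3.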

youkaclub commented 1 year ago

Thanks @kitschpatrol. It is difficult to get good word-level timestamps for non-English languages with whisperX, since it requires good wav2vec2 models for each language, while Whisper has good multilingual support. As my main use case is singing alignment, I have trained wav2vec2 models for major languages and use forced alignment to match the original lyrics. Anyway, it may be a good idea to auto-create an alignment dataset using your idea.

jcuenod commented 8 months ago

@Jeronymous I am also interested in forced alignment. I guess this is closed as "won't fix", but I'd like to add a +1 for this feature...