m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Bulk inference via video2dataset #318

Open iejMac opened 1 year ago

iejMac commented 1 year ago

Hey! Thanks for this - it works great. I'm working on implementing bulk Whisper transcription in video2dataset and was hoping to brainstorm some optimization ideas for the case where n_audio_hours is very large. More specifically, inference is performed across shards of data, where each shard contains a few hours of audio, i.e. we'd like to optimize for the case where we have a few hours of audio split among n files at a time.

Currently I have set up a "slow" proof of concept for this: https://github.com/iejMac/video2dataset/blob/529a32627a13884658e546e0241f475ef78f38bc/video2dataset/subsamplers/whisper_subsampler.py#L53, where each audio file is processed independently. If I understand correctly, batched inference could be used here to speed things up considerably. Perhaps loading the audio files, stitching them together, and processing them as one batch is a good idea?
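For concreteness, this is roughly what I'm imagining - just a sketch, not what video2dataset does today. It assumes the whisperx.load_model / load_audio / transcribe API from the README; transcribe_shard and the offset bookkeeping are made up for illustration:

```python
import numpy as np
import whisperx

SAMPLE_RATE = 16000  # whisperx.load_audio resamples everything to 16 kHz

def transcribe_shard(audio_paths, device="cuda", batch_size=16):
    model = whisperx.load_model("large-v2", device, compute_type="float16")

    # Load all files in the shard and concatenate them into one long array,
    # remembering where each file starts/ends (in seconds).
    arrays, offsets, t = [], [], 0.0
    for path in audio_paths:
        audio = whisperx.load_audio(path)
        duration = len(audio) / SAMPLE_RATE
        arrays.append(audio)
        offsets.append((t, t + duration))
        t += duration
    stitched = np.concatenate(arrays)

    # One batched transcription pass over the whole stitched shard.
    result = model.transcribe(stitched, batch_size=batch_size)

    # Split segments back out per file and shift timestamps to file-local time.
    per_file = {path: [] for path in audio_paths}
    for seg in result["segments"]:
        for path, (start, end) in zip(audio_paths, offsets):
            if start <= seg["start"] < end:
                per_file[path].append(
                    {**seg, "start": seg["start"] - start, "end": seg["end"] - start}
                )
                break
    return per_file
```

One obvious issue with this is segments that straddle a file boundary, which is basically the word-splitting problem.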

This is just one thought I had; if anyone has better ideas I'd love to hear them and try them out.

Thanks again!

m-bain commented 1 year ago

Makes sense - stitching the audio back together is definitely needed, otherwise a chunk boundary could split the middle of a word.

I guess there is also a use case where the dataset consists of many short videos (30s or less). Then the batching could process a queue of audio chunks (which can come from different videos).
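Roughly something like this - just a sketch of the queueing idea, nothing whisperX-specific; the helper names and the fixed 30 s chunk length are made up:

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # fixed 30 s chunks

def chunk_queue(videos):
    """videos: iterable of (video_id, 16 kHz mono audio as np.ndarray)."""
    for video_id, audio in videos:
        for i in range(0, len(audio), CHUNK_SAMPLES):
            yield video_id, audio[i:i + CHUNK_SAMPLES]

def batches(queue, batch_size=16):
    """Group chunks (possibly from different videos) into padded batches."""
    ids, chunks = [], []
    for video_id, chunk in queue:
        # Pad the last chunk of a short clip so every chunk stacks to 30 s.
        padded = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        ids.append(video_id)
        chunks.append(padded)
        if len(chunks) == batch_size:
            yield ids, np.stack(chunks)
            ids, chunks = [], []
    if chunks:
        yield ids, np.stack(chunks)
```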

sorgfresser commented 1 year ago

Regarding the stitching: if we just want to stitch some very long transcriptions, it should be feasible to manually create an overlap of j seconds. Afterwards you can remove this overlap using Longest Common Subsequence (or longest common substring with k mismatches?). I know Huggingface implemented this in their ASRPipeline. I'm thinking of dropping the last / first word of each chunk (since it could be the middle of a word and therefore off) and then using Longest Common Subsequence to find the overlapping part.

If you want it to be very advanced, you could also ensure that the first word of a chunk is not used in a prompt (since it could be the middle of a word), but this would require some additional logic, i.e. a custom transcribe function.
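For the overlap removal, something along these lines is what I mean - just a sketch using difflib's longest contiguous word match (the simple "substring" case, not a full LCS and not the Huggingface implementation); words_a / words_b stand for the word lists of two chunks transcribed with j seconds of overlap:

```python
from difflib import SequenceMatcher

def merge_overlapping(words_a, words_b, drop_edge_words=1):
    # Drop the possibly-cut word at the edge of each chunk.
    a = words_a[:-drop_edge_words] if drop_edge_words else words_a
    b = words_b[drop_edge_words:] if drop_edge_words else words_b

    # Find the longest matching run of words between the two chunks.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))

    if match.size == 0:
        return a + b  # no overlap found, just concatenate
    # Keep a up to the end of the match, then everything in b after it.
    return a[:match.a + match.size] + b[match.b + match.size:]

# e.g. chunk A ends "... jumps ov" (cut mid-word), chunk B repeats the overlap:
# merge_overlapping("the quick brown fox jumps ov".split(),
#                   "brown fox jumps over the lazy dog".split())
# -> ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```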