m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License
11.96k stars 1.26k forks source link

Long segments around 30 seconds. #362

Open twicer-is-coder opened 1 year ago

twicer-is-coder commented 1 year ago

image

The segments must be around between 1 - 5 seconds. The subtitles are not readable like this. I suspect due to this we are getting big segments. So I guess it should not be hardcoded but should be a option. I am not using any alignment model I am just transcribing. Is there any work around?

taucontrib commented 1 year ago

The pipeline first create batches around 30mins because thats the optimal length for whisper. Afterwards, it chunks those sentences with the alignment model into sentences. So your resulting segments should be shorter (i get 5-10secs on average).

jim60105 commented 1 year ago

@twicer-is-coder A new argument --chunk_size has been added at PR #445. Please check if this resolves your issue.