Closed: fjying closed this issue 1 year ago
There is no need for per-word timestamps. Each text segment from Whisper is a subset of a speaker segment, so we can apply a rolling window over the text segments to detect speaker changes.
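A minimal sketch of the rolling-window idea, not the code used in this issue. It assumes that each Whisper text segment already has a speaker embedding vector (how those are computed is outside this sketch), and it flags a change wherever the mean embedding of the window before a boundary differs strongly from the mean of the window after it. The function names, `window`, and `threshold` are all hypothetical.

```python
from math import sqrt


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (1.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def detect_speaker_changes(embeddings, window=2, threshold=0.5):
    """Return each index i where a speaker change is flagged between
    text segment i-1 and text segment i.

    embeddings: one vector per text segment (assumed input, see lead-in).
    window:     number of segments averaged on each side of the boundary.
    threshold:  cosine distance above which a change is flagged (assumed value).
    """
    changes = []
    for i in range(1, len(embeddings)):
        left = embeddings[max(0, i - window):i]   # segments just before the boundary
        right = embeddings[i:i + window]          # segments just after the boundary
        if cosine_distance(mean_vector(left), mean_vector(right)) > threshold:
            changes.append(i)
    return changes


# Toy example: three segments from one speaker, then three from another.
toy = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
print(detect_speaker_changes(toy))  # -> [3]
```

A larger `window` smooths out noisy embeddings, but a mixed window scores lower than a clean one, so very short turns may be missed.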
Run time estimate: 77 seconds for a 93-second video on one GPU, without parallel processing (roughly 0.83x real time).
We need to output the start time and end time of each word instead of each sentence.
This feature is not supported by Whisper: https://github.com/openai/whisper/discussions/332
Whisper can only provide timestamps for each segment, i.e., at the sentence level.
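For reference, here is a small sketch of reading those segment-level timestamps. It assumes the `openai-whisper` Python package, whose `transcribe()` result is a dict with a `"segments"` list carrying `"start"`, `"end"`, and `"text"` for each segment. The helper name `segment_times` and the file name in the usage comment are hypothetical.

```python
def segment_times(result):
    """Return (start_seconds, end_seconds, text) for every segment
    in a Whisper transcribe() result dict."""
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in result["segments"]
    ]


# Usage, assuming openai-whisper is installed and "audio.wav" exists (hypothetical file):
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("audio.wav")
#   for start, end, text in segment_times(result):
#       print(f"[{start:.2f}s -> {end:.2f}s] {text}")
```

Each tuple covers a whole segment, so the start and end times describe the sentence-like chunk rather than the individual words in it.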