princeton-ddss / SpeechMLPipeline

SpeechMLPipeline is a complete pipeline for deploying machine learning models that generate labelled, timestamped transcripts from audio inputs.
MIT License

Open AI Whisper for Audio-to-Text Transcription #12

Closed fjying closed 1 year ago

fjying commented 1 year ago

We need to output the start time and end time of each word, not just each sentence.

This feature is not supported by Whisper: https://github.com/openai/whisper/discussions/332

Whisper can only identify timestamps at the segment (sentence) level.
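For reference, a minimal sketch of what segment-level output looks like. The segment dicts below mimic the shape of openai-whisper's `transcribe()` result (`result["segments"]`, each with `start`, `end`, and `text` fields); the sample data is hand-written for illustration, not real model output.

```python
# Illustrative sketch: openai-whisper's transcribe() returns a dict whose
# "segments" list carries sentence-level timestamps, not word-level ones.
# In real use, segments would come from:
#   whisper.load_model("base").transcribe(audio_path)["segments"]

def segment_timestamps(segments):
    """Extract (start, end, text) tuples from Whisper-style segments."""
    return [(s["start"], s["end"], s["text"].strip()) for s in segments]

# Hand-written example mimicking Whisper's output shape
segments = [
    {"start": 0.0, "end": 4.2, "text": " Hello and welcome."},
    {"start": 4.2, "end": 9.7, "text": " Today we discuss transcription."},
]

for start, end, text in segment_timestamps(segments):
    print(f"[{start:.1f}-{end:.1f}] {text}")
```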

fjying commented 1 year ago

There is no need for word-level timestamps. Whisper's text segmentation is a subset of the speaker segmentation, so we can apply a rolling window over the text segments to detect speaker changes.
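The idea above can be sketched as follows. Because each text segment falls inside a single speaker turn, a speaker can be assigned to each text segment by locating the turn that contains its midpoint, and speaker changes can be detected by comparing consecutive assignments. Function names and data shapes here are illustrative assumptions, not the pipeline's actual API.

```python
# Hypothetical sketch: assign speakers to Whisper-style text segments,
# assuming the text segmentation nests inside the speaker segmentation.

def assign_speakers(text_segments, speaker_turns):
    """text_segments: [(start, end, text)]; speaker_turns: [(start, end, speaker)]."""
    labelled = []
    for start, end, text in text_segments:
        mid = (start + end) / 2  # midpoint decides the containing turn
        speaker = next(
            (who for s, e, who in speaker_turns if s <= mid < e), "unknown"
        )
        labelled.append((start, end, speaker, text))
    return labelled

def speaker_changes(labelled):
    """Indices where the speaker differs from the previous segment."""
    return [
        i for i in range(1, len(labelled))
        if labelled[i][2] != labelled[i - 1][2]
    ]

turns = [(0.0, 5.0, "A"), (5.0, 12.0, "B")]
texts = [(0.0, 2.0, "Hi."), (2.0, 4.5, "How are you?"), (5.5, 8.0, "Fine.")]
labelled = assign_speakers(texts, turns)
print(labelled)
print(speaker_changes(labelled))  # a speaker change at the third segment
```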

fjying commented 1 year ago

Run time estimate: 77 seconds for a 93-second video on a single GPU, without parallel processing.