Process Whisper Raw Output to Improve the Accuracy of Speaker Diarization

princeton-ddss / SpeechMLPipeline

SpeechMLPipeline is a complete pipeline to deploy Machine Learning Models to generate labelled and timestamped transcripts from audio inputs

MIT License

Closed: fjying closed this issue 7 months ago

fjying commented 10 months ago

Use the whisper-timestamped package instead of the raw Whisper output to fix the mismatch between transcribed text and timestamps.
Merge duplicated text across consecutive timestamps so that each merged segment covers more audio for the speaker embedding.
Create a unique ID for each segment output by Whisper to make merging with other outputs easier. Transcription text is not a unique identifier: the same sentence, such as "okay", may be spoken multiple times. Merging on a segment ID is faster than merging on text and avoids ambiguous matches.
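
The merge and segment-ID steps described above could be sketched roughly as follows. This is an illustration only, not the code in this repository: it assumes each raw segment is a dict with `start`, `end`, and `text` keys (the shape of Whisper's `segments` list), and it reads "duplicated text" as consecutive segments whose text is identical. The names `Segment`, `merge_consecutive_duplicates`, `attach_speakers`, and the sample speaker labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: int  # unique key used for later merges, instead of the text
    start: float
    end: float
    text: str


def merge_consecutive_duplicates(raw_segments):
    """Collapse runs of consecutive segments with identical text into one
    longer segment, then number the results with a unique segment_id."""
    merged = []
    for seg in raw_segments:
        text = seg["text"].strip()
        if merged and merged[-1].text == text:
            # Same text as the previous segment: extend its time span.
            merged[-1].end = seg["end"]
        else:
            merged.append(Segment(len(merged), seg["start"], seg["end"], text))
    return merged


def attach_speakers(segments, speaker_by_id):
    """Join speaker labels onto segments by segment_id (not by text)."""
    return [
        (s.segment_id, s.start, s.end, speaker_by_id.get(s.segment_id, "UNKNOWN"), s.text)
        for s in segments
    ]


# Made-up sample data, assumed to look like Whisper segment dicts.
raw = [
    {"start": 0.0, "end": 1.0, "text": " okay"},
    {"start": 1.0, "end": 2.0, "text": " okay"},
    {"start": 2.0, "end": 4.5, "text": " let's begin"},
    {"start": 4.5, "end": 5.0, "text": " okay"},
]

merged = merge_consecutive_duplicates(raw)

# Hypothetical labels from a separate diarization step, keyed by segment_id.
speakers = {0: "SPEAKER_00", 1: "SPEAKER_01", 2: "SPEAKER_00"}
labelled = attach_speakers(merged, speakers)

for row in labelled:
    print(row)
```

In this sample the first two "okay" segments collapse into a single segment spanning 0.0 to 2.0, while the later "okay" keeps its own ID, so a join keyed on `segment_id` cannot confuse the two.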

fjying commented 10 months ago

Merge duplicated text across consecutive timestamps:

Before Merge:

After Merge: