Use the whisper-timestamped package instead to fix the mismatch between the transcribed text and the timestamps in Whisper outputs
Merge duplicated text across consecutive timestamps to increase the embedding size of the speaker
Create a unique id for each segment output by Whisper to make merging with other outputs easier. Transcription text is not a unique id: the same sentence, like "okay", may be spoken multiple times. Merging on a segment id number is faster and avoids the ambiguity of merging on text
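
The merge and id steps above could be sketched roughly as follows. This is only an illustration, not the project's actual code: it assumes each segment is a dict with "start", "end", and "text" keys (the general shape of Whisper segment output), and the helper names merge_consecutive_duplicates and assign_segment_ids, as well as the "segment_id" key, are hypothetical.

```python
from typing import Any, Dict, List

Segment = Dict[str, Any]


def merge_consecutive_duplicates(segments: List[Segment]) -> List[Segment]:
    """Collapse runs of adjacent segments that carry the same text.

    The merged segment keeps the start of the first segment in the run
    and takes the end of the last one, so it covers a longer time span.
    """
    merged: List[Segment] = []
    for seg in segments:
        text = seg["text"].strip()
        if merged and merged[-1]["text"].strip() == text:
            # Same text as the previous segment: extend its end time.
            merged[-1]["end"] = seg["end"]
        else:
            # Copy so the caller's input list is left untouched.
            merged.append(dict(seg))
    return merged


def assign_segment_ids(segments: List[Segment]) -> List[Segment]:
    """Return copies of the segments with a sequential 'segment_id' (hypothetical key)."""
    return [{**seg, "segment_id": i} for i, seg in enumerate(segments)]


# Made-up example data, for illustration only.
segments = [
    {"start": 0.0, "end": 1.0, "text": "okay"},
    {"start": 1.0, "end": 2.0, "text": "okay"},
    {"start": 2.0, "end": 4.0, "text": "let's begin"},
    {"start": 9.0, "end": 9.5, "text": "okay"},
]

result = assign_segment_ids(merge_consecutive_duplicates(segments))
for seg in result:
    print(seg["segment_id"], seg["start"], seg["end"], seg["text"])
```

In this sketch the two adjacent "okay" segments are merged into one, while the later "okay" stays separate and gets its own id, so a downstream merge on segment_id does not confuse the two.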