Detecting Syllable / Phonetic Timestamps

m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

BSD 2-Clause "Simplified" License


Detecting Syllable / Phonetic Timestamps #762

Open Akz47 opened 6 months ago

Akz47 commented 6 months ago

May I know if there is a way to use WhisperX to generate timestamps for syllables or phonemes, instead of for the words detected by the Whisper model?

Our use case is to detect pronunciations / syllables in audio recordings, and sometimes words are not properly detected / omitted by Whisper (even for large models).

It would be helpful if we could obtain the syllable / phonetic timestamps, even if it is not a recognized word.

Thank you.

SmartManoj commented 6 months ago

To give you the most relevant advice, could you explain your primary use case for detecting syllables or phonemes with WhisperX? For example, are you focusing on language learning, speech therapy, or another area? Understanding your main objective will help us avoid the XY problem and offer more targeted assistance.

Thank you.

Akz47 commented 6 months ago

@SmartManoj Thank you for your reply.

We are experimenting with pronunciation improvement, detecting and analyzing how people pronounce syllables / phonemes. These sounds may not constitute a complete / actual word, and Whisper seems to output only recognized words.

For our use case, it might not be necessary to identify the spoken word, but rather, the syllables and corresponding timestamps.

We have also been scouting the web for alternative Python modules that can split speech audio into syllables, but have not come across a suitable option.

Can you please advise? Thanks.

SmartManoj commented 6 months ago

You might find the My-Voice Analysis library useful. It is a Python library for voice analysis that can detect syllable boundaries in audio files without needing transcription. Here's the GitHub repository for more information and usage instructions: My-Voice Analysis.
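For anyone wondering what transcription-free syllable detection can look like in principle, here is a minimal sketch. It is not the My-Voice Analysis API or its algorithm. It is a generic energy-envelope peak picker, and the function names (`frame_rms`, `syllable_peaks`) and all parameter defaults are hypothetical choices for illustration. It assumes mono audio already loaded as a list of float samples, and it will likely need tuning (or a proper method) on real speech.

```python
import math


def frame_rms(samples, frame_len, hop):
    """Short-time RMS energy, one value per (possibly overlapping) frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def syllable_peaks(samples, sample_rate, frame_ms=25, hop_ms=10,
                   rel_threshold=0.3, min_gap_s=0.1):
    """Return approximate syllable-nucleus times (seconds).

    Assumption: each syllable nucleus shows up as a local maximum of the
    energy envelope. Peaks below rel_threshold * max energy are ignored,
    and peaks closer than min_gap_s to the previous kept peak are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    env = frame_rms(samples, frame_len, hop)
    if not env:
        return []

    floor = rel_threshold * max(env)
    peaks = []
    for i in range(1, len(env) - 1):
        is_peak = (env[i] > env[i - 1]
                   and env[i] >= env[i + 1]
                   and env[i] > floor)
        if not is_peak:
            continue
        # Report the time at the centre of the frame.
        t = (i * hop + frame_len / 2) / sample_rate
        if peaks and t - peaks[-1] < min_gap_s:
            continue
        peaks.append(t)
    return peaks
```

Calling `syllable_peaks(samples, sample_rate)` returns a list of times in seconds, one per detected energy peak. These are candidate syllable nuclei only. The sketch does not label the syllables or find their exact boundaries.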

Akz47 commented 6 months ago

@SmartManoj Thank you, that reference is really helpful. We will experiment with that module to generate the syllables and timestamps.

Just to reconfirm, can Whisper / WhisperX also be tweaked to detect syllables, or does it only work with proper words?

hollarob commented 6 months ago

I would also be interested in phoneme-based timestamps.

jonaaathan commented 4 months ago

We would also be interested in this, if it can be run locally and deliver phoneme predictions similar to the Azure API.

kingjr commented 2 months ago

Also interested in getting phoneme-level timestamps (for neuroscience of speech research).