Open Akz47 opened 6 months ago
To give you the most relevant advice, could you explain your primary use case for detecting syllables or phonemes with WhisperX? For example, are you focusing on language learning, speech therapy, or another area? Understanding your main objective will help us avoid the XY problem and offer more targeted assistance.
Thank you.
@SmartManoj Thank you for your reply.
We are experimenting with pronunciation improvement, detecting and analyzing how people pronounce syllables / phonemes. These sounds may not constitute a complete / actual word, and Whisper seems to output only recognized words.
For our use case, it might not be necessary to identify the spoken word, but rather, the syllables and corresponding timestamps.
We have also been scouting the web for alternative Python modules that can split speech audio into syllables, but have not come across a suitable option.
Can you please advise? Thanks.
You might find the My-Voice Analysis library useful. It is a Python library for voice analysis that can detect syllable boundaries in audio files without needing transcription. See its GitHub repository for more information and usage instructions: My-Voice Analysis.
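If it helps while you evaluate that library, here is a minimal sketch of one common transcription-free idea: treat peaks in the short-time energy envelope as approximate syllable nuclei. To be clear, this is not how My-Voice Analysis or WhisperX works internally. The function name `syllable_nuclei` and all of its parameters and default values are hypothetical, chosen only for illustration, and real speech would likely need band-pass filtering and tuning.

```python
import numpy as np


def syllable_nuclei(signal, sr, frame_ms=20.0, hop_ms=10.0,
                    rel_threshold=0.3, min_gap_s=0.12):
    """Return approximate syllable-nucleus times (seconds).

    Hypothetical helper: picks local maxima of a smoothed RMS energy
    envelope. All defaults are illustrative assumptions, not tuned values.
    """
    signal = np.asarray(signal, dtype=np.float64)
    frame = max(1, int(sr * frame_ms / 1000))
    hop = max(1, int(sr * hop_ms / 1000))
    if len(signal) < frame:
        return []

    # Short-time RMS energy, one value per frame.
    n_frames = 1 + (len(signal) - frame) // hop
    energy = np.array([
        np.sqrt(np.mean(signal[i * hop:i * hop + frame] ** 2))
        for i in range(n_frames)
    ])

    # Smooth with a 5-frame moving average to suppress small ripples.
    smooth = np.convolve(energy, np.ones(5) / 5, mode="same")
    if smooth.max() <= 0:
        return []  # silence: nothing to detect
    threshold = rel_threshold * smooth.max()

    times = []
    for i in range(1, n_frames - 1):
        is_peak = smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
        if is_peak and smooth[i] >= threshold:
            t = (i * hop + frame / 2) / sr  # centre of the frame
            # Drop peaks that follow the previous accepted peak too closely.
            if not times or t - times[-1] >= min_gap_s:
                times.append(t)
    return times
```

You would call it with a mono waveform and its sample rate, e.g. `syllable_nuclei(samples, 16000)`, and get back a list of times in seconds. It only marks where syllable-like energy peaks occur; it does not label which syllable or phoneme was spoken.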
@SmartManoj Thank you, that reference is really helpful. We will experiment with that module to generate the syllables and timestamps.
Just to reconfirm, can Whisper / WhisperX also be tweaked to detect syllables, or does it only work with proper words?
I would also be interested in phoneme-based timestamps.
We would also be interested, provided it can run locally and deliver phoneme predictions similar to the Azure API.
Also interested in getting phoneme-level timestamps (for neuroscience of speech research).
May I know if there is a way to use WhisperX to generate timestamps for syllables or phonemes, instead of for the words detected by the Whisper model?
Our use case is to detect pronunciations / syllables in audio recordings, and sometimes words are not properly detected / omitted by Whisper (even for large models).
It would be helpful if we could obtain the syllable / phonetic timestamps, even if it is not a recognized word.
Thank you.