metavoiceio / metavoice-src

Foundational model for human-like, expressive TTS
https://themetavoice.xyz/
Apache License 2.0

Possible to get timing info? #14

Open benjismith opened 5 months ago

benjismith commented 5 months ago

Is it possible to have this model also generate millisecond-level timestamps for the words (or phonemes) in the prompt?

I currently use speech marks from AWS Polly, and if this model could generate the same format, that would be very helpful!
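For reference, Polly's word-level speech marks are newline-delimited JSON, one object per word. As I understand the format, `time` is milliseconds from the start of the audio stream and `start`/`end` are byte offsets into the input text, so the output looks roughly like this:

```json
{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
{"time":604,"type":"word","start":9,"end":10,"value":"a"}
```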

sidroopdaska commented 5 months ago

That's interesting. Can you share more details about your use case?

benjismith commented 5 months ago

I make a cloud writing platform for fiction authors (https://shaxpir.com). I'm working on a new "read aloud" feature where authors can highlight a few paragraphs of text and hear it read aloud to them by an AI voice.

Each word in the selection is highlighted as the voice reads the text, so that the author can follow along with their eyes. That's why I need the timestamps. But you can imagine a similar use-case with ebook readers, news readers, etc. Any application where the user might want to follow along with their eyes as an AI voice reads a block of text.

I've heard similar kinds of requests from creators of animated avatars. But in those cases, the developers usually need timestamps for each phoneme, so that they can synchronize mouth movements and other facial animations with the AI voices.

Shiro836 commented 5 months ago

+1. I need word-level timestamps to make the TTS output more interactive on screen.

vatsalaggarwal commented 5 months ago

That makes sense. Not sure when we'll have the time to get to it; in the meantime, this is probably something one can do with a forced-alignment pipeline post-generation?
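For example, here's a rough, untested sketch using torchaudio's CTC forced-alignment API (torchaudio >= 2.1). It assumes the TTS output has been resampled to 16 kHz mono and the transcript normalized to lowercase letters and apostrophes; the file name is just a placeholder:

```python
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

# MMS_FA is torchaudio's multilingual forced-alignment bundle.
bundle = torchaudio.pipelines.MMS_FA
model = bundle.get_model(with_star=False).to(device)
tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()

waveform, sr = torchaudio.load("tts_output.wav")  # placeholder path
assert sr == bundle.sample_rate  # MMS_FA expects 16 kHz audio

# The aligner works on normalized text: lowercase, letters/apostrophes only.
transcript = "mary had a little lamb".split()

with torch.inference_mode():
    emission, _ = model(waveform.to(device))  # (1, frames, vocab) log-probs
    token_spans = aligner(emission[0], tokenizer(transcript))

# Spans are in emission-frame indices; convert back to milliseconds.
seconds_per_frame = waveform.size(1) / emission.size(1) / bundle.sample_rate
for word, spans in zip(transcript, token_spans):
    start_ms = int(spans[0].start * seconds_per_frame * 1000)
    end_ms = int(spans[-1].end * seconds_per_frame * 1000)
    print(f"{word}: {start_ms}-{end_ms} ms")
```

Since TTS already knows the exact transcript, forced alignment should be a good fit here: there's no ASR uncertainty, only the timing needs to be recovered.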

danablend commented 4 months ago

@Shiro836 @benjismith For getting word-level timestamps I had great success using Kalpy (https://github.com/mmcauliffe/kalpy), which is a low-level wrapper around Kaldi (a C++ library for speech processing).

The author of Kalpy is also the author of the widely used Montreal Forced Aligner (MFA), which is a higher-level wrapper around Kaldi.

MFA is nice, but it loads and unloads models every time you do alignment, which causes a lot of overhead for small jobs like a quick "get the timestamps of this paragraph's audio" feature. It's built mostly for batch processing, like preparing massive datasets for training AI models for ASR and TTS.

Kalpy gives you more control since it sits closer to the Kaldi C++ library, and you can keep the aligner models loaded in memory on your server, reducing latency for your users. In my experience, Kalpy takes ~300ms to align ~3s of audio; MFA takes about 3-5s for the same job because of the model loading/unloading overhead.
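Whichever aligner you end up with, turning its word timings into the Polly-style speech marks @benjismith mentioned is just a formatting step. A minimal sketch, assuming your aligner hands you `(word, time_ms, char_start, char_end)` tuples (that tuple shape is hypothetical; adapt it to your tool's output):

```python
import json

def to_speech_marks(words):
    """words: iterable of (value, time_ms, char_start, char_end) tuples
    from your aligner; returns Polly-style newline-delimited JSON."""
    lines = []
    for value, time_ms, start, end in words:
        lines.append(json.dumps(
            {"time": time_ms, "type": "word", "start": start, "end": end, "value": value},
            separators=(",", ":"),
        ))
    return "\n".join(lines)

# Hypothetical aligner output for "Mary had a little lamb"
print(to_speech_marks([
    ("Mary", 6, 0, 4),
    ("had", 373, 5, 8),
]))
```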