ndarilek / tts-rs


Support voice synthesis to Vec<u8> #30

Open Bear-03 opened 2 years ago

Bear-03 commented 2 years ago

This aims to solve #12.

Bear-03 commented 2 years ago

I've only implemented WinRT for now, but I'll look into how to implement it for the other backends.

ndarilek commented 2 years ago

Neat, thanks! More backends would be great--what often happens is folks do one and I end up having to do the rest. :) You won't be able to do tolk, but if you can at least cover web, I'll look into the others.

Also, I wonder if we should use Vec<u8> or some other, slightly smarter container for audio? I'd like to be sure there's a known output format for whatever audio data we get, and I'm concerned that each synth might have its own concept of what format to use for synthesized audio. So we might end up with a situation where different platforms output different formats and the crate is no longer cross-platform.

Thanks again.

Bear-03 commented 2 years ago

> I'd like to be sure there's a known output format for whatever audio data we get

For my current project (where I'm going to be using tts-rs) I use PCM, specifically signed 16-bit PCM. Signed 16-bit, unsigned 16-bit and 32-bit floating point are the three sample formats that cpal supports, and since it is a popular crate, I'd assume supporting those would be more than enough.
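To make the three sample formats concrete, here is a minimal sketch of converting between them with the usual PCM scaling conventions. The function names are hypothetical and are not part of tts-rs or cpal.

```rust
/// Convert signed 16-bit PCM samples to 32-bit float in the range [-1.0, 1.0).
/// Hypothetical helper, not a tts-rs or cpal API.
pub fn i16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| f32::from(s) / 32768.0).collect()
}

/// Convert unsigned 16-bit PCM (silence at 32768) to signed 16-bit PCM
/// (silence at 0) by shifting the midpoint.
pub fn u16_to_i16(samples: &[u16]) -> Vec<i16> {
    samples
        .iter()
        .map(|&s| (i32::from(s) - 32768) as i16)
        .collect()
}
```

If every backend were normalized to one of these formats, a caller could convert to whatever its output stream expects.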

I remember reading that the bytes WinRT returns are already PCM, but I'm not too sure; we could do some research. Of course, keeping the library cross-platform is a priority.

ndarilek commented 2 years ago

Gotcha. I'm guessing the way forward is to synthesize to something other than a raw vec but which indicates its format. Then, if we discover everything just happens to use the same format, we can drop that requirement and just send raw bytes. I feel like whenever I have to pipe bytes to an audio library, I'm often required to know things about them (i.e., sample rate, bit depth, etc.). I want to make sure we're giving folks that information if it's going to differ from engine to engine so they don't have to figure it out themselves.
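One way to picture "something other than a raw vec but which indicates its format" is a return type that tags the samples with their type and carries the sample rate and channel count alongside. This is only a sketch; `Samples` and `SynthesizedAudio` are hypothetical names, not types from tts-rs.

```rust
/// Sample data tagged with its format. Hypothetical type, not part of tts-rs.
#[derive(Debug, Clone, PartialEq)]
pub enum Samples {
    I16(Vec<i16>),
    U16(Vec<u16>),
    F32(Vec<f32>),
}

/// Synthesized audio that describes itself. Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
pub struct SynthesizedAudio {
    pub sample_rate: u32,
    pub channels: u16,
    pub samples: Samples,
}

impl SynthesizedAudio {
    /// Number of frames, where one frame holds one sample per channel.
    pub fn frames(&self) -> usize {
        let len = match &self.samples {
            Samples::I16(v) => v.len(),
            Samples::U16(v) => v.len(),
            Samples::F32(v) => v.len(),
        };
        // Treat a channel count of 0 as 1 to avoid dividing by zero.
        len / usize::from(self.channels.max(1))
    }

    /// Playback duration in seconds, derived from frames and sample rate.
    pub fn duration_secs(&self) -> f64 {
        self.frames() as f64 / f64::from(self.sample_rate)
    }
}
```

With a type like this, the caller never has to look up the bit depth or rate separately, at the cost of moving a few extra bytes with each result.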

Thanks again.

Bear-03 commented 2 years ago

Yes, now that you said it, that's true, you're often required to provide a lot of parameters to play audio or save it to a file. I'm pretty sure that those are constant for a given backend, so it would be a matter of creating something like a Spec struct that holds that data for each backend.
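A rough sketch of what such a per-backend `Spec` could look like follows. The field names, the `Backend` enum, and especially the numeric values are placeholders chosen for illustration; the real output format of each backend has not been confirmed in this thread.

```rust
/// Sample encodings discussed in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SampleFormat {
    I16,
    U16,
    F32,
}

/// Hypothetical description of a backend's output audio.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Spec {
    pub sample_rate: u32,
    pub channels: u16,
    pub format: SampleFormat,
}

/// Hypothetical backend selector, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    WinRt,
    Web,
}

impl Backend {
    /// Hard-coded spec per backend. The values are PLACEHOLDERS,
    /// not measured facts about WinRT or the web backend.
    pub const fn spec(self) -> Spec {
        match self {
            Backend::WinRt => Spec {
                sample_rate: 22_050,
                channels: 1,
                format: SampleFormat::I16,
            },
            Backend::Web => Spec {
                sample_rate: 44_100,
                channels: 1,
                format: SampleFormat::F32,
            },
        }
    }
}
```

Because `Spec` is `Copy` and `spec()` is a `const fn`, a caller can fetch it once and store it without any allocation.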

ndarilek commented 2 years ago

Does cpal not have some sort of audio container with all this data that we can return directly? I'm a bit hesitant to have the audio parameters be a separate thing you need access to--I'd rather the return value include everything necessary, if possible.

Bear-03 commented 2 years ago

cpal uses SupportedStreamConfig, which holds the info about your input/output device.

Returning the audio metadata every time you synthesize would be wasteful, in my opinion, as you're using resources for things that aren't really needed. The audio metadata won't change during runtime, so generating it once and letting the developer store it is far more efficient.

ndarilek commented 2 years ago

Gotcha, I'd hoped it'd be part of the returned container. Anyhow, if there's some way we can autogenerate it once and cache it, that might be useful. I'm a bit concerned about these formats changing, and of having to maintain/sync hard-coded structs. But maybe that's not warranted. I'll see what you come up with.
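As an illustration of "autogenerate it once and cache it", the standard library's `OnceLock` can hold the spec after the first lookup. In this sketch `probe_backend` is a hypothetical stand-in for whatever a backend would do to discover its format, and the values it returns are placeholders.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical audio description; field names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Spec {
    pub sample_rate: u32,
    pub channels: u16,
    pub bits_per_sample: u16,
}

/// Counts how often the probe runs, only to show that it runs once.
static PROBES: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for querying the backend; returns placeholder values.
fn probe_backend() -> Spec {
    PROBES.fetch_add(1, Ordering::SeqCst);
    Spec {
        sample_rate: 22_050,
        channels: 1,
        bits_per_sample: 16,
    }
}

/// Returns the cached spec, probing the backend on first use only.
pub fn cached_spec() -> &'static Spec {
    static SPEC: OnceLock<Spec> = OnceLock::new();
    SPEC.get_or_init(probe_backend)
}
```

Every later call returns the same reference, so including the spec with each synthesized buffer would cost little more than copying a pointer.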

Bear-03 commented 2 years ago

> if there's some way we can autogenerate it once and cache it, that might be useful

And then return it in the container?

> I'm a bit concerned about these formats changing, and of having to maintain/sync hard-coded structs

I am sure that it is impossible to retrieve the audio metadata from the audio bytes themselves, as you need the metadata first to then interpret them.
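The point about needing the metadata first can be shown with a small sketch: the same raw bytes decode to entirely different samples depending on the assumed format. The helper names are hypothetical, and little-endian byte order is an assumption here.

```rust
/// Decode bytes as little-endian signed 16-bit PCM.
/// Returns None if the length is not a multiple of 2.
pub fn decode_i16_le(bytes: &[u8]) -> Option<Vec<i16>> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(2)
            .map(|c| i16::from_le_bytes([c[0], c[1]]))
            .collect(),
    )
}

/// Decode bytes as little-endian 32-bit float PCM.
/// Returns None if the length is not a multiple of 4.
pub fn decode_f32_le(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

For example, the four bytes `00 00 80 3F` are one float sample of 1.0 under the second decoder, but two 16-bit samples (0 and 16256) under the first, and nothing in the bytes says which reading is correct.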

AFAIK WinRT doesn't have any way to get the audio spec (I'll have a look), so the only alternative is hard-coding it. My idea is to have a method, similar to min_rate(), normal_rate() and max_rate(), that returns the Spec for each backend.