Closed — Purus closed this issue 3 years ago
We discussed it a bit here:
https://github.com/JonnyBurger/remotion/issues/28#issuecomment-813941193
I found that there is no suitable browser API, but of course you can use a cloud service from Google or AWS. I'm gonna leave this open because it's a great reminder to build a starter template for it and release it 🙂
It would be great if we didn't have to depend on third-party cloud services. Maybe something like https://github.com/MikeyParton/react-speech-kit
Here we are constrained by the APIs of the browser. While you can trigger TTS, you cannot access the audio data, put it into an audio tag, or export it as a file. :( This makes it impossible to use in a rendered video.
Doing TTS well requires that it can coordinate with animation timing. You're likely to eventually combine automated translation with TTS, which means the audio lengths will vary, so you need a way to time the associated animation to the length of the resulting audio. You may also wish to display or highlight specific words (which may be translated) in sync with the matching audio.
If you're combining word animation with translation, you want to perform the translation in one API call if possible, not just for performance but to provide the best translation context, so that, for example, a sequence of related words being animated is translated together.
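The timing coordination described above can be sketched as a small helper that converts a synthesized clip's duration into a frame count, so a composition or sequence can be sized to match the speech. Everything here (the function name, the `paddingSeconds` parameter) is illustrative, not a Remotion API:

```typescript
// Hypothetical helper: convert an audio clip's duration (in seconds) into
// a frame count so the animation can be sized to the length of the speech.
// A small tail padding avoids the animation cutting off abruptly.
export const framesForAudio = (
  audioDurationSeconds: number,
  fps: number,
  paddingSeconds = 0.25
): number => Math.ceil((audioDurationSeconds + paddingSeconds) * fps);
```

Because translated audio varies in length, computing the frame count from the actual synthesized file (rather than hard-coding it) keeps the animation in sync across languages.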
I am adding a $60 bounty to whoever submits a text-to-speech example that works in preview mode as well as render mode and that uses a cloud TTS API. (We should be able to release the example as part of Remotion but of course you will maintain credit)
@jonnyburger has funded $60.00 to this issue.
To get the timing of the words spoken, you can reverse the process and use STT APIs (Google supports 125 languages and accents). Their API gives you a list of phrases spoken (the "alternatives" array) and, within each phrase, the timing of each individual word. You can use these as the basis of the text animation timing. Of course, STT AI is less reliable than TTS, so the words may not always match the original input text sent to TTS, and some manual or smart match-up editing may be required. I just used this timing info in a prototype video app I'm working on last Friday. This word/phrase timing is commonly used to generate video subtitles.
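As a minimal sketch of the approach above: STT word timings (the shape below mirrors, in simplified form, the word info Google STT returns) can be mapped to per-word frame ranges for highlighting. The names are illustrative:

```typescript
// Simplified shape of a word timing as returned by an STT service,
// with start/end offsets in seconds into the audio track.
type WordTiming = { word: string; startTime: number; endTime: number };

// Convert STT word timings into per-word frame ranges, which a renderer
// can use to decide which word to highlight at the current frame.
export const wordFrames = (words: WordTiming[], fps: number) =>
  words.map(({ word, startTime, endTime }) => ({
    word,
    from: Math.floor(startTime * fps),
    to: Math.ceil(endTime * fps),
  }));
```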
@JonnyBurger What do you think about credentials for these TTS services? Should we store them in .env settings and, when rendering a video that uses TTS, check that they are set?
Good question!
We don't currently have support for .env files (but this is a great idea, I will create another issue for this). So the best way for the moment I think is using input props: https://www.remotion.dev/docs/parametrized-rendering#input-props
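A minimal sketch of guarding on such an input prop, assuming a prop named `ttsApiKey` (both the prop name and the helper are hypothetical, not part of Remotion's API):

```typescript
// Hypothetical shape of the composition's input props.
type InputProps = { ttsApiKey?: string };

// Fail early with an actionable message when the TTS key is missing,
// instead of failing later inside the TTS API call.
export const requireTtsKey = (props: InputProps): string => {
  if (!props.ttsApiKey) {
    throw new Error(
      'Missing TTS API key. Pass it via input props, e.g. --props=\'{"ttsApiKey":"..."}\''
    );
  }
  return props.ttsApiKey;
};
```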
Do you see any problem with using 2 different voices? One in preview (SpeechSynthesis) and another during render (Cloud Voice). I was thinking about it and found a problem: for rendering I need to download the file, but I can't do that during preview since I don't have access to the filesystem.
Cloud Voice audio output can be saved to the cloud file system as a cached audio file or downloaded locally. The audio file name could be a hash of the input text and voice params, so it is only recomputed when the text/params change. You can then stream the local or cloud audio file.
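The caching scheme above can be sketched with Node's built-in `crypto` module; the function name and `.mp3` extension are illustrative assumptions:

```typescript
import { createHash } from 'crypto';

// Derive a deterministic cache filename from the input text and voice
// parameters. The same text + params always produce the same name, so
// audio is only re-synthesized when either changes.
export const ttsCacheName = (
  text: string,
  voiceParams: Record<string, string | number>
): string => {
  const hash = createHash('sha256')
    .update(JSON.stringify({ text, voiceParams }))
    .digest('hex')
    .slice(0, 16); // a short prefix keeps filenames readable
  return `tts-${hash}.mp3`;
};
```

Before calling the cloud TTS API, check whether a file with this name already exists locally or in the bucket, and skip the API call if it does.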
One thing to keep in mind as you plan the roadmap for TTS (and perhaps STT) features is that there are important differences between what you can achieve with SpeechSynthesis and a corresponding cloud service. For simple TTS they could be swapped. Caching the audio files generated by the cloud service could improve the DX, which may be sufficient for a basic MVP solution. However, I recommend looking further down the roadmap to consider where these cloud services might take you as you design the current Remotion speech architecture and APIs.
Cloud TTS services not only generate audio but can also generate associated events timed to correlate with the audio. Different cloud services also offer different features.
Using STT, you can emit a stream of words and phrases with associated audio track timing. This is typically used to generate subtitles but can be repurposed for animation, either in conjunction with TTS or independently on any audio source containing speech. By pre-capturing these timed events, a frame render might compute the time until the next word/phrase occurs or since the last one occurred, and use this to animate word text or other related objects. This may be particularly important when handling text or audio translations, as the speech timing can vary widely between translations.
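The "time since the previous event / until the next event" computation described above can be sketched as a pure function over a pre-captured, ascending-sorted event list (the types and names are illustrative):

```typescript
// A pre-captured timed event, with its offset in seconds into the audio track.
type TimedEvent = { time: number };

// At playback time `now`, compute how long ago the previous event fired and
// how far away the next one is. Returns null when no such event exists.
// Assumes `events` is sorted ascending by time.
export const eventDistances = (events: TimedEvent[], now: number) => {
  const prev = [...events].reverse().find((e) => e.time <= now) ?? null;
  const next = events.find((e) => e.time > now) ?? null;
  return {
    sincePrev: prev ? now - prev.time : null,
    untilNext: next ? next.time - now : null,
  };
};
```

A per-frame render could feed the current frame's timestamp into this and drive easing curves from `sincePrev`/`untilNext`, which keeps the animation correct even when a translation changes the word timing.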
Just discovered that IBM's TTS service can deliver word timing, removing the need for STT; however, they only support 16 languages.
Azure TTS can emit a stream of viseme lip-sync events that can be used to animate lips on a 2D or 3D avatar.
Many existing audio & video files store timed events that could be very useful in animation.
In summary, rendering may need to support multiple tracks of timed events as input to the rendering process, with the ability to compute the time to the previous and next events and access the associated event metadata.
I've been working on it. Here is the example: https://github.com/FelippeChemello/Remotion-TTS-Example. IssueHunt asks me to make a PR, however I don't think that was the purpose of this issue. How can I proceed @JonnyBurger?
@FelippeChemello Wow thank you! This looks super awesome! I can only pay out if you submit a PR apparently, can you submit a bogus PR?
I will close it but pay it out immediately 🤝
Plus, it would be cool if I could become a collaborator on the repository, so that, should it be necessary, I can adjust the code, upgrade dependencies, or change the README. Our goal is to provide a streamlined set of templates for different use cases.
@jonnyburger has funded $55.00 to this issue.
@jonnyburger has rewarded $103.50 to @felippechemello. See it on IssueHunt
It would be great if there were support for converting text to speech automatically and including it as part of the video output.
IssueHunt Summary
#### [felippechemello](https://issuehunt.io/u/felippechemello) has been rewarded.

### Backers (Total: $115.00)
- [jonnyburger](https://issuehunt.io/u/jonnyburger) ($115.00)

### Submitted pull requests
- [#427 Bogus PR for https://github.com/JonnyBurger/remotion/issues/249](https://issuehunt.io/r/JonnyBurger/remotion/pull/427)

---

### Tips
- Check out the [IssueHunt explorer](https://issuehunt.io/r/JonnyBurger/remotion/) to discover more funded issues.
- Need some help from other developers? [Add your repositories](https://issuehunt.io/r/new) on IssueHunt to raise funds.