remotion-dev / remotion

🎥 Make videos programmatically with React
https://remotion.dev

TTS example #249

Closed Purus closed 3 years ago

Purus commented 3 years ago


It would be great if there were support for converting text to speech automatically and including it as part of the video output.


IssueHunt Summary

#### [felippechemello](https://issuehunt.io/u/felippechemello) has been rewarded.

### Backers (Total: $115.00)

- [jonnyburger](https://issuehunt.io/u/jonnyburger) ($115.00)

### Submitted pull requests

- [#427 Bogus PR for https://github.com/JonnyBurger/remotion/issues/249](https://issuehunt.io/r/JonnyBurger/remotion/pull/427)

---

### Tips

- Check out the [IssueHunt explorer](https://issuehunt.io/r/JonnyBurger/remotion/) to discover more funded issues.
- Need some help from other developers? [Add your repositories](https://issuehunt.io/r/new) on IssueHunt to raise funds.
JonnyBurger commented 3 years ago

We discussed it a bit here:

https://github.com/JonnyBurger/remotion/issues/28#issuecomment-813941193

I found that there is no suitable browser API, but of course you can use a cloud service from Google or AWS. I'm gonna leave this open because it's a great reminder to build a starter template for it and release it 🙂

Purus commented 3 years ago

It would be great if we didn't have to depend on third-party cloud services. Maybe something like https://github.com/MikeyParton/react-speech-kit

JonnyBurger commented 3 years ago

Here we are constrained by the APIs of the browser. While you can trigger TTS, you cannot access the audio data, put it in an audio tag, export it as a file, etc. :( This makes it impossible to render in a video.

tohagan commented 3 years ago

Doing TTS well requires that it can coordinate with animation timing. You're likely to eventually combine automated translation with TTS, and since audio lengths will then vary, you need a way to time the associated animation to the length of the resulting audio. You may also wish to display or highlight specific words (which may be translated) in sync with the matching audio.

tohagan commented 3 years ago

If you're combining word animation with translation, you want to perform the translation in one API call if possible, not just for performance but to provide the best translation context, so that, for example, a sequence of related words being animated is translated together.

JonnyBurger commented 3 years ago

I am adding a $60 bounty for whoever submits a text-to-speech example that works in preview mode as well as render mode and that uses a cloud TTS API. (We should be able to release the example as part of Remotion, but of course you will retain credit.)

issuehunt-oss[bot] commented 3 years ago

@jonnyburger has funded $60.00 to this issue.


tohagan commented 3 years ago

To get the timing of the words spoken, you can reverse the process and use STT APIs (Google supports 125 languages and accents). Their API gives you a list of phrases spoken (the "alternatives" array) and, within each phrase, the timing of each individual word, so you can use these as the basis of the text animation timing. Of course, STT AI is less reliable than TTS, so the words may not always match the original input text sent to TTS, and some manual or smart match-up editing may be required. I used this timing info just last Friday in a prototype video app I'm working on. This word/phrase timing is commonly used to generate video subtitles.
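The idea above can be sketched as a small helper that converts STT word timings into frame ranges for animation. The `WordTiming` shape loosely mirrors the per-word `startTime`/`endTime` fields of Google Speech-to-Text, converted to seconds; the field and function names here are illustrative assumptions, not part of any Remotion API.

```typescript
interface WordTiming {
  word: string;
  start: number; // seconds into the audio track
  end: number;   // seconds into the audio track
}

interface WordFrames {
  word: string;
  fromFrame: number;
  toFrame: number;
}

// Map each spoken word to the frame range during which it should be
// displayed or highlighted, at a given composition frame rate.
function wordsToFrames(words: WordTiming[], fps: number): WordFrames[] {
  return words.map(({ word, start, end }) => ({
    word,
    fromFrame: Math.floor(start * fps),
    toFrame: Math.ceil(end * fps),
  }));
}
```

A composition could then compare the current frame against each `fromFrame`/`toFrame` pair to decide which word to highlight.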

FelippeChemello commented 3 years ago

@JonnyBurger What do you think about credentials for these TTS services? Should we store them in .env settings and, when rendering a video that uses TTS, check whether they are set?

JonnyBurger commented 3 years ago

Good question!

We don't currently have support for .env files (but this is a great idea, I will create another issue for this). So the best way for the moment I think is using input props: https://www.remotion.dev/docs/parametrized-rendering#input-props

FelippeChemello commented 3 years ago

Do you see any problem with using two different voices, one in preview (SpeechSynthesis) and another during render (a cloud voice)? I was thinking about it and found one problem: for rendering I need to download the audio file, but I can't do that during preview since I don't have access to the filesystem.

tohagan commented 3 years ago

Cloud voice audio output can be saved to the cloud file system as a cached audio file or downloaded locally. The audio file name could be a hash of the input text and voice params, so the audio is only recomputed when the text or params change. You can then stream the local or cloud audio file.
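The caching scheme above can be sketched as follows: derive the cached file name from a hash of the input text plus the voice parameters, so a new TTS request is only made when either changes. The `VoiceParams` fields and the file-name format are illustrative assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical voice parameters that influence the generated audio.
interface VoiceParams {
  voice: string;        // e.g. "en-US-Wavenet-D"
  speakingRate: number; // e.g. 1.0
}

// Stable cache key: identical text + params always yield the same file name,
// so the cached audio can be reused instead of calling the TTS API again.
function cacheFileName(text: string, params: VoiceParams): string {
  const hash = createHash("sha256")
    .update(text)
    .update(JSON.stringify(params))
    .digest("hex");
  return `tts-${hash.slice(0, 16)}.mp3`;
}
```

Before calling the cloud API, check whether a file with this name already exists locally or in the cloud bucket and stream it if so.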

tohagan commented 3 years ago

Swapping Chrome APIs with Cloud APIs

One thing to keep in mind as you plan the roadmap for TTS (and perhaps STT) features is that there are important differences between what you can achieve with SpeechSynthesis and a corresponding cloud service. For simple TTS they could be swapped. Caching audio files generated by the cloud service could improve the DX, which may be sufficient for a basic MVP solution. However, I recommend that you look further down the roadmap and consider where these cloud services might take you as you design the current Remotion speech architecture and APIs.

Cloud service audio events

Cloud TTS services not only generate audio; they can also be used to generate associated events timed to correlate with the audio. Different cloud services also deliver different features.

Word/Phrase events

Using STT, you can obtain a stream of words and phrases with associated audio-track timing. This is typically used to generate subtitles but can be repurposed for animation, either in conjunction with TTS or independently on any audio source containing speech. By pre-capturing these timed events, a frame render might wish to compute the time until the next word/phrase occurs, or since the last one occurred, and use this to animate word text or other related objects. This may be particularly important when handling text or audio translations, as the speech timing can vary widely between translations.
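The "time since the previous event / until the next event" computation described above can be sketched as a pure function over a sorted list of pre-captured events. The `TimedEvent` shape is an assumption for illustration, not a Remotion API.

```typescript
// A pre-captured event on the audio timeline (e.g. a word boundary),
// with its position in seconds. Events are assumed sorted by time.
interface TimedEvent {
  time: number;
}

// At a given frame, compute the seconds elapsed since the previous event
// and the seconds remaining until the next one (null if none exists).
// An animation can then ease in/out against these distances.
function eventDistances(
  events: TimedEvent[],
  frame: number,
  fps: number
): { sincePrev: number | null; untilNext: number | null } {
  const t = frame / fps;
  const prev = events.filter((e) => e.time <= t).pop();
  const next = events.find((e) => e.time > t);
  return {
    sincePrev: prev ? t - prev.time : null,
    untilNext: next ? next.time - t : null,
  };
}
```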

Just discovered that IBM's TTS service can deliver word timing, removing the need for STT; however, it only supports 16 languages.

Lip Sync events

Azure TTS can emit a stream of viseme (lip-sync) events that can be used to animate lips on a 2D or 3D avatar.

Audio / Video track events

Many existing audio and video files store timed events that could be very useful in animation.

In summary, rendering may need to support multiple tracks of timed events as input to the rendering process, with the ability to compute the time to the previous and next events and to access the associated event metadata.

FelippeChemello commented 3 years ago

I have been working on it. Here is the example: https://github.com/FelippeChemello/Remotion-TTS-Example. IssueHunt asks me to make a PR, but I don't think that was the purpose of this issue. How can I proceed @JonnyBurger?

JonnyBurger commented 3 years ago

@FelippeChemello Wow thank you! This looks super awesome! I can only pay out if you submit a PR apparently, can you submit a bogus PR?

I will close it but pay it out immediately 🤝

Plus, it would be cool if I could become a collaborator on the repository, so that, should it be necessary, I can adjust the code, upgrade dependencies, or change the README. Our goal is to provide a streamlined set of templates for different use cases.

issuehunt-oss[bot] commented 3 years ago

@jonnyburger has funded $55.00 to this issue.


issuehunt-oss[bot] commented 3 years ago

@jonnyburger has rewarded $103.50 to @felippechemello. See it on IssueHunt