rany2 / edge-srt-to-speech

Convert SubRip to speech using Microsoft Edge's TTS service
https://pypi.org/project/edge-srt-to-speech/
GNU General Public License v3.0

[Feature Request] Support for SSML templates #2

Closed photkey closed 2 years ago

photkey commented 2 years ago

--ssml_template path/example.xml

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <mstts:express-as style="cheerful">
            {text}
        </mstts:express-as>
    </voice>
</speak>

where "{text}" is replaced by the current sentence.
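
A minimal sketch of that substitution, assuming the template is a plain string with a literal {text} placeholder. The function name render_sentence is hypothetical and not part of edge-srt-to-speech; escaping the sentence is an added precaution so that characters such as & or < in a subtitle cannot break the XML.

```python
from xml.sax.saxutils import escape

TEMPLATE = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <mstts:express-as style="cheerful">
            {text}
        </mstts:express-as>
    </voice>
</speak>"""


def render_sentence(template: str, sentence: str) -> str:
    """Hypothetical helper: put one subtitle sentence into the template."""
    # Escape &, < and > so the sentence stays plain character data.
    return template.replace("{text}", escape(sentence))


ssml = render_sentence(TEMPLATE, "Tom & Jerry <3")
print(ssml)
```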

--ssml_elements "voice:en-US-SaraNeural,speed:+1%,style:cheerful"

This format is potentially simpler for the user to write, but because it first has to be converted into an SSML template, it requires more work on your part.
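
To illustrate the extra work involved, here is a rough sketch of turning such a shorthand string into SSML. Everything in it is an assumption made for illustration: the helper names parse_elements and build_ssml, the mapping of "speed" to a prosody rate attribute, and the fallback voice are not taken from the project.

```python
from xml.sax.saxutils import escape, quoteattr


def parse_elements(spec: str) -> dict[str, str]:
    """Split 'key:value,key:value' into a dict, splitting on the first colon only."""
    pairs = (item.split(":", 1) for item in spec.split(",") if item.strip())
    return {key.strip(): value.strip() for key, value in pairs}


def build_ssml(elements: dict[str, str], text: str) -> str:
    """Wrap the text in the SSML elements implied by the shorthand."""
    body = escape(text)
    # Assumption: 'speed' corresponds to <prosody rate="...">.
    if "speed" in elements:
        body = f"<prosody rate={quoteattr(elements['speed'])}>{body}</prosody>"
    # Assumption: 'style' corresponds to <mstts:express-as style="...">.
    if "style" in elements:
        style = quoteattr(elements["style"])
        body = f"<mstts:express-as style={style}>{body}</mstts:express-as>"
    # Assumption: fall back to the voice used in the example template.
    voice = quoteattr(elements.get("voice", "en-US-AriaNeural"))
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f"<voice name={voice}>{body}</voice></speak>"
    )


elements = parse_elements("voice:en-US-SaraNeural,speed:+1%,style:cheerful")
ssml = build_ssml(elements, "That'd be just amazing!")
print(ssml)
```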

Beyond per-sentence SSML templates, please also consider supporting full SSML files in place of templates, which would allow finer control over how the text is read aloud.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <mstts:express-as style="cheerful">
            That'd be just amazing!
        </mstts:express-as>
    </voice>
</speak>

These features can be tricky to implement, so please decide which parts you want to implement based on the level of difficulty and your own time.

rany2 commented 2 years ago

Example usage:

example.template:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="{voice}">
        <mstts:express-as style="{emotion}">
            {text}
        </mstts:express-as>
    </voice>
</speak>

example.srt:

1
00:00:00,596 --> 00:00:04,744
Hello world, I am very happy!

2
00:00:04,844 --> 00:00:06,744
Wow, does this really work?

3
00:00:06,744 --> 00:00:10,744
I am so sorry for your loss...
edge_tts{emotion:depressed}

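
As a rough illustration of how the edge_tts{...} line in the third subtitle can be separated from the spoken text, here is a small sketch. It is not the project's actual parser: the function name parse_cues, the regular expression, and the assumption that a directive always sits on its own line are all made up for this example.

```python
import re

SRT = """1
00:00:00,596 --> 00:00:04,744
Hello world, I am very happy!

2
00:00:04,844 --> 00:00:06,744
Wow, does this really work?

3
00:00:06,744 --> 00:00:10,744
I am so sorry for your loss...
edge_tts{emotion:depressed}
"""

# Assumption: a directive occupies a whole line of the subtitle text.
DIRECTIVE = re.compile(r"^edge_tts\{(.*)\}$")


def parse_cues(srt: str):
    """Yield (timing, spoken_text, variables) for each subtitle block."""
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        timing, text_lines, variables = lines[1], [], {}
        for line in lines[2:]:
            match = DIRECTIVE.match(line.strip())
            if match:
                # Assumed directive format: comma-separated key:value pairs.
                for pair in match.group(1).split(","):
                    key, value = pair.split(":", 1)
                    variables[key.strip()] = value.strip()
            else:
                text_lines.append(line)
        yield timing, " ".join(text_lines), variables


cues = list(parse_cues(SRT))
for timing, text, variables in cues:
    print(timing, "|", text, "|", variables)
```
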
rany2 commented 2 years ago

By default the following variables are expanded based on your template (whether it includes it or not):

rany2 commented 2 years ago

BTW, I noticed I accidentally broke everything that doesn't use a template. Will fix soon.

rany2 commented 2 years ago

Fixed in https://github.com/rany2/edge-srt-to-speech/commit/60b1dc85fb877576035ebaa0e32e346d564a882a

photkey commented 2 years ago

Wow, that came true so quickly! You really are Superman, that's awesome! Trust me, this project is going to be a hit! I'll test the new features first.

By the way, I have been struggling to find enough SRT files to test with; subtitles downloaded from movies aren't really suitable for this kind of dubbing. Today it occurred to me that I could filter YouTube for tutorial-type videos that have SRT subtitles. That would give me plenty of suitable SRT files, which I could then re-dub with edge-srt-to-speech to test the result.

rany2 commented 2 years ago

@photkey So, did it work fine or would you like some adjustments? In the future I think I'll work on a GUI (either something with Tk or a web interface, probably the latter).

photkey commented 2 years ago

Sorry, I didn't have time to test it yesterday; I will test it today. I have a small question about usage. Here is the example you gave.

CLI command: edge_srt_to_speech --ssml-template example.template example.srt example.mp3 --ssml-default-variables emotion:cheerful

Can --ssml-template and --ssml-default-variables be used at the same time, or only one at a time? If both, which one takes priority?

photkey commented 2 years ago

I'm not sure what features you want to implement in a GUI application. If it's just for setting those parameters, I would personally prefer a desktop application built with Tk or PySide6 that can also be started with parameters passed in. My ultimate ideal for this project would be an editor like https://speech.microsoft.com/audiocontentcreation that makes it very easy to refine every sentence in the SRT, on top of the existing functionality. That is, of course, a long way off.

rany2 commented 2 years ago

--ssml-default-variables sets the default variables for the template, which you can then override with the regular edge_tts{blah:y} syntax.

So in the example, the default value for emotion is cheerful, which can be overridden with edge_tts{emotion:somethingelse}.

rany2 commented 2 years ago

You need --ssml-template to be able to use --ssml-default-variables.

rany2 commented 2 years ago

Also, to specify more than one variable, separate them with commas. For example: --ssml-default-variables emotion:cheerful,rate:x-fast
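
The defaults-and-override behaviour described in the last few comments can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's real code: parse_variables and render are hypothetical names, and supplying voice through the defaults string is only a convenience for this example.

```python
from xml.sax.saxutils import escape

TEMPLATE = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="{voice}">
        <mstts:express-as style="{emotion}">
            {text}
        </mstts:express-as>
    </voice>
</speak>"""


def parse_variables(spec: str) -> dict:
    """Parse 'key:value,key:value' as passed to --ssml-default-variables."""
    return dict(pair.split(":", 1) for pair in spec.split(",") if pair)


def render(template: str, defaults: dict, overrides: dict, text: str) -> str:
    """Per-subtitle overrides win over defaults; unused variables are ignored."""
    variables = {**defaults, **overrides, "text": text}
    # Escape values so they are safe as both attribute values and text.
    safe = {k: escape(v, {'"': "&quot;"}) for k, v in variables.items()}
    return template.format(**safe)


# Assumption: 'voice' is given here only so the template can be filled in.
defaults = parse_variables("voice:en-US-AriaNeural,emotion:cheerful,rate:x-fast")

happy = render(TEMPLATE, defaults, {}, "Hello world, I am very happy!")
sad = render(TEMPLATE, defaults, {"emotion": "depressed"},
             "I am so sorry for your loss...")
print(sad)
```

Here rate is parsed but has no effect, because this particular template has no {rate} placeholder.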

photkey commented 2 years ago

I've tested it and haven't encountered any problems. I just have one question to confirm with you. Since I knew nothing about SSML before, I read through Microsoft's SSML documentation over the past two days. It says that attributes not supported by the currently selected voice are automatically ignored. My understanding is that this does not cause an error; the generated audio simply won't reflect that attribute. Is that correct? If so, can the following SSML template be applied to all voices without error? (I have not encountered any errors during my testing.)

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="{lang}">
    <voice name="{voice}">
        <mstts:express-as role="{role}" style="{style}" styledegree="{styledegree}">
        <prosody pitch="{pitch}" rate="{rate}" volume="{volume}">
            {text}
        </prosody>
        </mstts:express-as>
    </voice>
</speak>

rany2 commented 2 years ago

Yes
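
For anyone who wants to sanity-check such a catch-all template locally before sending it to the service, here is a small sketch. It only lists the placeholders and confirms the filled-in result is well-formed XML; it says nothing about what the TTS service accepts. The values in the dictionary are illustrative assumptions, not recommendations.

```python
import string
import xml.etree.ElementTree as ET

TEMPLATE = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="{lang}">
    <voice name="{voice}">
        <mstts:express-as role="{role}" style="{style}" styledegree="{styledegree}">
        <prosody pitch="{pitch}" rate="{rate}" volume="{volume}">
            {text}
        </prosody>
        </mstts:express-as>
    </voice>
</speak>"""


def placeholders(template: str) -> set:
    """Return the names of all {placeholder} fields in the template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


# Illustrative values only (assumptions for this example).
values = {
    "lang": "en-US", "voice": "en-US-AriaNeural", "role": "Girl",
    "style": "cheerful", "styledegree": "1", "pitch": "default",
    "rate": "default", "volume": "default", "text": "Hello world",
}

missing = placeholders(TEMPLATE) - set(values)
ssml = TEMPLATE.format(**values)
root = ET.fromstring(ssml)  # raises ParseError if the result is not well-formed
print(sorted(placeholders(TEMPLATE)), missing)
```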