Closed photkey closed 2 years ago
Example usage:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="{voice}">
<mstts:express-as style="{emotion}">
{text}
</mstts:express-as>
</voice>
</speak>
1
00:00:00,596 --> 00:00:04,744
Hello world, I am very happy!

2
00:00:04,844 --> 00:00:06,744
Wow, does this really work?

3
00:00:06,744 --> 00:00:10,744
I am so sorry for your loss...
edge_tts{emotion:depressed}
edge_srt_to_speech --ssml-template example.template example.srt example.mp3 --ssml-default-variables emotion:cheerful
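To illustrate the per-cue override in the third subtitle above, here is a minimal sketch of how an `edge_tts{...}` directive could be pulled out of a cue's text. It is not the project's actual parser: the function name `split_directive`, the regular expression, and the support for several comma-separated pairs inside one directive are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: matches edge_tts{key:value} (or several
# comma-separated pairs) anywhere inside a cue's text.
DIRECTIVE = re.compile(r"edge_tts\{([^}]*)\}")


def split_directive(cue_text):
    """Return (spoken_text, overrides) for one subtitle cue."""
    overrides = {}
    for body in DIRECTIVE.findall(cue_text):
        for pair in body.split(","):
            if not pair.strip():
                continue
            # Split on the first colon only, so values may contain colons.
            key, _, value = pair.partition(":")
            overrides[key.strip()] = value.strip()
    # Remove the directive so it is not read aloud.
    spoken = DIRECTIVE.sub("", cue_text).strip()
    return spoken, overrides


cue = "I am so sorry for your loss...\nedge_tts{emotion:depressed}"
spoken, overrides = split_directive(cue)
print(spoken)
print(overrides)
```

The directive is stripped from the spoken text, and a cue without any directive simply yields an empty override dictionary.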
By default the following variables are expanded based on your template (whether it includes it or not):
BTW, I noticed I accidentally broke everything that doesn't use a template. Will fix soon.
Wow, that came true so quickly, you really are Superman, that's awesome! Trust me, this project is going to be a hit! I'll test the new features first.
By the way, I've been struggling with not having many SRT files to test with; SRT files downloaded from movies aren't really suitable for this kind of dubbing. Today it occurred to me that I could filter YouTube for tutorial-type videos with SRT subtitles, so I could easily get a lot of suitable SRT files and then re-dub them with edge-srt-to-speech to test the effect.
@photkey So, did it work fine or would you like some adjustments? In the future I think I'll work on a GUI (either something with TK or a web interface, probably the latter).
Sorry, I didn't have time to test it yesterday; I will test it today. I have a small question about how to use it. Here is the example you gave.
CLI command: edge_srt_to_speech --ssml-template example.template example.srt example.mp3 --ssml-default-variables emotion:cheerful
Can --ssml-template and --ssml-default-variables be used at the same time, or only one at a time? If both, which one has higher priority?
I'm not sure which features you want to implement in a GUI application. If it's just setting those parameters, I'd personally lean toward a desktop application built with TK or PySide6, with support for passing in parameters at startup. My ultimate ideal for this project would be an editor like https://speech.microsoft.com/audiocontentcreation that makes it very easy to refine every sentence in the SRT, on top of the existing functionality, though of course that is a long way off.
--ssml-default-variables sets the default variables for the template, which you can then override with the regular edge_tts{blah:y} syntax.
So in the example, the default value for emotion is cheerful, which could be overridden with edge_tts{emotion:somethingelse}.
You need --ssml-template to be able to use --ssml-default-variables.
Also, to specify more than one variable you need to delimit them with commas. So for example, --ssml-default-variables emotion:cheerful,rate:x-fast
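The precedence described here (defaults from the command line, overridden per cue) can be sketched as follows. This is only an illustration under stated assumptions, not the tool's real code: `parse_variables` and `resolve` are hypothetical names, and the only facts taken from the thread are the `key:value,key:value` format and the rule that a per-cue `edge_tts{...}` value wins over the default.

```python
def parse_variables(spec):
    """Parse a string like 'emotion:cheerful,rate:x-fast' into a dict."""
    variables = {}
    for pair in spec.split(","):
        if not pair.strip():
            continue
        # Split on the first colon only; reject entries without one.
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"expected key:value, got {pair!r}")
        variables[key.strip()] = value.strip()
    return variables


def resolve(defaults, overrides):
    """Merge defaults with per-cue overrides; the override wins on conflict."""
    return {**defaults, **overrides}


defaults = parse_variables("emotion:cheerful,rate:x-fast")
# A cue carrying edge_tts{emotion:depressed} would supply this override:
print(resolve(defaults, {"emotion": "depressed"}))
# A cue without a directive keeps the defaults unchanged:
print(resolve(defaults, {}))
```

Keeping the merge as a plain dictionary update makes the priority rule easy to see: whatever is set on the individual cue replaces the command-line default for that cue only.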
I've tested it and haven't encountered any problems. I just have one question to confirm with you. Since I knew nothing about SSML before, I read through Microsoft's SSML documentation over the past two days. It says that attributes not supported by the currently selected voice are automatically ignored. My understanding is that this does not break speech generation; the generated audio simply won't reflect that attribute. Is my understanding correct? If so, can the following SSML template be applied to all voices without error? (I have not encountered any errors during my testing.)
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="{lang}">
<voice name="{voice}">
<mstts:express-as role="{role}" style="{style}" styledegree="{styledegree}">
<prosody pitch="{pitch}" rate="{rate}" volume="{volume}">
{text}
</prosody>
</mstts:express-as>
</voice>
</speak>
Yes
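For anyone who wants to sanity-check a template like the one above before running it, here is a rough local sketch. It lists the placeholders a template uses, fills them in with XML-escaped values, and confirms that the result is well-formed XML. It says nothing about whether the speech service accepts the result for a given voice. The helper names `placeholders` and `render` and all sample values are assumptions for illustration, and the tool itself may expand templates differently.

```python
import string
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

TEMPLATE = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="{lang}">
<voice name="{voice}">
<mstts:express-as role="{role}" style="{style}" styledegree="{styledegree}">
<prosody pitch="{pitch}" rate="{rate}" volume="{volume}">
{text}
</prosody>
</mstts:express-as>
</voice>
</speak>"""


def placeholders(template):
    """Return the set of {name} fields used in the template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def render(template, values):
    """Fill the template, escaping values so they cannot break the XML."""
    missing = placeholders(template) - set(values)
    if missing:
        raise KeyError(f"no value for: {sorted(missing)}")
    # Escape &, <, > and double quotes (values also land inside attributes).
    safe = {k: escape(str(v), {'"': "&quot;"}) for k, v in values.items()}
    ssml = template.format(**safe)
    ET.fromstring(ssml)  # raises ParseError if the result is not well-formed
    return ssml


# Illustrative values only; what a given voice supports varies.
values = {
    "lang": "en-US", "voice": "en-US-AriaNeural", "role": "Girl",
    "style": "cheerful", "styledegree": "1", "pitch": "+0Hz",
    "rate": "+0%", "volume": "+0%", "text": "Tom & Jerry say hello!",
}
print(sorted(placeholders(TEMPLATE)))
print(render(TEMPLATE, values))
```

Escaping matters because subtitle text often contains characters such as `&` or `<`, which would otherwise produce invalid SSML.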
--ssml_template path/example.xml
where "{text}" is replaced by the current sentence.
--ssml_elements "voice:en-US-SaraNeural,speed:+1%,style:cheerful"
This format is potentially simpler for the user to write, but because it first has to be written into the SSML template, it requires more work on your part. For a single sentence, in addition to supporting SSML templates, add support for using a full SSML file instead of an SSML template, enabling finer control over how the voice reads it aloud.
These features can be tricky to implement, so please decide which parts you want to implement based on the level of difficulty and your own time.