I have a partial answer using SpeechSDK
let lang = 'fr-FR'
let voice = { 'name': 'fr-FR-DeniseNeural', 'gender': 'female' }

// Build the SSML for the TTS request
let text = 'some text to speech'
let ssml = "<speak version='1.0' xml:lang='" + lang + "'>"
         + "<voice xml:lang='" + lang + "' xml:gender='" + voice.gender + "' name='" + voice.name + "'>"
         + text
         + "</voice>"
         + "</speak>"

// Speech Config
let speechConfig = SpeechSDK.SpeechConfig.fromSubscription('subscription-key', 'northeurope');
let stream = SpeechSDK.AudioOutputStream.createPullStream();
let audioConfig = SpeechSDK.AudioConfig.fromStreamOutput(stream);

// Collect the viseme events emitted during synthesis
let viseme = []
let synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig, audioConfig);
synthesizer.visemeReceived = (s, e) => {
  viseme.push(e)
}

synthesizer.speakSsmlAsync(ssml, (result) => {
  let payload = { text, viseme, result, stream }
  // forward the payload
});
A few questions/issues:

On the client side:
It works very, very well! The viseme data is tightly tied to the voice flow (2 French voices: 2 different flows); that's why I need to give Custom Neural Voice a try.
About Custom Neural Voice, it seems speakSsmlAsync() is not authorized ("Unsupported voice Jean-PhilippeNeural. websocket error code: 1007"), so I switched to speakTextAsync() with only the text, but the result is not good.
For now, I think there may be a change in the way the buffer is generated. I'll try to get a buffer from the audioConfig of the previous code (not sure how).
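Something like this might work to drain the pull stream into a buffer. This is an untested sketch; it assumes the JS SDK's PullAudioOutputStream.read(ArrayBuffer), which resolves with the number of bytes written into the buffer (0 at end of stream):

// Sketch: read the PullStream chunk by chunk until it is exhausted,
// then concatenate the chunks into a single Node.js Buffer.
async function streamToBuffer (stream) {
  const chunks = []
  while (true) {
    const chunk = new ArrayBuffer(4096)          // arbitrary chunk size
    const bytesRead = await stream.read(chunk)   // 0 means end of stream
    if (bytesRead === 0) break
    chunks.push(Buffer.from(chunk.slice(0, bytesRead)))
  }
  return Buffer.concat(chunks)
}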
Anyway, here is a cool demo with Neural and Custom Neural voices: https://www.youtube.com/watch?v=vLbQ2arXzRk
I don't do Tween transitions with morphTargetInfluences because the result is not good (I assume it's too fast, or I'm doing it wrong).
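For reference, a rough sketch of driving morphTargetInfluences from the viseme timeline with a three.js mesh. The visemeToMorph map (viseme ID to morph target name) is hypothetical, and the assumption that audioOffset is in 100-nanosecond ticks should be checked against the SDK docs:

// Sketch: step through the collected viseme events on each animation frame
// and switch the active morph target when its offset has been reached.
function playVisemes (mesh, visemes, visemeToMorph) {
  const start = performance.now()
  const tick = () => {
    const elapsedMs = performance.now() - start
    // find the most recent viseme whose offset has passed (events arrive in order)
    let current = null
    for (const v of visemes) {
      if (v.audioOffset / 10000 <= elapsedMs) current = v
      else break
    }
    if (current) {
      mesh.morphTargetInfluences.fill(0) // hard switch; a tween/lerp could smooth this
      const idx = mesh.morphTargetDictionary[visemeToMorph[current.visemeId]]
      if (idx !== undefined) mesh.morphTargetInfluences[idx] = 1
    }
    const last = visemes[visemes.length - 1]
    if (last && elapsedMs < last.audioOffset / 10000) requestAnimationFrame(tick)
  }
  requestAnimationFrame(tick)
}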
OK, so I got it right; posting it here in case it helps others:
const SpeechSDK = // get your Speech SDK
let subscription = // retrieve info from your subscription
let lang = // retrieve target language
let voice = subscription.voices[lang]; // I store voice config per lang per subscription

// Build the SSML for the TTS request
let ssml = "<speak version='1.0' xml:lang='" + lang + "'>"
         + "<voice xml:lang='" + lang + "' xml:gender='" + voice.gender + "' name='" + voice.name + "'>"
         + payload.tts
         + "</voice>"
         + "</speak>"

// Speech Config
let speechConfig = SpeechSDK.SpeechConfig.fromSubscription(subscription.key, subscription.region);
if (voice.endpointId) {
  speechConfig.endpointId = voice.endpointId // required for Custom Neural Voice
}
speechConfig.speechSynthesisLanguage = lang
speechConfig.speechSynthesisVoiceName = voice.name

let stream = SpeechSDK.AudioOutputStream.createPullStream();
let audioConfig = SpeechSDK.AudioConfig.fromStreamOutput(stream);

// Collect the viseme events emitted during synthesis
let viseme = []
let synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig, audioConfig);
synthesizer.visemeReceived = (s, e) => {
  viseme.push(e)
}

let handleResult = (result) => {
  synthesizer.close();
  payload.viseme = viseme
  payload.result = result
  if (result.audioData) {
    payload.buffer = Buffer.from(result.audioData)
  }
  // HERE YOU CALL YOUR CALLBACK WITH YOUR PAYLOAD
}

if (subscription.ssml) {
  synthesizer.speakSsmlAsync(ssml, handleResult, (err) => { /* handle errors */ });
} else {
  synthesizer.speakTextAsync(payload.tts, handleResult, (err) => { /* handle errors */ });
}
Very important things to understand, not so clear in the documentation:
- speechConfig.endpointId, NOT speechConfig.setProperty('endpointId', '...')
- speechSynthesisLanguage and speechSynthesisVoiceName, in case you don't use SSML
- deployId is a REST query parameter, but in the SDK it is endpointId
- a PullStream to get the buffer

Hi @JpEncausse, sorry for the late reply and glad to see you figured out the issue. You are right that speechConfig.endpointId needs to be set to use the custom voice.
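For anyone landing here, a condensed sketch of the working configuration; the key, region, and voice names are placeholders:

let speechConfig = SpeechSDK.SpeechConfig.fromSubscription('key', 'region')
speechConfig.endpointId = 'your-custom-voice-deployment-id' // the REST deploymentId, exposed as endpointId in the SDK
speechConfig.speechSynthesisLanguage = 'fr-FR'              // used when calling speakTextAsync without SSML
speechConfig.speechSynthesisVoiceName = 'YourCustomVoiceNeural'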
Hello, I don't understand how visemes should be used. Now I'd like to perform lip sync with an avatar (from ReadyPlayer.me).

Why? Because I want/need to cache the viseme data with the audio. The current API seems to be a callback fired while performing the live TTS, but it's too smart and restrictive. Maybe under the hood I can simply call a Text-to-Viseme and handle the timing myself?
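For the caching part, something like this sketch is what I have in mind; the file layout and the 100-ns-tick assumption for audioOffset are guesses, not from the SDK docs:

const fs = require('fs')

// After synthesis: persist the audio and a simplified viseme timeline side by side
function cacheResult (id, payload) {
  fs.writeFileSync(`cache/${id}.wav`, payload.buffer)
  const timeline = payload.viseme.map(e => ({
    offsetMs: e.audioOffset / 10000, // assuming audioOffset is in 100-ns ticks
    visemeId: e.visemeId
  }))
  fs.writeFileSync(`cache/${id}.json`, JSON.stringify(timeline))
}

// Later: load both and drive the avatar with your own timer
function loadCached (id) {
  return {
    audio: fs.readFileSync(`cache/${id}.wav`),
    visemes: JSON.parse(fs.readFileSync(`cache/${id}.json`))
  }
}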
Regards