Viseme vs Custom Neuro Voice

JpEncausse commented 2 years ago

Hello, I don't understand how Viseme should be used

I call the REST API for Custom Neuro Voice
Send and Cache the audio to my SmartMirror

Now I'd like to perform lipsync with an Avatar (from ReadyPlayer.me)

Can I call a REST API for Viseme ? instead of usind the SpeechSDK ?
Can I receive a full JSON { timing : VisemeID }
Or can I have a Viseme data along (custom) NeuroVoice REST call ?

Why ?

Because I want/need to cache Viseme data with audio.

The current API seems to be a callback called when performing the live TTS but it's too smart and restrictive. May be underhood I can simple call the Text-To-Viseme and handle the timing myself ?

Regards

JpEncausse commented 2 years ago

I have a partial answer using SpeechSDK

let lang      = 'fr-FR'
let voice     = {'name' : 'fr-FR-DeniseNeural', 'gender' : 'female'}

// Retrieve TTS
let text = 'some text to speech'
let ssml = "<speak version='1.0' xml:lang='"+lang+"'>"
         + "<voice xml:lang='"+lang+"' xml:gender='"+voice.gender+"' name='"+voice.name+"'>"
         + text
         + "</voice>"
         + "</speak>"

// Speech Config
let speechConfig = SpeechSDK.SpeechConfig.fromSubscription('subscription-key', 'northeurope');
let stream = SpeechSDK.AudioOutputStream.createPullStream();
let audioConfig  = SpeechSDK.AudioConfig.fromStreamOutput(stream);

let viseme = []
let synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig, audioConfig);
synthesizer.visemeReceived = (s, e) => {
    viseme.push(e)
}
synthesizer.speakSsmlAsync(ssml, (result) => {
    let result = { text, viseme, result, stream }
    // forward the result
});

Few question/issues :

How does it works with Custom Neuro Voice ? I have to dig in the documentation to config with a deployId
I store the result into an Array then convert to Occulus / ReadyPlayer.Me format (it works well)
I receive a stream I may not use (have to think about it)

On the client side :

I send buffer / text / viseme to the browser
Then start playing the audio buffer
And start a timer with setTimeout() to next offset
for each step, I update ThreeJS / morphTargetInfluences

It works very very well ! and it's really tied to the voice flow (2 french voice : 2 different flow) that's why I need to give a try with Custom Neuro Voice

JpEncausse commented 2 years ago

About Custom Neuro Voice, it seems the speakSsmlAsync() is not authorized "Unsupported voice Jean-PhilippeNeural. websocket error code: 1007" so I switch to speakTextAsync() with only the text but the result si not good.

But as for now

I use the REST API with SSML to generete the CustomNeuroVoice audio
then the previous code for the Viseme

I think that may be, there is a change in the way the buffer is generated, I'll try to get a buffer from the audioConfig of the previous code (don't sure how)

Anyway here is a cool demo with Neuro and CustomNeuro https://www.youtube.com/watch?v=vLbQ2arXzRk

I don't have Tween transition with morphTargetInfluences because the result is not good (I assume it's too fast or I do it wrong)

JpEncausse commented 2 years ago

Ok so I got it right if it could help others

const SpeechSDK = // get your Speech SDK
let subscription = // retrieve info from your subscription
let lang    = // retrieve target langauge
let voice  = subscription.voices[lang]; // I store voice config per lang per subscription

// Retrieve TTS
let ssml = "<speak version='1.0' xml:lang='"+lang+"'>"
         + "<voice xml:lang='"+lang+"' xml:gender='"+voice.gender+"' name='"+voice.name+"'>"
         + payload.tts
         + "</voice>"
         + "</speak>"

// Speech Config
let speechConfig = SpeechSDK.SpeechConfig.fromSubscription(subscription.key, subscription.region);
if (voice.endpointId){ 
    speechConfig.endpointId = voice.endpointId
}
speechConfig.speechSynthesisLanguage = lang
speechConfig.speechSynthesisVoiceName = voice.name

let stream       = SpeechSDK.AudioOutputStream.createPullStream();
let audioConfig  = SpeechSDK.AudioConfig.fromStreamOutput(stream);

let viseme = []
let synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig, audioConfig);

synthesizer.visemeReceived = (s, e) => {
    viseme.push(e)
}

let handleResult = (result) => {
    synthesizer.close();
    payload.viseme = viseme
    payload.result = result
    if (result.audioData){
        payload.buffer = Buffer.from(result.audioData) 
    }
    // HERE YOU CALL YOUR CALL BACK WITH YOUR PAYLOAD
}

if (subscription.ssml){
    synthesizer.speakSsmlAsync(ssml, handleResult, (err) => { /* handle errors */) });
} else {
    synthesizer.speakTextAsync(msg.payload.tts, handleResult, (err) => { /* handler errors */});
}

Very important things to understand, not so clear in documentation

You MUST call speechConfig.endpointId NOT speechConfig.setProperty('endpointId', '...');
DO the same for speechSynthesisLanguage and speechSynthesisVoiceName in case you don't use SSML
Don't care about deployId it's a REST query parameter but in SDK it is a endpointId
Remember the trick with PullStream to get the buffer

yulin-li commented 2 years ago

Hi @JpEncausse, sorry for the late reply and glad to see you figure out the issue. You are right that speechConfig.endpointId needs to be set to use the custom voice.

microsoft / cognitive-services-speech-sdk-js

Viseme vs Custom Neuro Voice #523