Ready Player Me with Microsoft Speech SDK

sabithpocker commented 1 year ago

Hey,

First of all, really amazing work! I came here from some of your YouTube videos, which show very interesting stuff with RPM and Unity.

Following this tutorial, I built an example where I talk to OpenAI directly using the Microsoft Speech SDK.

Most of the work is done, but my lip sync is not that great. Microsoft gives me the lip-sync data as blend shape frames, delivered in chunks that each carry a FrameIndex:

{
    "FrameIndex":0,
    "BlendShapes":[
        [0.021,0.321,...,0.258],
        [0.045,0.234,...,0.288],
        ...
    ]
}

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme?pivots=programming-language-csharp&tabs=3dblendshapes#viseme-id

Can you suggest a way to use this with the Ready Player Me model?

Please ignore this if this is not something that interests you!!

srcnalt commented 1 year ago

Hi @sabithpocker, this project was built to work directly with Google MediaPipe. But I have access to Azure at the moment, so I think I can try this. There might be some mapping differences. If MS provides a way to map these array elements to blend shapes, it should be easy.

sabithpocker commented 1 year ago

@srcnalt Apologies for the late reply. I am not working on this code full-time, only as and when I get time.

Initially I tried mapping with different setups, and later ended up doing no mapping at all: the Azure values are applied to the morph targets in the order they arrive. This works somewhat okay now, but it's not that great. I am using a very long URL: https://models.readyplayer.me/0000000000000000.glb?morphTargets=eyeBlinkLeft,eyeLookDownLeft,eyeLookInLeft,eyeLookOutLeft,eyeLookUpLeft,eyeSquintLeft,eyeWideLeft,eyeBlinkRight,eyeLookDownRight,eyeLookInRight,eyeLookOutRight,eyeLookUpRight,eyeSquintRight,eyeWideRight,jawForward,jawLeft,jawRight,jawOpen,mouthClose,mouthFunnel,mouthPucker,mouthLeft,mouthRight,mouthSmileLeft,mouthSmileRight,mouthFrownLeft,mouthFrownRight,mouthDimpleLeft,mouthDimpleRight,mouthStretchLeft,mouthStretchRight,mouthRollLower,mouthRollUpper,mouthShrugLower,mouthShrugUpper,mouthPressLeft,mouthPressRight,mouthLowerDownLeft,mouthLowerDownRight,mouthUpperUpLeft,mouthUpperUpRight,browDownLeft,browDownRight,browInnerUp,browOuterUpLeft,browOuterUpRight,cheekPuff,cheekSquintLeft,cheekSquintRight,noseSneerLeft,noseSneerRight,tongueOut&textureAtlas=1024
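One way to make the "no mapping" approach less dependent on the order in which the model happens to store its morph targets is to apply each value by name. The following is only a sketch: it assumes the Azure frame values arrive in the same order as the names in the URL above, that any extra trailing values can be ignored, and that the mesh exposes a name-to-index dictionary (as three.js does with morphTargetDictionary). The helper name applyFrameByName is made up for illustration.

```typescript
// Morph target names in the order used in the URL above.
// Assumption: Azure's per-frame values follow this same order.
const AZURE_BLEND_SHAPE_NAMES: string[] = [
  "eyeBlinkLeft", "eyeLookDownLeft", "eyeLookInLeft", "eyeLookOutLeft",
  "eyeLookUpLeft", "eyeSquintLeft", "eyeWideLeft", "eyeBlinkRight",
  "eyeLookDownRight", "eyeLookInRight", "eyeLookOutRight", "eyeLookUpRight",
  "eyeSquintRight", "eyeWideRight", "jawForward", "jawLeft", "jawRight",
  "jawOpen", "mouthClose", "mouthFunnel", "mouthPucker", "mouthLeft",
  "mouthRight", "mouthSmileLeft", "mouthSmileRight", "mouthFrownLeft",
  "mouthFrownRight", "mouthDimpleLeft", "mouthDimpleRight",
  "mouthStretchLeft", "mouthStretchRight", "mouthRollLower",
  "mouthRollUpper", "mouthShrugLower", "mouthShrugUpper", "mouthPressLeft",
  "mouthPressRight", "mouthLowerDownLeft", "mouthLowerDownRight",
  "mouthUpperUpLeft", "mouthUpperUpRight", "browDownLeft", "browDownRight",
  "browInnerUp", "browOuterUpLeft", "browOuterUpRight", "cheekPuff",
  "cheekSquintLeft", "cheekSquintRight", "noseSneerLeft", "noseSneerRight",
  "tongueOut",
];

// Hypothetical helper: copies one frame onto the mesh influences by name.
// Names missing from the dictionary are skipped; values beyond the name
// list are ignored.
function applyFrameByName(
  frame: number[],
  dictionary: Record<string, number>,
  influences: number[],
): void {
  AZURE_BLEND_SHAPE_NAMES.forEach((name, i) => {
    const target = dictionary[name];
    if (target !== undefined && i < frame.length) {
      influences[target] = frame[i];
    }
  });
}
```

With a three.js mesh this would be called with the mesh's morphTargetDictionary and morphTargetInfluences, so the influences array is updated in place instead of being replaced.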

If you are interested in this and want to take a look, I can create a PR with the code I wrote on top of yours as a reference, or send it as a zip. I'll omit the API keys for Azure and OpenAI.

sabithpocker commented 1 year ago

Here is some relevant code if you want to take a quick look:

    useFrame(() => {
        if (audioPlaying && player && masterViseme && masterViseme.length > 0) {
            // Resume playback if the player was paused.
            if (player.privIsPaused) {
                player.resume();
            }
            // Scale the current audio time to an index into the collected frames.
            blendShapeFrame = Math.round(audioFrametoBlendShapeFrame(player.currentTime, 0, duration.duration, 0, masterViseme.length));
            // Apply the frame, or fall back to a neutral face if the index has no data.
            headMesh[0].morphTargetInfluences = masterViseme[blendShapeFrame] && masterViseme[blendShapeFrame].length > 0 ? masterViseme[blendShapeFrame] : Array(52).fill(0);
        }
    });

VISEME RECEIVED EVENT

    synthesizer.visemeReceived = function (s: any, e: any) {
      // Each event carries a chunk of frames; every frame is an array of blend shape weights.
      let animationData: {BlendShapes: number[][], FrameIndex: number} = JSON.parse(e.animation);
      masterViseme.push(...animationData.BlendShapes);
    };
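Since each chunk carries a FrameIndex, the frames could also be stored at explicit positions instead of being appended in arrival order. The sketch below assumes that FrameIndex is the position of the chunk's first frame within the whole utterance; the sample response (FrameIndex 249) suggests this, but nothing in this thread confirms it. The names AzureAnimationChunk and storeChunk are hypothetical.

```typescript
// Shape of the parsed animation payload, as seen in the sample response.
interface AzureAnimationChunk {
  FrameIndex: number;
  BlendShapes: number[][];
}

// Hypothetical helper: writes each frame of a chunk at FrameIndex + offset,
// so late or repeated chunks land in the right slot.
// Assumption: FrameIndex is the index of the chunk's first frame.
function storeChunk(frames: number[][], animationJson: string): void {
  const chunk: AzureAnimationChunk = JSON.parse(animationJson);
  chunk.BlendShapes.forEach((frame, offset) => {
    frames[chunk.FrameIndex + offset] = frame;
  });
}
```

Inside the event handler this would replace the push call, for example storeChunk(masterViseme, e.animation).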

Sample response for blendshapes:

{"FrameIndex":249,"BlendShapes":[[0.423,0.215,0,0.008,0,0.208,0,0.423,0.214,0.119,0,0,0.208,0,0.05,0.021,0,0.172,0.132,0.116,0.065,0.008,0.003,0.005,0.015,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.076,0.076,0.106,0,0,0.016,0.041,0.044,0.029,0.029,0,0.015,0,0.005],[0.502,0.282,0,0.002,0,0.222,0,0.502,0.281,0.112,0,0,0.223,0,0.05,0.021,0,0.172,0.133,0.116,0.066,0.008,0.003,0.005,0.015,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.074,0.074,0.111,0,0,0.016,0.041,0.044,0.029,0.029,0,0.017,0,0.006],[0.464,0.247,0,0.011,0,0.23,0,0.464,0.247,0.122,0,0,0.23,0,0.05,0.021,0,0.173,0.133,0.116,0.067,0.008,0.003,0.005,0.015,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.113,0,0,0.016,0.041,0.044,0.029,0.029,0,0.017,0.001,0.006],[0.35,0.186,0,0.012,0,0.234,0,0.35,0.186,0.123,0,0,0.234,0,0.05,0.021,0,0.173,0.133,0.117,0.067,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.043,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.114,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.004],[0.229,0.12,0,0.017,0,0.233,0,0.229,0.119,0.128,0,0,0.233,0,0.05,0.021,0,0.173,0.134,0.117,0.068,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.114,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.003],[0.142,0.063,0,0.027,0,0.225,0,0.143,0.063,0.139,0,0,0.225,0,0.05,0.021,0,0.174,0.134,0.117,0.069,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.113,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.002],[0.103,0.032,0,0.022,0,0.213,0,0.103,0.032,0.134,0,0,0.213,0,0.05,0.021,0,0.174,0.135,0.117,0.07,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.072,0.072,0.111,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.001],[0.072,0.012,0,0.019,0,0.203,0,0.072,0.012,0.131,0,0,0.203,0,0.05,0.021,0,0.174,0.135,0.117,0.07,0.008,0.003,0.006,0.014,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.073,0.073,0.108,0,0,0.016,0.041,0.044,0.029,0.029,0,0.019,0,0],[0.04,0.001,0,0.016,0,0.195,0,0.04,0.001,0.128,0,0,0.195,0,0.05,0.021,0,0.175,0.136,0.117,0.071,0.008,0.003,0.006,0.015,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.075,0.075,0.104,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0],[0.022,0,0,0.016,0.001,0.188,0,0.022,0,0.128,0,0.001,0.188,0,0.05,0.021,0,0.175,0.137,0.116,0.073,0.008,0.003,0.007,0.016,0.018,0.012,0.042,0.039,0.092,0.074,0.056,0.045,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.08,0.08,0.099,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,-0,-0],[0.012,0,0,0.012,0.002,0.182,0,0.012,0,0.125,0,0.002,0.182,0,0.05,0.021,0,0.178,0.14,0.116,0.076,0.008,0.003,0.007,0.016,0.019,0.013,0.042,0.039,0.092,0.074,0.057,0.046,0.014,0.075,0.017,0.018,0.177,0.171,0.015,0.015,0.085,0.085,0.096,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,-0,-0],[0.007,0,0,0.01,0.005,0.177,0,0.007,0,0.123,0,0.005,0.178,0,0.05,0.021,0,0.178,0.141,0.117,0.077,0.008,0.003,0.007,0.015,0.019,0.013,0.042,0.038,0.092,0.074,0.057,0.045,0.014,0.075,0.017,0.018,0.176,0.171,0.015,0.015,0.088,0.088,0.093,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,-0,-0]]}

I am trying to distribute the received blend shape frames across the duration of the audio:

// Linearly rescale a position in the audio range to a position in the blend frame range.
const audioFrametoBlendShapeFrame = (audioFrame: number, audioMin = 0, audioMax: number, blendFrameMin = 0, blendFrameMax: number): number => {
  return (audioFrame - audioMin) * (blendFrameMax - blendFrameMin) / (audioMax - audioMin) + blendFrameMin;
}
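An alternative to stretching the frames over the audio duration is to derive the index directly from the playback time and a fixed frame rate. This is a sketch, not a confirmed fix: the default of 60 frames per second is an assumption that should be checked against the Azure documentation linked earlier, and frameIndexAtTime is a made-up name.

```typescript
// Hypothetical helper: picks the frame for a playback time at a fixed rate.
// Assumption: frames are evenly spaced at `fps` frames per second.
// Returns -1 when there are no frames; otherwise clamps to a valid index.
function frameIndexAtTime(
  currentTimeSec: number,
  frameCount: number,
  fps = 60,
): number {
  if (frameCount <= 0) {
    return -1;
  }
  const index = Math.floor(currentTimeSec * fps);
  return Math.min(Math.max(index, 0), frameCount - 1);
}
```

Clamping to frameCount - 1 also avoids the out-of-range index that Math.round can produce when the current time reaches the end of the audio.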

srcnalt commented 1 year ago

Thanks for the details. I took a look at the Azure blend shapes, and it seems like they map better onto the viseme blend shapes than onto the ARKit ones.

Sadly the mapping below is not going to be 100% accurate, but it should help. It is based on the viseme IDs listed here: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme?tabs=visemeid&pivots=programming-language-csharp

Also, you can pass group names as the morphTargets value to shorten the URL: https://docs.readyplayer.me/ready-player-me/api-reference/rest-api/avatars/get-3d-avatars#examples-7
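As a rough illustration of the shorter URL, the sketch below builds the avatar request from a list of morph target names or group names. The group name "ARKit" in the usage note is an assumption taken from the Ready Player Me docs linked above and should be verified there; buildAvatarUrl is a hypothetical helper, and the avatar ID is the placeholder used earlier in this thread.

```typescript
// Hypothetical helper: builds a Ready Player Me model URL.
// `morphTargets` may hold individual morph target names or group names.
function buildAvatarUrl(
  avatarId: string,
  morphTargets: string[],
  textureAtlas = 1024,
): string {
  const targets = morphTargets
    .map((name) => encodeURIComponent(name))
    .join(",");
  return `https://models.readyplayer.me/${avatarId}.glb?morphTargets=${targets}&textureAtlas=${textureAtlas}`;
}
```

For example, buildAvatarUrl("0000000000000000", ["ARKit"]) would request the whole group instead of listing all 52 names.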

public static Dictionary<int, int> VisemeMap = new Dictionary<int, int>()
{
    {0, 0},   // viseme_sil
    {1, 10},  // viseme_aa
    {2, 10},  // viseme_aa
    {3, 13},  // viseme_O
    {4, 11},  // viseme_E
    {5, 11},  // viseme_E
    {6, 12},  // viseme_I
    {7, 14},  // viseme_U
    {8, 13},  // viseme_O
    {9, 10},  // viseme_aa
    {10, 13}, // viseme_O
    {11, 10}, // viseme_aa
    {12, 3},  // viseme_TH
    {13, 13}, // viseme_O
    {14, 12}, // viseme_I
    {15, 7},  // viseme_SS
    {16, 6},  // viseme_CH
    {17, 4},  // viseme_DD
    {18, 2},  // viseme_FF
    {19, 8},  // viseme_nn
    {20, 5},  // viseme_kk
    {21, 1},  // viseme_PP 
};
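For the React/TypeScript code earlier in this thread, the same table could be expressed roughly as follows. This is a sketch only: it assumes the avatar exposes the viseme morph targets under the names used in the comments above, that their indices follow the order listed here, and that index 9 is viseme_RR (that index does not appear in the table, so the name is an assumption). visemeWeights is a hypothetical helper.

```typescript
// Viseme morph target names by index, as implied by the comments in the
// table above. Assumption: index 9 is "viseme_RR" (not present in the table).
const OCULUS_VISEME_NAMES: string[] = [
  "viseme_sil", "viseme_PP", "viseme_FF", "viseme_TH", "viseme_DD",
  "viseme_kk", "viseme_CH", "viseme_SS", "viseme_nn", "viseme_RR",
  "viseme_aa", "viseme_E", "viseme_I", "viseme_O", "viseme_U",
];

// Azure viseme ID (array position) -> index into OCULUS_VISEME_NAMES,
// copied from the VisemeMap dictionary above.
const AZURE_TO_OCULUS: number[] = [
  0, 10, 10, 13, 11, 11, 12, 14, 13, 10, 13,
  10, 3, 13, 12, 7, 6, 4, 2, 8, 5, 1,
];

// Hypothetical helper: returns a weight per viseme morph target, with the
// mapped viseme at 1 and all others at 0. Unknown IDs fall back to silence.
function visemeWeights(azureVisemeId: number): Record<string, number> {
  const mapped = AZURE_TO_OCULUS[azureVisemeId];
  const active = mapped === undefined ? 0 : mapped;
  const weights: Record<string, number> = {};
  OCULUS_VISEME_NAMES.forEach((name, i) => {
    weights[name] = i === active ? 1 : 0;
  });
  return weights;
}
```

Switching weights abruptly between 0 and 1 will look stiff; blending toward the target weights over a few frames is a common refinement, but that is left out here to keep the mapping itself clear.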

srcnalt / rpm-face-tracking

Ready Player Me with Microsoft Speech SDK #3