Open sabithpocker opened 1 year ago
Hi @sabithpocker, this project was meant to work directly with Google MediaPipe, but at the moment I have access to Azure, and I think I can try this. There might be some mapping differences; if MS provides a way to map these array elements to blend shapes, that should be easy.
@srcnalt Apologies for the late reply, I am not working on this code full-time and only pick it up as and when I get time.
Initially I tried mapping with a few different setups, but later ended up doing no mapping at all, which works somewhat okay now but is not great. I am using a very long URL: https://models.readyplayer.me/0000000000000000.glb?morphTargets=eyeBlinkLeft,eyeLookDownLeft,eyeLookInLeft,eyeLookOutLeft,eyeLookUpLeft,eyeSquintLeft,eyeWideLeft,eyeBlinkRight,eyeLookDownRight,eyeLookInRight,eyeLookOutRight,eyeLookUpRight,eyeSquintRight,eyeWideRight,jawForward,jawLeft,jawRight,jawOpen,mouthClose,mouthFunnel,mouthPucker,mouthLeft,mouthRight,mouthSmileLeft,mouthSmileRight,mouthFrownLeft,mouthFrownRight,mouthDimpleLeft,mouthDimpleRight,mouthStretchLeft,mouthStretchRight,mouthRollLower,mouthRollUpper,mouthShrugLower,mouthShrugUpper,mouthPressLeft,mouthPressRight,mouthLowerDownLeft,mouthLowerDownRight,mouthUpperUpLeft,mouthUpperUpRight,browDownLeft,browDownRight,browInnerUp,browOuterUpLeft,browOuterUpRight,cheekPuff,cheekSquintLeft,cheekSquintRight,noseSneerLeft,noseSneerRight,tongueOut&textureAtlas=1024
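As a side note, here is a minimal debugging sketch (not part of the project) that loads the avatar and prints the head mesh's morphTargetDictionary, so the morph target order can be compared against the order of the Azure frames. The Wolf3D_Head node name is an assumption based on typical Ready Player Me exports; pass the long URL above as the url prop:

import { useGLTF } from '@react-three/drei';
import type { SkinnedMesh } from 'three';

export function AvatarMorphTargetDebug({ url }: { url: string }) {
  const { nodes } = useGLTF(url);
  // Assumption: the head mesh is named Wolf3D_Head, as in default RPM exports.
  const head = nodes.Wolf3D_Head as SkinnedMesh;
  // morphTargetDictionary maps each blend shape name to its index in morphTargetInfluences.
  console.log(head.morphTargetDictionary);
  return null;
}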
If you are interested and want to take a look, I can create a PR with the code I did on top of your code as a reference, or send it as a zip; I'll omit the API keys for Azure and OpenAI.
Here is some relevant code if you want to take a quick look:
useFrame(() => {
  if (audioPlaying && player && masterViseme && masterViseme.length > 0) {
    if (player.privIsPaused) {
      player.resume();
    }
    // Map the current audio time onto an index into the received blend shape frames.
    blendShapeFrame = Math.round(audioFrametoBlendShapeFrame(player.currentTime, 0, duration.duration, 0, masterViseme.length));
    // Apply the frame if it exists, otherwise reset all 52 morph target influences.
    headMesh[0].morphTargetInfluences =
      masterViseme[blendShapeFrame] && masterViseme[blendShapeFrame].length > 0
        ? masterViseme[blendShapeFrame]
        : Array(52).fill(0);
  }
});
VISEME RECEIVED EVENT
synthesizer.visemeReceived = function (s: any, e: any) {
  // e.animation is a JSON string; BlendShapes is an array of frames, each frame an array of blend shape weights.
  let animationData: { BlendShapes: number[][], FrameIndex: number } = JSON.parse(e.animation);
  masterViseme.push(...animationData.BlendShapes);
};
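For reference, a minimal sketch of the synthesizer setup around that handler (key, region and voice name are placeholders). Per the Azure viseme docs, e.animation only carries BlendShapes frames when the SSML requests FacialExpression visemes:

import * as sdk from 'microsoft-cognitiveservices-speech-sdk';

const speechConfig = sdk.SpeechConfig.fromSubscription('<AZURE_KEY>', '<AZURE_REGION>');
const synthesizer = new sdk.SpeechSynthesizer(speechConfig);
// synthesizer.visemeReceived = ...  (the handler shown above)

// The mstts:viseme element with type="FacialExpression" is what makes
// e.animation contain the BlendShapes frames instead of being empty.
const ssml = `
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:viseme type="FacialExpression"/>
    Hello from Ready Player Me.
  </voice>
</speak>`;

synthesizer.speakSsmlAsync(
  ssml,
  () => synthesizer.close(),
  (error) => { console.error(error); synthesizer.close(); }
);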
Sample response for blendshapes:
{"FrameIndex":249,"BlendShapes":[[0.423,0.215,0,0.008,0,0.208,0,0.423,0.214,0.119,0,0,0.208,0,0.05,0.021,0,0.172,0.132,0.116,0.065,0.008,0.003,0.005,0.015,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.076,0.076,0.106,0,0,0.016,0.041,0.044,0.029,0.029,0,0.015,0,0.005],[0.502,0.282,0,0.002,0,0.222,0,0.502,0.281,0.112,0,0,0.223,0,0.05,0.021,0,0.172,0.133,0.116,0.066,0.008,0.003,0.005,0.015,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.074,0.074,0.111,0,0,0.016,0.041,0.044,0.029,0.029,0,0.017,0,0.006],[0.464,0.247,0,0.011,0,0.23,0,0.464,0.247,0.122,0,0,0.23,0,0.05,0.021,0,0.173,0.133,0.116,0.067,0.008,0.003,0.005,0.015,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.113,0,0,0.016,0.041,0.044,0.029,0.029,0,0.017,0.001,0.006],[0.35,0.186,0,0.012,0,0.234,0,0.35,0.186,0.123,0,0,0.234,0,0.05,0.021,0,0.173,0.133,0.117,0.067,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.043,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.114,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.004],[0.229,0.12,0,0.017,0,0.233,0,0.229,0.119,0.128,0,0,0.233,0,0.05,0.021,0,0.173,0.134,0.117,0.068,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.114,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.003],[0.142,0.063,0,0.027,0,0.225,0,0.143,0.063,0.139,0,0,0.225,0,0.05,0.021,0,0.174,0.134,0.117,0.069,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.113,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.002],[0.103,0.032,0,0.022,0,0.213,0,0.103,0.032,0.134,0,0,0.213,0,0.05,0.021,0,0.174,0.135,0.117,0.07,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.072,0.072,0.111,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.001],[0.072,0.012,0,0.019,0,0.203,0,0.072,0.012,0.131,0,0,0.203,0,0.05,0.021,0,0.174,0.135,0.117,0.07,0.008,0.003,0.006,0.014,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.073,0.073,0.108,0,0,0.016,0.041,0.044,0.029,0.029,0,0.019,0,0],[0.04,0.001,0,0.016,0,0.195,0,0.04,0.001,0.128,0,0,0.195,0,0.05,0.021,0,0.175,0.136,0.117,0.071,0.008,0.003,0.006,0.015,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.075,0.075,0.104,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0],[0.022,0,0,0.016,0.001,0.188,0,0.022,0,0.128,0,0.001,0.188,0,0.05,0.021,0,0.175,0.137,0.116,0.073,0.008,0.003,0.007,0.016,0.018,0.012,0.042,0.039,0.092,0.074,0.056,0.045,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.08,0.08,0.099,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,-0,-0],[0.012,0,0,0.012,0.002,0.182,0,0.012,0,0.125,0,0.002,0.182,0,0.05,0.021,0,0.178,0.14,0.116,0.076,0.008,0.003,0.007,0.016,0.019,0.013,0.042,0.039,0.092,0.074,0.057,0.046,0.014,0.075,0.017,0.018,0.177,0.171,0.015,0.015,0.085,0.085,0.096,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,-0,-0],[0.007,0,0,0.01,0.005,0.177,0,0.007,0,0.123,0,0.005,0.178,0,0.05,0.021,0,0.178,0.141,0.117,0.077,0.008,0.003,0.007,0.015,0.019,0.013,0.042,0.038,0.092,0.074,0.057,0.045,0.014,0.075,0.017,0.018,0.176,0.171,0.015,0.015,0.088,0.088,0.093,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,-0,-0]]}
Here I am trying to distribute the received blend shape frames across the duration of the audio:
const audioFrametoBlendShapeFrame = (audioFrame: number, audioMin = 0, audioMax: number, blendFrameMin = 0, blendFrameMax: number): number => {
  // Linearly rescale a time in [audioMin, audioMax] to a frame index in [blendFrameMin, blendFrameMax].
  return (audioFrame - audioMin) * (blendFrameMax - blendFrameMin) / (audioMax - audioMin) + blendFrameMin;
}
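One usage note: Math.round of the rescaled value can land exactly on masterViseme.length at the very end of playback, so clamping keeps the lookup in range (the ternary in the useFrame above already falls back to zeros, but clamping holds the last frame instead):

const raw = audioFrametoBlendShapeFrame(player.currentTime, 0, duration.duration, 0, masterViseme.length);
const blendShapeFrame = Math.min(masterViseme.length - 1, Math.max(0, Math.round(raw)));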
Thanks for the details, I took a look at the Azure blend shapes and it seems they map better onto the viseme blend shapes than the ARKit ones.
Sadly this is not going to be 100% accurate, but it should help: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme?tabs=visemeid&pivots=programming-language-csharp
Also, you can pass group names as the morphTargets value to shorten the URL: https://docs.readyplayer.me/ready-player-me/api-reference/rest-api/avatars/get-3d-avatars#examples-7
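For example, if I read those docs right, a group name replaces the full list, something like https://models.readyplayer.me/0000000000000000.glb?morphTargets=ARKit (double-check the exact group names against the linked examples).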
public static Dictionary<int, int> VisemeMap = new Dictionary<int, int>()
{
    {0, 0},   // viseme_sil
    {1, 10},  // viseme_aa
    {2, 10},  // viseme_aa
    {3, 13},  // viseme_O
    {4, 11},  // viseme_E
    {5, 11},  // viseme_E
    {6, 12},  // viseme_I
    {7, 14},  // viseme_U
    {8, 13},  // viseme_O
    {9, 10},  // viseme_aa
    {10, 13}, // viseme_O
    {11, 10}, // viseme_aa
    {12, 3},  // viseme_TH
    {13, 13}, // viseme_O
    {14, 12}, // viseme_I
    {15, 7},  // viseme_SS
    {16, 6},  // viseme_CH
    {17, 4},  // viseme_DD
    {18, 2},  // viseme_FF
    {19, 8},  // viseme_nn
    {20, 5},  // viseme_kk
    {21, 1},  // viseme_PP
};
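If it helps on the web side, the same map can be expressed in TypeScript and driven from e.visemeId, which the JS Speech SDK also exposes on visemeReceived, instead of the BlendShapes frames. A rough, untested sketch reusing the synthesizer and headMesh names from the snippets above:

// Oculus/RPM viseme morph target names, indexed to match the dictionary above.
const VISEME_NAMES = [
  'viseme_sil', 'viseme_PP', 'viseme_FF', 'viseme_TH', 'viseme_DD', 'viseme_kk',
  'viseme_CH', 'viseme_SS', 'viseme_nn', 'viseme_RR', 'viseme_aa', 'viseme_E',
  'viseme_I', 'viseme_O', 'viseme_U',
];

// Azure viseme ID -> index into VISEME_NAMES, mirroring the C# map above.
const VISEME_MAP: Record<number, number> = {
  0: 0, 1: 10, 2: 10, 3: 13, 4: 11, 5: 11, 6: 12, 7: 14, 8: 13, 9: 10, 10: 13,
  11: 10, 12: 3, 13: 13, 14: 12, 15: 7, 16: 6, 17: 4, 18: 2, 19: 8, 20: 5, 21: 1,
};

synthesizer.visemeReceived = (s, e) => {
  const mesh = headMesh[0];
  const influences = mesh.morphTargetInfluences;
  if (!influences || !mesh.morphTargetDictionary) return;
  const name = VISEME_NAMES[VISEME_MAP[e.visemeId] ?? 0];
  const idx = mesh.morphTargetDictionary[name];
  // Very naive: snap the matched viseme fully on and relax everything else;
  // in practice this would be eased over time using e.audioOffset.
  influences.fill(0);
  if (idx !== undefined) influences[idx] = 1;
};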
Hey,
First of all, really amazing work you are doing. I came here from some of your YouTube videos; very interesting stuff with RPM and Unity.
Following this tutorial, I put together an example of me talking to OpenAI directly using the Microsoft Speech SDK.
Most of the work is done, but my lip sync is not that great. Microsoft gives me the lip sync data as arrays of blend shapes with a FrameIndex:
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme?pivots=programming-language-csharp&tabs=3dblendshapes#viseme-id
Do you suggest any way to use it with the Ready Player Me model?
Please ignore this if it is not something that interests you!