met4citizen / TalkingHead

Talking Head (3D): A JavaScript class for real-time lip-sync using Ready Player Me full-body 3D avatars.

SpeakText #18

Closed tanerdogan closed 3 months ago

tanerdogan commented 3 months ago

Hi, thanks for the great project. I have a silly question: I just want lip-sync (actually I just need a speaking animation, no sync needed for now). I tried head.speakText("Hello how are you today"); but had no luck.

I also tried this:

      const vv = new LipsyncEn();
      const vis = vv.wordsToVisemes('Hello, how are you today?');
      console.log(vis);

      for (let j = 0; j < vis.visemes.length; j++) {
        console.log(vis.times[j]);
        head.setFixedValue("viseme_" + vis.visemes[j], vis.durations[j]);
      }

I'm not a JS dev; is there any solution? Thanks

JPhilipp commented 3 months ago

Open your Chrome console by hitting F12 and try again -- if there is an error, please paste it into a reply here. Cheers!

tanerdogan commented 3 months ago

Hi again, there is no error at all... maybe the speakText method doesn't work like that?

      console.log(head.getMoodNames());
      console.log(head.getMorphTargetNames());

      head.speakText("Hello how are you today");
      head.startSpeaking();
[Screenshot: browser console output, 2024-04-03 17:05]
met4citizen commented 3 months ago

If you want the avatar to speak and lip-sync some text, simply calling the speakText method should work. There is no need to use the lip-sync module directly or call other methods.
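
For example, once the avatar has been loaded, this single call should be enough (a minimal sketch, assuming a valid ttsEndpoint and ttsApikey were given in the constructor):

      // One call handles TTS, lip-sync, and playback:
      head.speakText("Hello, how are you today?");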

Have you tried the minimal code example? - If you haven't, download ./examples/minimal.html, add your own Google TTS API key, and test.

tanerdogan commented 3 months ago

@met4citizen thanks for the reply. I think I should explain more: I'm not using Google TTS. I have my own backend built with OpenAI PHP, and I use it for the chat, TTS, and STT APIs. I'm also using Howler.js for audio...

So I create the TalkingHead like this:

      head = new TalkingHead( nodeAvatar, {
        ttsEndpoint: "no.json",
        ttsApikey: "", // <- Change this
        cameraView: "upper"
      });

and my no.json looks like this:

      {
        "words": "HELLO, HOW ARE YOU TODAY?",
        "visemes": ["I", "E", "nn", "O", "I", "aa", "RR", "I", "U", "DD", "aa", "DD", "E"],
        "times": [0, 0.92, 1.82, 2.7, 7.66, 8.58, 11.195, 13.075, 13.995, 15.945, 16.995, 17.945, 18.995],
        "durations": [0.92, 0.9, 0.88, 0.96, 0.92, 1.615, 0.88, 0.92, 0.95, 1.05, 0.95, 1.05, 0.9],
        "i": 25
      }

So for my project I just need a startSpeak (random words, no lip-sync needed) triggered on the audio's onStart, and a stopSpeak on the audio's onEnded.

BTW -- I tested mp3.html too and it works perfectly. I copied the response JSON and pasted it into no.json, but still no luck.

Maybe I'm asking a stupid question, but I'm still learning and trying to put the pieces together :-)

met4citizen commented 3 months ago

Thanks, now I understand a bit better what you are trying to do. Since the speakText method always uses Google TTS, the best starting point here is the mp3.html code example. It uses the speakAudio method and requires no TTS.

Here is a simple example that replaces the "load" button click event handler with a "manual" call to speakAudio. Instead of silence you can, of course, use some actual audio content. Times and durations are specified in milliseconds:

      // Load button clicked
      const nodeLoad = document.getElementById('load');
      nodeLoad.addEventListener('click', function () {
        // Create an empty audio buffer, 1 second long
        const audioCtx = new AudioContext();
        const audiobuffer = audioCtx.createBuffer(2, 22050, 22050);

        // Speak audio
        head.speakAudio({
          audio: audiobuffer,
          words: ["HELLO,","HOW","ARE","YOU","TODAY?"],
          wtimes: [20,500,520,640,740],
          wdurations: [320,20,110,100,240],
          markers: [],
          mtimes: []
        });
      });
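
For real audio, the buffer can come from your own backend; here is a rough sketch, assuming the backend returns an MP3 (the /tts URL and the word timings below are placeholders, not real values):

      // Inside an async function: fetch the MP3 and decode it
      // into an AudioBuffer for speakAudio.
      const audioCtx = new AudioContext();
      const response = await fetch("/tts?text=hello"); // hypothetical endpoint
      const arrayBuffer = await response.arrayBuffer();
      const audiobuffer = await audioCtx.decodeAudioData(arrayBuffer);

      head.speakAudio({
        audio: audiobuffer,
        words: ["HELLO"],
        wtimes: [0],        // placeholder start time in milliseconds
        wdurations: [500],  // placeholder duration in milliseconds
        markers: [],
        mtimes: []
      });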

I hope this helps. Feel free to ask further questions, if any.

tanerdogan commented 3 months ago

@met4citizen ♥️ Thanks for the help; I've made some progress now and want to share it here. I just wonder why the wordsToVisemes method returns 22 times & durations values? Shouldn't it be 7 for wordsToVisemes('Hello, how are you today my darling?')? And the values are much smaller. Can I use any other method instead of wordsToVisemes?

      const audioCtx = new AudioContext();
      const audiobuffer = audioCtx.createBuffer(2, 22050, 22050);

      const vv = new LipsyncEn();
      const vis = vv.wordsToVisemes('Hello, how are you today my darling?');

      // Split the processed sentence back into words and scale the
      // relative viseme times/durations to milliseconds (arbitrary factors)
      const wo = vis.words.split(" ");
      const wt = vis.times.map(i => i * 400);
      const wd = vis.durations.map(i => i * 800);

      head.speakAudio({
        audio: audiobuffer,
        words: wo,
        wtimes: wt,
        wdurations: wd,
        markers: [],
        mtimes: []
      });

Thanks again...

met4citizen commented 3 months ago

The typical use case for speakAudio is that you get both the audio and word timestamps from some external TTS engine, such as ElevenLabs or Microsoft Speech SDK. In an alternative use case, you already have some audio recording and you call some kind of transcription service to get the word timestamps for that audio file. In both cases, you finally call speakAudio, and the class internally breaks each word in the array into visemes and viseme timestamps by using the lip-sync module's wordsToVisemes method.
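
As a rough sketch of that first use case, the mapping from a generic TTS result to the speakAudio payload might look like this (the ttsResult shape with audioBuffer, words[], startSec, and endSec fields is hypothetical; map your engine's actual fields accordingly):

      // Sketch: adapt a generic TTS result into speakAudio's format.
      // The ttsResult field names here are hypothetical.
      function toSpeakAudioPayload(ttsResult) {
        return {
          audio: ttsResult.audioBuffer, // AudioBuffer with the speech
          words: ttsResult.words.map(w => w.text),
          wtimes: ttsResult.words.map(w => w.startSec * 1000), // ms
          wdurations: ttsResult.words.map(w => (w.endSec - w.startSec) * 1000), // ms
          markers: [],
          mtimes: []
        };
      }

      // head.speakAudio(toSpeakAudioPayload(result));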

You don't need to call wordsToVisemes yourself. It doesn't return word timestamps; it returns visemes and viseme timestamps for the given words. The timestamps it gives are in relative units, not in seconds or milliseconds.
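
You can verify this by inspecting the output yourself (the shape below matches the no.json content earlier in this thread):

      const lipsync = new LipsyncEn();
      const vis = lipsync.wordsToVisemes('Hello, how are you today?');
      console.log(vis.visemes);   // one entry per viseme, e.g. ["I","E","nn","O", ...]
      console.log(vis.times);     // viseme start times in relative units
      console.log(vis.durations); // viseme durations in relative units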

Now, if you don't plan to use any audio file(?) and only want to make the lips move, you can just estimate the full duration of the sentence and do the following:

      // Load button clicked
      const nodeLoad = document.getElementById('load');
      nodeLoad.addEventListener('click', function () {
        // Estimate duration of the sentence in seconds
        let duration = 1.0;

        // Create an empty audio buffer of appropriate length
        const audioCtx = new AudioContext();
        const audiobuffer = audioCtx.createBuffer(2, Math.round(duration * 22050), 22050);

        // Speak audio
        head.speakAudio({
          audio: audiobuffer,
          words: ["Hello, how are you today?"],
          wtimes: [0],
          wdurations: [duration * 1000],
          markers: [],
          mtimes: []
        });
      });

So, instead of multiple words, you give the class the full sentence as a single multi-part word. - I was going to add that the resulting lip-sync will not be very accurate this way, but if you don't have any audio file then how can you tell... 🙂
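
One rough way to estimate the duration is from the word count; a small sketch, assuming a speaking rate of about 150 words per minute (that rate is a guess; tune it to your voice), which you can then plug into the snippet above:

      // Estimate sentence duration from word count, assuming
      // roughly 150 words per minute (adjust for your TTS voice).
      const text = "Hello, how are you today?";
      const wordCount = text.trim().split(/\s+/).length;
      let duration = wordCount * 60 / 150; // seconds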

tanerdogan commented 3 months ago

Hello, I am using audio (TTS) and also audio recording (STT). Maybe I just can't figure out how to do it with my limited frontend knowledge... For now it works as I want: the fake speaking starts when the audio plays, driven by setInterval, with clearInterval called on onend (sketched below)...
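
Roughly like this, for anyone curious (a sketch of the hack just described; the file name, interval length, and filler words are guesses):

      // Drive the mouth with short silent buffers while Howler.js
      // plays the real audio; stop when playback ends.
      const audioCtx = new AudioContext();
      const sound = new Howl({ src: ["speech.mp3"] }); // hypothetical file

      let timer = null;
      sound.on("play", () => {
        timer = setInterval(() => {
          const silence = audioCtx.createBuffer(2, 22050, 22050); // 1 second
          head.speakAudio({
            audio: silence,
            words: ["blah blah"], // filler text just to move the lips
            wtimes: [0],
            wdurations: [1000],
            markers: [],
            mtimes: []
          });
        }, 1000);
      });
      sound.on("end", () => clearInterval(timer));
      sound.play();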

Thanks everyone again...

https://github.com/met4citizen/TalkingHead/assets/634890/b0d60e8e-254e-43d2-bb0c-2332ea6f1545

met4citizen commented 3 months ago

Visually, that looks great - I would love to visit that place!

The lip movement is a bit extreme and the timing isn't right, but that can be fixed. - Maybe you can find some local JavaScript expert to help?

I will close this issue for now because we have drifted away from the original title/topic, but feel free to reopen or create a new issue if needed. Best of luck!