met4citizen / TalkingHead

Talking Head (3D): A JavaScript class for real-time lip-sync using Ready Player Me full-body 3D avatars.
MIT License

JWT for Flask service #51

Closed hernanjls closed 1 month ago

hernanjls commented 3 months ago

Hi, I'm new to this library. I'm trying to set up a proxy so I can use the library with Flask behind nginx, but I can't get it to work.

The app is the following:

from flask import Flask, render_template, request, jsonify
from flask_jwt_extended import JWTManager, create_access_token
import requests
import os
from dotenv import load_dotenv  # to load environment variables from a .env file

app = Flask(__name__)
app.config['SECRET_KEY'] = 'secreto123456'
app.config['JWT_SECRET_KEY'] = 'secreto123456'
jwt = JWTManager(app)

@app.route('/token', methods=['POST'])
def token():
    access_token = create_access_token(identity="guest")
    return jsonify(access_token=access_token)

@app.route('/')
def main():
    return render_template('main.html')

@app.route('/gtts', methods=['POST'])
def gtts_proxy():
    token = request.headers.get("Authorization")
    headers = {'Authorization': token} if token else {}
    json_data = request.get_json()

    response = requests.post(
        'https://eu-texttospeech.googleapis.com/v1beta1/text:synthesize?key=the-api-key',
        headers=headers,
        json=json_data
    )

    if response.status_code != 200:
        return jsonify({"error": "Error in the request to Google TTS", "details": response.json()}), response.status_code

    return jsonify(response.json()), response.status_code

if __name__ == '__main__':
    load_dotenv()
    app.run(debug=True, host='0.0.0.0', port=5150)

The HTML/JavaScript is the following (main.html):

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Main Page</title>
    <link rel="stylesheet" href="/static/css/styles.css">  <!-- link your CSS file -->
</head>
<body>
  <h1>TalkingHead Avatar</h1>
  <div id="avatar"></div>
  <div id="controls">
    <input id="text" type="text" value="Hi there. How are you? I'm fine.">
    <input id="speak" type="button" value="Speak">
  </div>
  <div id="loading"></div>

  <script type="importmap">
  { "imports":
    {
      "three": "https://cdn.jsdelivr.net/npm/three@0.161.0/build/three.module.js/+esm",
      "three/addons/": "https://cdn.jsdelivr.net/npm/three@0.161.0/examples/jsm/",
      "talkinghead": "https://cdn.jsdelivr.net/gh/met4citizen/TalkingHead@1.2/modules/talkinghead.mjs"
    }
  }
  </script>
    <script type="module">
        import { TalkingHead } from "talkinghead";

        async function jwtGet() {
           const response = await fetch('/token', { method: 'POST' });
           const data = await response.json();
           const token = data.access_token;
           //alert(token);  // show token
           return token;
        }

        // Initialize the avatar once the DOM has loaded
        document.addEventListener('DOMContentLoaded', async () => {

          const nodeAvatar = document.getElementById('avatar');
          const head = new TalkingHead(nodeAvatar, {
            ttsEndpoint: "/gtts/",
            jwtGet: jwtGet,
            cameraZoomEnable: true,
            cameraPanEnable: true,
            cameraView: 'full',
            lipsyncModules: ["en", "fi"]
          });

          // Load and show the avatar
          const nodeLoading = document.getElementById('loading');
          try {
            nodeLoading.textContent = "Loading...";
            await head.showAvatar( {
              url: 'https://models.readyplayer.me/64bfa15f0e72c63d7c3934a6.glb?morphTargets=ARKit,Oculus+Visemes,mouthOpen,mouthSmile,eyesClosed,eyesLookUp,eyesLookDown&textureSizeLimit=1024&textureFormat=png',
              body: 'F',
              avatarMood: 'neutral',
              ttsLang: "en-GB",
              ttsVoice: "en-GB-Standard-A",
              lipsyncLang: 'en'
            }, (ev) => {
              if ( ev.lengthComputable ) {
                let val = Math.min(100, Math.round(ev.loaded / ev.total * 100));
                nodeLoading.textContent = "Loading " + val + "%";
              }
            });
            nodeLoading.style.display = 'none';
          } catch (error) {
            console.log(error);
            nodeLoading.textContent = error.toString();
          }

          // Speak when clicked
          const nodeSpeak = document.getElementById('speak');
          nodeSpeak.addEventListener('click', function () {
            try {
              const text = document.getElementById('text').value;
              if ( text ) {
                head.speakText( text );
              }
            } catch (error) {
              console.log(error);
            }
          });

        });

    </script>
</body>
</html>

The nginx configuration is the following:

server {
    listen 80;
    server_name mydomain.net;

    location / {
        proxy_pass http://127.0.0.1:5150;  # port where Gunicorn is listening
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_redirect off;
    }

    # Proxy for the Google TTS API
    location /gtts/ {
        proxy_pass http://127.0.0.1:5150/gtts/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Location for serving static files
    location /static/ {
        alias /var/www/mydomain/static/;  # static files
    }

    location ~* \.mjs$ {
        add_header Content-Type application/javascript;
    }

    listen 443 ssl;
    ssl_certificate /etc/letsencrypt/live/mydomain.net/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/mydomain.net/privkey.pem;

}

I'm not an expert in Flask, but I think the nginx config file is OK? The output page works and loads the avatar:

[Screenshot: the page with the avatar loaded]

But when I try to make it speak, this happens in the Google Chrome console:

[Screenshot: 404 (Not Found) errors for the /gtts/ request in the Chrome console]

Please help me fix these errors so I can work with this library.

met4citizen commented 3 months ago

Hi.

Does the 404 (Not Found) error for the /gtts/ request originate from the nginx server, the Flask server, or the Google TTS server? - You can probably determine this by checking the Network tab in your browser's developer tools, where you can examine the actual request and response headers, among other details.

If the 404 error is coming from either nginx or Flask, it might be due to some configuration issue. Unfortunately, I'm not familiar with nginx or Flask, so I can't really say what that issue might be. In principle, your configuration seems alright to me.

If the 404 error is from the Google TTS server, here are a couple of observations. First, you should not send the Authorization header with a JWT to the Google TTS server. It is sufficient to include your Google TTS API key in the URL. The JWT should only be used to verify that the end-user is authorized to use your Google TTS API key for the request, based on the information specified in the token (such as username, expiration time, etc.).
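A minimal sketch of that header-handling rule, using a hypothetical helper (`build_google_tts_request` is not part of the library or of Flask): the client's JWT is checked server-side and then dropped, so only the API key in the URL reaches Google.

```python
# Sketch: build the upstream Google TTS request WITHOUT forwarding the
# client's JWT. The function name and the host are illustrative; the host
# matches the one used in the Flask code above, and the key placeholder
# stands in for a real API key.

def build_google_tts_request(client_headers, body, api_key):
    """Return (url, headers, json_body) for the upstream Google TTS call.

    The client's Authorization header (the JWT) is intentionally not
    forwarded; Google authenticates via the API key in the URL. The Host
    header is also dropped so it is set for the upstream host instead.
    """
    url = ("https://eu-texttospeech.googleapis.com/v1beta1/"
           f"text:synthesize?key={api_key}")
    upstream_headers = {
        k: v for k, v in client_headers.items()
        if k.lower() not in ("authorization", "host")
    }
    upstream_headers["Content-Type"] = "application/json"
    return url, upstream_headers, body
```

The proxy route would first verify the JWT (for example with flask_jwt_extended's decorators), then call something like this to build the outgoing request.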

In your token method, it seems you return a JSON string. However, according to the standard, a JSON Web Token consists of three Base64-URL encoded strings separated by dots. I'm not sure what you plan to do with your code eventually, but typically, once the user is identified via SSO, you would return a JWT that grants the user restricted and time-limited rights to make actual API calls. For more information about JWT, refer to jwt.io
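To make the three-part structure concrete, here is a standard-library-only sketch of minting an HS256 JWT (real code would use a JWT library such as PyJWT, or flask_jwt_extended's `create_access_token` as in the Flask code above; the secret here is just the example value from that code):

```python
# Minimal illustration of JWT structure: header.payload.signature,
# each part Base64-URL encoded, joined with dots.
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # Base64-URL encoding without padding, as JWT requires
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def mint_jwt(payload: dict, secret: str) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(secret.encode(), f"{header}.{body}".encode(),
                          hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

# A short-lived token for a guest user, expiring in 5 minutes
token = mint_jwt({"sub": "guest", "exp": int(time.time()) + 300},
                 "secreto123456")
# token is three Base64-URL strings separated by dots: "xxx.yyy.zzz"
```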

hernanjls commented 3 months ago

Thanks for answering. I like your component and I will keep trying to make it work in my environment; since it is pure JavaScript without React or other libraries, it would be useful to me in any web client environment. I generally program in C# for mobile and don't regularly use server-side languages; I know a little Flask because I used it to learn a bit about Gemini.

The 404 error suggests the connection to the endpoint isn't working properly. But since I want to understand how your component works, I have been looking into how to implement the server part you mention.

Would it be too much to ask for the code of the REST service you use as a proxy to implement the JWT? What language did you write it in? PHP with Apache? And what is required to make the component work with Google TTS, ElevenLabs, or another service? I understand these services are required only to obtain audio from text, or do you use them for something else?

met4citizen commented 3 months ago

You can read the outline of my own Apache2/JWT/SSO setup in the README, Appendix B.

For JWT specifically, I currently use the jwt-cli CLI tool along with shell scripts. However, there are many different JWT tools and libraries available for various programming languages (C#, PHP, Python, etc.). See https://jwt.io/libraries

I won't include my CGI scripts here due to security reasons. However, the general idea is that my get JWT CGI script generates and encodes a new token using a CLI tool, then returns the token. For example (not a real token):

{ "jwt": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c" }

When the token—that three-part Base64-URL encoded string—is used in an API proxy request, the jwtverify CGI script extracts it from the Authorization header and uses the CLI tool to decode it. If the token is valid, not expired, etc., the script allows the proxy pass. The information you include in the token and what you check are up to you.
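The verification step described above can be sketched in plain Python, again standard library only (in practice you would use a JWT library or, as described, a CLI tool; the function name and the decision to reject tokens without an `exp` claim are my own choices for this sketch):

```python
# Sketch: verify an HS256 JWT taken from an Authorization header.
# Returns the decoded payload on success, None on any failure.
import base64
import hashlib
import hmac
import json
import time

def _b64url_decode(part: str) -> bytes:
    # Re-add the padding that Base64-URL encoding strips
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def verify_jwt(auth_header: str, secret: str):
    if not auth_header.startswith("Bearer "):
        return None
    token = auth_header[len("Bearer "):]
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        return None  # not three dot-separated parts
    expected = base64.urlsafe_b64encode(
        hmac.new(secret.encode(), f"{header_b64}.{payload_b64}".encode(),
                 hashlib.sha256).digest()).rstrip(b"=").decode()
    if not hmac.compare_digest(expected, sig_b64):
        return None  # signature mismatch
    payload = json.loads(_b64url_decode(payload_b64))
    if payload.get("exp", 0) < time.time():
        return None  # expired (or missing exp; rejected here by choice)
    return payload
```

In a proxy setup, the server would call something like this before allowing the proxy pass; what you put in the payload and what you check is, as noted above, up to you.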

Yes, Google TTS and ElevenLabs are text-to-speech services. In addition to audio, they provide word-to-audio timestamps, which are essential for accurate lip-sync. If you want to use JWT + ElevenLabs, there is an Apache2 configuration example for the ElevenLabs WebSocket API in Appendix B. You can also find a client-side code example of how to use the class with ElevenLabs TTS in the test app index.html. See the methods jwtGet and elevenSpeak.

met4citizen commented 3 months ago

I'm curious, did you manage to make it work?

hernanjls commented 3 months ago

Hello, note that I haven't picked this up again; I was looking for a component like this for an AI experiment I did, and I wanted to integrate it with a talking avatar. In the end I used some examples with React, but I still haven't been able to make the lip-sync work. Anyway, I want to take the time to fully understand your component, which seems more robust to me, so just give me a little time to examine it better. If you want, you can see my experiment here:

https://avatar.virtualisimo.net/

I hope to continue exploring your component in a few more days. I would very much appreciate your help with that, and maybe I can help you improve this.

met4citizen commented 3 months ago

Thanks for the update — I enjoyed your upbeat demo!

Regarding the demo, if you haven't noticed, I have a short code example called mp3.html in the examples directory that can make an avatar lip-sync to any audio file. It uses OpenAI's Whisper to transcribe the audio and obtain word timestamps. The code then uses the TalkingHead's speakAudio method instead of speakText, which eliminates the need for any TTS engine. You can watch a related demo video here (the screen capture is from the test app index.html, but the relevant code is more or less the same as in mp3.html).

If your own audio analysis already provides word-level timestamps, integrating lip-sync would be quite straightforward without any TTS or API keys.
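As a rough sketch of that idea, here is how Whisper-style word timestamps could be reshaped into parallel word/start-time/duration arrays for timestamp-driven lip-sync. The input shape (`word`/`start`/`end` in seconds) and the output field names are assumptions for illustration; check the TalkingHead docs for the exact format `speakAudio` expects.

```python
# Sketch: convert word-level timestamps (e.g. from Whisper's verbose
# output) into parallel arrays of words, start times, and durations.

def timestamps_to_lipsync(whisper_words):
    """whisper_words: list of {"word": str, "start": sec, "end": sec}."""
    words, wtimes, wdurations = [], [], []
    for w in whisper_words:
        words.append(w["word"].strip())
        wtimes.append(round(w["start"] * 1000))                 # start, ms
        wdurations.append(round((w["end"] - w["start"]) * 1000))  # length, ms
    return {"words": words, "wtimes": wtimes, "wdurations": wdurations}

demo = timestamps_to_lipsync([
    {"word": " Hi", "start": 0.0, "end": 0.25},
    {"word": " there", "start": 0.25, "end": 0.7},
])
```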

In your demo, the key highlight was, of course, the sentiment analysis, which I also found interesting. I've conducted some sentiment analysis experiments using GPT-4 with function calling, dynamically altering the avatar's mood or triggering animations—though only with text input/output. This was the first time I've seen it applied to a song.

met4citizen commented 1 month ago

I will close this issue, but feel free to reply if you have further questions.