met4citizen / TalkingHead

Talking Head (3D): A JavaScript class for real-time lip-sync using Ready Player Me full-body 3D avatars.
MIT License
292 stars 94 forks

ElevenLabs backend issue #11

Closed Arsal-R closed 6 months ago

Arsal-R commented 6 months ago

Hello, I hope you're doing well. I'm facing an issue with some code. My primary programming language is Python, so I'm not very familiar with Node.js. From what I've gathered from GitHub and the existing code, it seems we need a backend server to process requests and return responses. I set up a simple Flask server for this purpose.

I also developed a backend function to handle requests and confirmed its functionality through separate testing - it worked. However, when I integrated it with the frontend, I received neither output nor error messages. Further investigation revealed that the function requires an 'alignments' dictionary detailing the start time, word, and total duration of the word in the audio. I've adjusted the backend to accommodate this.
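For reference, the structure I ended up building is the one TalkingHead's speakAudio method consumes (the values here are made up just for illustration; times are in milliseconds):

// Illustrative values only; the real object is built from the audio
const alignments = {
  audio: [],                   // base64 audio chunks are added separately
  words: ["Hello", "world"],
  wtimes: [0, 450],            // start time of each word in the audio
  wdurations: [400, 500]       // duration of each word
};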

After making these changes, I tried again and received audio output, but it was just white noise, not the expected speech. I'm not using Nginx or Apache2 servers. I then made some further changes to the function, and on checking I now get no error and no audio.

Thank you so much for your assistance in advance.

Here are some code samples:

Backend flask function:

import base64
import io

import requests
from flask_socketio import emit
from pydub import AudioSegment

def elevenSpeak_adapted(text):
    CHUNK_SIZE = 1024
    url = "https://api.elevenlabs.io/v1/text-to-speech/EXAVITQu4vr4xnSDxMaL"

    headers = {
        "Accept": "audio/mpeg",
        "Content-Type": "application/json",
        "xi-api-key": "49ed4d0f9499e7a5d26339731f8a16cd"
    }

    data = {
        "text": text,
        "model_id": "eleven_monolingual_v1",
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.5
        }
    }

    chunks = b""
    response = requests.post(url, json=data, headers=headers)
    response.raise_for_status()  # surface HTTP errors instead of silently producing bad audio

    for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
        chunks += chunk

    file_name = 'output_audio.mp3'
    with open(file_name, 'wb') as audio_file:
        audio_file.write(chunks)

    # Wrap the raw bytes for pydub directly; no base64 round-trip is needed here
    audio_buffer = io.BytesIO(chunks)

    audio = AudioSegment.from_file(audio_buffer)
    duration_ms = len(audio)

    print(f"Audio duration: {duration_ms} milliseconds")
    # Pack the estimates into the structure TalkingHead's speakAudio expects
    estimates = estimate_word_timings_enhanced(text, duration_ms)
    ALIGNMENTS = {
        "audio": [],
        "words": estimates[0],
        "wtimes": estimates[1],
        "wdurations": estimates[2]
    }

    print("Audio data returned")
    return base64.b64encode(chunks).decode("utf-8"), ALIGNMENTS

@socketio.on('speak_request')
def handle_speak_request(json_data):
    text = json_data.get('text')
    print("Got direct speak request!")

    if text:
        audio_data, ALIGNMENTS = elevenSpeak_adapted(text)
        print("Returning the audio data!")
        emit('speak_response', {'audio_data': audio_data, 'ALIGNMENTS':ALIGNMENTS})

Frontend Js function:

async function elevenSpeak(s, node=null) {
  if (!elevenSocket) {
    elevenInputMsgs = [
      elevenBOS,
      { "text": s, "try_trigger_generation": true }
    ];

let url = location.protocol + '//' + document.domain + ':' + '5000';
console.log("Sending request to " + url);
console.log(cfg('voice-lipsync-lang'))
console.log(cfg('voice-eleven-id'))

// Make the connection
elevenSocket = io.connect(url);

// Connection opened
elevenSocket.on("connect", function(){
  console.log("Socket is opened")
  elevenOutputMsg = null;
  while (elevenInputMsgs.length > 0) {
    // shift() once per iteration; calling it twice would drop every other message
    const msg = elevenInputMsgs.shift();
    console.log(`Sending to server through socket ${JSON.stringify(msg)}`);
    elevenSocket.emit("speak_request", msg);
  }
});

// New message received
elevenSocket.on("speak_response", function (r) {
  console.log("Received message")
  console.log(r)

  // Speak audio
  if ((r.isFinal || r.normalizedAlignment) && elevenOutputMsg) {
    console.log("r.isFinal || r.normalizedAlignment) && elevenOutputMsg")
    head.speakAudio(elevenOutputMsg, { lipsyncLang: cfg('voice-lipsync-lang') }, node ? addText.bind(null, node) : null);
    elevenOutputMsg = null;
  }

  if (!r.isFinal) {
    // New part
    console.log(1)
    if (r.alignment) {
      elevenOutputMsg = { audio: [], words: [], wtimes: [], wdurations: [] };
      console.log(2)

      // Parse chars to words
      let word = '';
      let time = 0;
      let duration = 0;
      for (let i = 0; i < r.alignment.chars.length; i++) {
        if (word.length === 0) time = r.alignment.charStartTimesMs[i];
        if (word.length && r.alignment.chars[i] === ' ') {
          elevenOutputMsg.words.push(word);
          elevenOutputMsg.wtimes.push(time);
          elevenOutputMsg.wdurations.push(duration);
          word = '';
          duration = 0;
        } else {
          duration += r.alignment.charDurationsMs[i];
          word += r.alignment.chars[i];
        }
      }
      // Add the last word if it's not empty
      if (word.length) {
        elevenOutputMsg.words.push(word);
        elevenOutputMsg.wtimes.push(time);
        elevenOutputMsg.wdurations.push(duration);
      }
    }

    // Add audio content to message
    if (r.audio && elevenOutputMsg) {
      console.log(r.audio)
      console.log(elevenOutputMsg)
      elevenOutputMsg.audio.push(head.b64ToArrayBuffer(r.audio));
    }
  }
});

elevenSocket.on("disconnect", (reason) => {
  if (reason === 'io server disconnect') {
      console.log("Socket connection has been closed by the server");

    } else {
      console.warn('Connection died', reason);
  }
  elevenSocket = null;

});

elevenSocket.on("connect_error", (error) => { console.error("Connection error:", error); });

  } else {
    // If the socket is already open, send the message directly
    let msg = { "text": s, "try_trigger_generation": s.length > 0 };
    elevenSocket.emit("speak_request", msg);
  }
}



Here is the output from all console.log in console:

![image](https://github.com/met4citizen/TalkingHead/assets/162581671/f2e0a534-cf4d-4663-b8f4-9adb3aa8e23f)

Please note that I removed the JWT from the frontend wherever it was required or used.

I hope you can help with this issue.
Thank you so much,
Arsal
met4citizen commented 6 months ago

Hi. I have limited knowledge of Python, and I have never used Flask, so please forgive me if I've misunderstood any basic concepts here.

First of all, you can certainly call ElevenLabs API directly from the JavaScript client-side code, so you don't need your own backend server for that. The main reason why you might want to implement your own ElevenLabs server-side proxy is that you don't have to put your private API key into your client-side code for everyone to see and misuse.

For accurate lip-sync you need word-to-audio alignment. The only way to get that information from ElevenLabs is to use their WebSockets API. Using ElevenLabs standard HTTP API - as your backend is doing now - and some different service for generating timestamps, is not going to give you good results. Also note that the ElevenLabs WebSockets API is a real-time streaming API. Instead of the typical request-response approach, you stream your text input and the service streams its audio/timestamp output in real-time.
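Schematically, the exchange looks roughly like this (message shapes abridged; the field names match what the index.html example already parses, but the lists are not exhaustive):

Client -> ElevenLabs (stream-input):
  { "text": " ", "voice_settings": {...}, "xi_api_key": "..." }   (begin of stream)
  { "text": "Hello world. ", "try_trigger_generation": true }     (text chunks)
  { "text": "" }                                                  (end of stream)
ElevenLabs -> client, per generated chunk:
  { "audio": "<base64 audio>", "isFinal": null,
    "normalizedAlignment": { "chars": [...], "charStartTimesMs": [...], "charDurationsMs": [...] } }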

The white noise you encountered suggests that the audio data that your code provided to the speakAudio method had some issues. The value of the audio property should be either a JavaScript AudioBuffer or an array of Base64 encoded PCM 16bit LE audio chunks with the correct sample rate (the TalkingHead's default sample rate is 22050, but you can change this with the global option pcmSampleRate). You seem to be using the latter approach, but the default audio format ElevenLabs uses is mp3_44100_128. To fix this, you can set the requested ElevenLabs audio format to pcm_22050. But as I said earlier, you should really be using their WebSockets API.
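If you do stay with the HTTP API for now, a minimal client-side sketch of that fix could look like the following. This is not code from the repository: the output_format query parameter is part of the ElevenLabs HTTP API, but the word timings passed to speakAudio are placeholders, since the HTTP API returns no alignment data.

async function speakViaHttp(head, text, voiceId, apiKey) {
  // Ask ElevenLabs for PCM 16bit LE at TalkingHead's default sample rate
  const res = await fetch(
    "https://api.elevenlabs.io/v1/text-to-speech/" + voiceId + "?output_format=pcm_22050",
    {
      method: "POST",
      headers: { "Content-Type": "application/json", "xi-api-key": apiKey },
      body: JSON.stringify({ text: text, model_id: "eleven_monolingual_v1" })
    }
  );
  const bytes = new Uint8Array(await res.arrayBuffer());

  // Base64-encode the raw PCM chunk for speakAudio
  let bin = "";
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);

  head.speakAudio({
    audio: [btoa(bin)],
    words: ["Hello"],     // placeholder timings; real values would have to come
    wtimes: [0],          // from the WebSockets API or your own estimator
    wdurations: [500]
  }, { lipsyncLang: "en" });  // or cfg('voice-lipsync-lang') in the index.html app
}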

Unfortunately, I have limited knowledge about Python/Flask, so I'm unable to provide more detailed instructions, such as how to implement WebSocket proxies, but I hope this helps you one step further.

Arsal-R commented 6 months ago

Hi, thank you so much for your quick reply.

I see, so we can call it directly from the frontend. Sorry, as I also have limited knowledge of JS/Node.js, can you tell me how I can do that? What changes do I need to make to run it, where to place the API key, etc.?

Thank you

met4citizen commented 6 months ago

Sure, I can give you detailed instructions on how to modify the index.html example app so that it calls the ElevenLabs WebSocket API directly. I must, however, point out that you should never put your private API key in any client-side code unless you are sure that you are the only user and the only one having access to that code. What you really should do is find out how to make a WebSocket proxy. But, like I said, I can't help you with that because I have never used Flask.

  1. In the index.html file there is the URL template for the ElevenLabs API proxy called elevenTTSProxy. Change it to point directly at the ElevenLabs WebSocket API endpoint:
const elevenTTSProxy = [
  "wss://api.elevenlabs.io",
  "/v1/text-to-speech/",
  "/stream-input?model_id=eleven_multilingual_v2&output_format=pcm_22050"
];
  2. Add your ElevenLabs API key to the Beginning of Stream message elevenBOS:
const elevenBOS = {
  "text": " ",
  "voice_settings": { "stability": 0.8, "similarity_boost": true },
  "generation_config": {
    "chunk_length_schedule": [500,500,500,500]
  },
  "xi_api_key": "<insert-your-api-key-here>"
};
  3. Remove the JSON Web Token from the URL:
let url = elevenTTSProxy[0];
// url += await jwtGet();
url += elevenTTSProxy[1];
url += cfg('voice-eleven-id');
url += elevenTTSProxy[2];

That's it.
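For reference, with the voice id from your backend example, the assembled URL would look like this:

wss://api.elevenlabs.io/v1/text-to-speech/EXAVITQu4vr4xnSDxMaL/stream-input?model_id=eleven_multilingual_v2&output_format=pcm_22050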

kelsoooooo commented 6 months ago

> Sure, I can give you detailed instructions on how to modify the index.html example app so that it calls the ElevenLabs WebSocket API directly. […]

Hi there, thank you so much again for your kind help and prompt follow-up, which helped a lot indeed! We tried your approach and successfully got it done! :)

However, we encountered another issue: the sound from ElevenLabs always has some unnatural pauses every 3 to 5 seconds. It seems to us that the voice synthesis by ElevenLabs cannot keep up with the text generation by ChatGPT. Here's a video example of the issue: https://youtu.be/nSbze3ITL94

We are wondering whether the issue is attributable to the real-time streaming approach? Maybe it could be resolved if we only started the voice synthesis with ElevenLabs after the whole text had been generated by ChatGPT, but we are not sure how to make such a change, because the whole project is built around this streaming setup.

It would be very much appreciated if you have any idea about how to resolve the issue. And again, thank you so much for this wonderful project and your kind help with every issue that we encountered along the way! Have a nice weekend! 👍

met4citizen commented 6 months ago

Thanks!

Based on some posts in their own forum, the gaps between audio chunks are a well-known issue with the ElevenLabs WebSocket API. The audio chunks they send are not seamless, and as far as I know, there isn't much we can do about it. It seems that they have had to make some compromises with the audio quality to decrease latency. I hope they can fix the issue because only their WebSockets API can provide time-to-audio alignment information.

The gaps aren't the only problem with ElevenLabs. There is also some variance in the tone of the voices, and in longer texts, the voice volume sometimes starts to drop towards the end. The biggest issue for me, however, is the price. In my own use case of the TalkingHead class, I now use Google TTS for free, whereas the same amount of characters in ElevenLabs would cost me $330 per month.

met4citizen commented 6 months ago

I just watched the video you linked again. It really shouldn't sound that bad. I think we can improve it by disabling ElevenLabs' automatic try_trigger_generation and flushing the generation manually after each sentence. This way, the gaps occur mostly between sentences, and you hardly notice them.

I just pushed a quick fix to the index.html file. Just three lines. I hope it helps. You can still hear some gaps in longer sentences, but it should be much better now.
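The gist of the change, as a sketch rather than the exact diff (this assumes the direct WebSocket connection from the steps above, and the flush flag of the ElevenLabs stream-input message format):

// Send each full sentence with automatic generation disabled, then ask
// ElevenLabs to flush its buffer so chunk boundaries fall on sentence boundaries.
elevenSocket.send(JSON.stringify({
  "text": sentence + " ",
  "try_trigger_generation": false,
  "flush": true
}));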

kelsoooooo commented 6 months ago

Thank you so much! It works like a charm, and the voice now sounds very fluent and natural! Of course, it is very expensive too, just like you said hahaha...

Anyway, we very much appreciate your prompt response and kind help as always. This project provides a very good basis for us to explore the possibilities of AI chatbots, and we are so grateful for your huge support with this free and open-source project. Wish you every success!