uezo / aiavatarkit

🥰 Building AI-based conversational avatars lightning fast ⚡️💬
Apache License 2.0

Provide examples to use alternative Text-to-Speech services #26

Open Fu-u0718 opened 7 months ago

Fu-u0718 commented 7 months ago

I'd like to have conversations with the avatar not only in Japanese but also in English, but I learned that VOICEVOX, which the code uses, cannot speak English. Have you built a program that uses another Text-to-Speech service, such as Google's or Azure's?

uezo commented 7 months ago

Hi @Fu-u0718, you can make a custom SpeechController based on the TTS service you like:

  1. Make a SpeechController that implements aiavatar.speech.SpeechController
  2. Set an instance of your custom SpeechController to AvatarController
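
For orientation, here is a minimal do-nothing controller showing the methods the Azure example below relies on (`prefetch`, `speak`, `is_speaking`). This skeleton is my own sketch, not part of AIAvatarKit: the `EchoSpeechController` name and print-based "playback" are placeholders, and a real implementation would subclass `aiavatar.speech.SpeechController` instead of standing alone:

```python
import asyncio

# Hypothetical skeleton: a real controller would subclass
# aiavatar.speech.SpeechController and play real audio; this one
# just prints, to show the expected interface.
class EchoSpeechController:
    def __init__(self):
        self._is_speaking = False

    def prefetch(self, text: str):
        # Real implementations kick off audio synthesis/download here
        return text

    async def speak(self, text: str):
        self._is_speaking = True
        try:
            print(f"(speaking) {text}")  # stand-in for audio playback
        finally:
            self._is_speaking = False

    def is_speaking(self) -> bool:
        return self._is_speaking

asyncio.run(EchoSpeechController().speak("hello"))
```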

Here is an example for Azure:

  1. Make AzureSpeechController
import aiohttp
import asyncio
import io
from logging import getLogger, NullHandler
import traceback
import wave
import numpy
import sounddevice
from aiavatar.speech import SpeechController

class VoiceClip:
    def __init__(self, text: str):
        self.text = text
        self.download_task = None
        self.audio_clip = None

class AzureSpeechController(SpeechController):
    def __init__(self, api_key: str, region: str,
                 speaker_name: str = "ja-JP-AoiNeural",
                 speaker_gender: str = "Female", lang: str = "ja-JP",
                 device_index: int = -1, playback_margin: float = 0.1):
        self.logger = getLogger(__name__)
        self.logger.addHandler(NullHandler())

        self.api_key = api_key
        self.region = region
        self.speaker_name = speaker_name
        self.speaker_gender = speaker_gender
        self.lang = lang

        self.device_index = device_index
        self.playback_margin = playback_margin
        self.voice_clips = {}
        self._is_speaking = False

    async def download(self, voice: VoiceClip):
        # Azure TTS REST endpoint; riff-16khz-16bit-mono-pcm is a WAV format
        # that can be handed to the wave module as-is
        url = f"https://{self.region}.tts.speech.microsoft.com/cognitiveservices/v1"
        headers = {
            "X-Microsoft-OutputFormat": "riff-16khz-16bit-mono-pcm",
            "Content-Type": "application/ssml+xml",
            "Ocp-Apim-Subscription-Key": self.api_key
        }
        ssml_text = f"<speak version='1.0' xml:lang='{self.lang}'><voice xml:lang='{self.lang}' xml:gender='{self.speaker_gender}' name='{self.speaker_name}'>{voice.text}</voice></speak>"
        data = ssml_text.encode("utf-8")

        async with aiohttp.ClientSession() as session:
            async with session.post(url, headers=headers, data=data) as response:
                if response.status == 200:
                    voice.audio_clip = await response.read()
                else:
                    self.logger.error(f"Failed to synthesize speech: {response.status} {await response.text()}")

    def prefetch(self, text: str):
        # Reuse the cached clip if this text was requested before
        v = self.voice_clips.get(text)
        if v:
            return v

        v = VoiceClip(text)
        v.download_task = asyncio.create_task(self.download(v))
        self.voice_clips[text] = v
        return v

    async def speak(self, text: str):
        voice = self.prefetch(text)

        # Wait for the download started by prefetch if the clip isn't ready yet
        if not voice.audio_clip:
            await voice.download_task

        if not voice.audio_clip:
            self.logger.error(f"No audio clip for text: {text}")
            return

        with wave.open(io.BytesIO(voice.audio_clip), "rb") as f:
            try:
                self._is_speaking = True
                data = numpy.frombuffer(
                    f.readframes(f.getnframes()),
                    dtype=numpy.int16
                )
                framerate = f.getframerate()
                sounddevice.play(data, framerate, device=self.device_index, blocking=False)
                # Wait for playback to finish: duration = frames / sample rate, plus a margin
                await asyncio.sleep(len(data) / framerate + self.playback_margin)

            except Exception as ex:
                self.logger.error(f"Error at speaking: {str(ex)}\n{traceback.format_exc()}")

            finally:
                self._is_speaking = False

    def is_speaking(self) -> bool:
        return self._is_speaking
  2. Set an instance of your custom SpeechController to AvatarController
app.avatar_controller.speech_controller = AzureSpeechController(
    AZURE_SUBSCRIPTION_KEY, AZURE_REGION,
    speaker_name="en-US-AvaNeural",
    speaker_gender="Female",
    lang="en-US",
    device_index=2    # Set the output device number on your PC
)
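
One caveat worth noting (my addition, not part of the original example): the text is interpolated into the SSML body as-is, so characters like `&` or `<` in an LLM response would produce invalid XML and a failed request. Escaping the text with the standard library before building the SSML avoids this. The `build_ssml` helper below and its defaults are hypothetical; in `AzureSpeechController.download` the same effect is achieved by wrapping `voice.text` in `xml.sax.saxutils.escape` before formatting the SSML string:

```python
from xml.sax.saxutils import escape

def build_ssml(text: str, lang: str = "en-US",
               gender: str = "Female", name: str = "en-US-AvaNeural") -> str:
    # escape() turns &, < and > into XML entities so the payload stays well-formed
    safe_text = escape(text)
    return (
        f"<speak version='1.0' xml:lang='{lang}'>"
        f"<voice xml:lang='{lang}' xml:gender='{gender}' name='{name}'>"
        f"{safe_text}</voice></speak>"
    )

print(build_ssml("AT&T < 5G"))
```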

However, I've found that AIAvatar has an issue handling English responses from ChatGPT. I will fix it soon.

uezo commented 7 months ago

I've fixed it👍 https://github.com/uezo/aiavatarkit/pull/32

Fu-u0718 commented 7 months ago

Thank you! I learned a lot. I'm looking forward to enjoying conversations in English as well. Thank you for taking the time out of your busy schedule to respond!

mosu7 commented 4 months ago

Hi, I tried with the OpenAI speech service, however it got stuck on [INFO] 2024-07-15 17:28:44,009 : Listening... (OpenAIWakewordListener)

uezo commented 4 months ago

Hi @mosu7, thank you for your post, but this issue is about Text-to-Speech, not the wake word listener. Please open another issue if you want to discuss that.