pipecat-ai / pipecat

Open Source framework for voice and multimodal conversational AI
BSD 2-Clause "Simplified" License

BUG: Cartoonic voice after upgrading to pipecat 0.0.47 #648

Open sadimoodi opened 1 week ago

sadimoodi commented 1 week ago

After upgrading to pipecat v0.0.47, all examples produce a cartoonish voice when using XTTS as the TTS service. It seems like there is a problem with the sample_rate? Rolling back to v0.0.41 fixes the problem.

Here is my code:

import asyncio, argparse
import aiohttp
import os
import sys
from pipecat.frames.frames import EndFrame
from pipecat.frames.frames import LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_response import (
    LLMAssistantResponseAggregator, LLMUserResponseAggregator)
# from pipecat.services.deepgram import DeepgramSTTService, DeepgramTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.ollama import OLLamaLLMService
from pipecat.services.xtts import XTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer #, VADParams
from loguru import logger

from dotenv import load_dotenv
load_dotenv(override=True)

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")

async def main(room_url, token):

    async with aiohttp.ClientSession() as session:
        logger.info("Just started bot")
        transport = DailyTransport(
            room_url,
            token,
            "Test",
            DailyParams(
                audio_out_enabled=True,
                transcription_enabled=True,
                vad_enabled=True,
                vad_analyzer=SileroVADAnalyzer()  # (params=VADParams(stop_secs=float(os.getenv("VAD_STOP_SECS", "0.3")))),
            ))

        llm = OpenAILLMService(
            api_key=os.getenv("OPENAI_API_KEY"),
            model="gpt-4o-mini")

        tts = XTTSService(
                aiohttp_session=session,
                voice_id="Brenda Stern", #"Claribel Dervla"
                language="en",
                base_url="http://localhost:8001"
            )

        messages = [
            {
                "role": "system",
                "content": "You are a helpful Assistant, say Hello"
            },
        ]

        tma_in = LLMUserResponseAggregator(messages)
        tma_out = LLMAssistantResponseAggregator(messages)

        pipeline = Pipeline([
            transport.input(),   # Transport user input
            tma_in,              # User responses
            llm,                 # LLM
            tts,                 # TTS
            transport.output(),  # Transport bot output
            tma_out              # Assistant spoken responses
        ])

        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True, enable_metrics=True))

        @transport.event_handler("on_first_participant_joined")
        async def on_first_participant_joined(transport, participant):
            transport.capture_participant_transcription(participant["id"])
            logger.info("First participant joined")
            messages.append({"role": "system", "content": "Please introduce yourself to the user."})
            await task.queue_frames([LLMMessagesFrame(messages)])

        @transport.event_handler("on_participant_joined")
        async def on_participant_joined(transport, participant):
            transport.capture_participant_transcription(participant["id"])
            logger.info("participant joined")

        @transport.event_handler("on_participant_left")
        async def on_participant_left(transport, participant, reason):
            await task.queue_frame(EndFrame())
            logger.info("Participant left. Exiting.")

        @transport.event_handler("on_call_state_updated")
        async def on_call_state_updated(transport, state):
            logger.info("Call state %s " % state)
            if state == "left":
                await task.queue_frame(EndFrame())

        runner = PipelineRunner()
        await runner.run(task)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="RTVI Bot Example")
    parser.add_argument("-u", type=str, help="Room URL")
    parser.add_argument("-t", type=str, help="Token")
    #parser.add_argument("-c", type=str, help="Bot configuration blob")
    p_config = parser.parse_args()

    #bot_config = json.loads(config.c) if config.c else {}
    #logger.warning()
    if p_config.u and p_config.t:
        asyncio.run(main(p_config.u, p_config.t))
    else:
        logger.error("Room URL and Token are required")
aconchillo commented 1 day ago

Yes, sample rates have been upgraded to 24000 by default. This is not released yet, but if you try the examples using Pipecat from main (instead of installed version) everything should work fine.
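For context on why a sample-rate mismatch sounds "cartoonic": if audio generated at one rate is played back assuming a different rate, both speed and pitch shift by the ratio of the two rates. A minimal sketch (the specific rates below are illustrative, not necessarily what XTTS or the transport actually use):

```python
# Hypothetical mismatch: the TTS service produces 16000 Hz audio,
# but the transport plays it back assuming the new 24000 Hz default.
tts_rate = 16000        # rate the TTS service actually produced (assumed)
playback_rate = 24000   # rate the transport assumes (new default)

speedup = playback_rate / tts_rate
print(f"audio plays {speedup:.1f}x too fast and higher-pitched")
```

A factor above 1.0 gives the fast, high-pitched "chipmunk" effect; a factor below 1.0 gives slow, low-pitched audio.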

sadimoodi commented 20 hours ago

> Yes, sample rates have been upgraded to 24000 by default. This is not released yet, but if you try the examples using Pipecat from main (instead of installed version) everything should work fine.

Thank you @aconchillo. When is the bug fix going to be released? I am not sure I got your point: I am already using pip install pipecat-ai [options] and copying the examples from the main branch. Version 0.0.41 is the only version that works for me, where the XTTS service produces a normal voice.

geekofycoder commented 17 hours ago

You can set audio_out_sample_rate=24000 in DailyParams if you are using OpenAITTSService.

sadimoodi commented 17 hours ago

> you can set the audio_out_sample_rate=24000 in DailyParams if using OpenAITTSService

I am not using OpenAITTSService but rather XTTSService. In OpenAITTSService the constructor already defaults to a 24000 sample rate:

    def __init__(
        self,
        *,
        api_key: str | None = None,
        voice: str = "alloy",
        model: Literal["tts-1", "tts-1-hd"] = "tts-1",
        sample_rate: int = 24000,

Changing the sample rate in XTTSService to 24000 doesn't solve the problem either, and setting audio_out_sample_rate=24000 in DailyParams doesn't work. Something else might be wrong here?
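Independent of which pipecat version fixes this, a rate mismatch can in principle be worked around by resampling the TTS output to the transport's rate before playback. A naive linear-interpolation sketch for mono samples (illustrative only, not pipecat's actual resampling code):

```python
def resample_linear(samples, in_rate, out_rate):
    """Resample mono PCM samples via linear interpolation.

    Illustrative workaround sketch only; real pipelines would use a
    proper resampler (e.g. a polyphase filter) to avoid aliasing.
    """
    if in_rate == out_rate:
        return list(samples)
    ratio = in_rate / out_rate
    n_out = int(len(samples) * out_rate / in_rate)
    out = []
    for i in range(n_out):
        pos = i * ratio
        j = int(pos)
        frac = pos - j
        # Clamp at the last sample when interpolating past the end.
        nxt = samples[j + 1] if j + 1 < len(samples) else samples[j]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out

# One second of 16 kHz audio becomes one second of 24 kHz audio.
one_second = [0.0] * 16000
print(len(resample_linear(one_second, 16000, 24000)))  # 24000
```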

sadimoodi commented 16 hours ago

Setting audio_out_sample_rate=31_000 almost solved the problem (the voice isn't the same as before, but it is understandable). Setting audio_out_sample_rate=32000 without the "" won't work; I suggest accepting both formats (int and Literal).
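One note on the literal formats mentioned above: in Python, `31_000` and `31000` are the same int value; underscores in numeric literals are purely visual digit separators (PEP 515), so no separate "format" needs to be accepted. A quick check:

```python
# Underscores in numeric literals are visual separators only (PEP 515).
a = 31_000
b = 31000
print(a == b, type(a) is int)  # True True
```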