pipecat-ai / pipecat

Open Source framework for voice and multimodal conversational AI
BSD 2-Clause "Simplified" License

Audio Out Mixer Causing Ram Out of Memory and blocking the pipeline #740

Open Vaibhav-Lodha opened 5 days ago

Vaibhav-Lodha commented 5 days ago

Description

Is this reporting a bug or feature request? bug

If reporting a bug, please fill out the following:

Environment

Issue description

Provide a clear description of the issue. The audio_mixer does not work with the websocket transport. As soon as I enable it, it blocks my websocket connection and gets stuck while loading the audio. It also causes RAM usage to grow until the process runs out of memory within seconds, even for audio files under 200 KB.
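I have not traced the leak itself, but one pattern consistent with these symptoms is a producer feeding an unbounded queue faster than the websocket can drain it, so mixed frames pile up in memory. A minimal sketch of the bounded-queue alternative, where the queue applies backpressure to the producer (all names here are illustrative, not pipecat internals):

```python
import asyncio

async def run_demo() -> int:
    # Illustrative sketch (not pipecat internals): a bounded queue makes a
    # fast audio producer wait for the consumer instead of buffering frames
    # without limit in RAM.
    q: asyncio.Queue = asyncio.Queue(maxsize=2)

    async def producer():
        for _ in range(4):
            await q.put(b"\x00" * 320)  # 20 ms of 8 kHz / 16-bit mono audio
        await q.put(None)               # sentinel: end of stream

    async def consumer() -> int:
        frames = 0
        while await q.get() is not None:
            frames += 1
        return frames

    _, consumed = await asyncio.gather(producer(), consumer())
    return consumed

consumed = asyncio.run(run_demo())
print(consumed)  # 4 frames delivered, never more than 2 buffered at once
```

With `maxsize=2`, `await q.put(...)` suspends the producer whenever two frames are already waiting, so memory stays bounded regardless of how fast frames are generated.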

Repro steps

List the steps to reproduce the issue.

import asyncio
import os
import sys

from deepgram import LiveOptions
from dotenv import load_dotenv
from loguru import logger
from pipecat.audio.mixers.soundfile_mixer import SoundfileMixer
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask, PipelineParams
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.network.websocket_server import WebsocketServerTransport, WebsocketServerParams

load_dotenv(override=True)

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")

DESIRED_SAMPLE_RATE = 8000

async def main():

    mixer = SoundfileMixer(
        sound_files={"office": "assets/office-ambience.mp3"},
        default_sound="office",
        volume=2.0,
        loop=True,
    )
    transport = WebsocketServerTransport(
        params=WebsocketServerParams(
            audio_in_channels=1,
            audio_in_enabled=True,
            audio_in_sample_rate=DESIRED_SAMPLE_RATE,
            audio_out_sample_rate=DESIRED_SAMPLE_RATE,
            audio_out_enabled=True,
            add_wav_header=True,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(
                params=VADParams(
                    start_secs=0.1,
                ),
            ),
            vad_audio_passthrough=True,
            audio_out_mixer=mixer,
        ),
    )

    llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o")

    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        live_options=LiveOptions(
            language="hi",
            model="nova-2",
            sample_rate=DESIRED_SAMPLE_RATE,
        ),
    )

    tts = ElevenLabsTTSService(
                api_key=os.getenv("ELEVENLABS_API_KEY"),
                voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
            )

    messages = [
        {
            "role": "system",
            "content": "Hello, how can I help you?",
        },
    ]

    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline(
        [
            transport.input(),  # Websocket input from client
            stt,  # Speech-To-Text
            context_aggregator.user(),
            llm,  # LLM
            tts,  # Text-To-Speech
            transport.output(),  # Websocket output to client
            context_aggregator.assistant(),
        ],
    )

    task = PipelineTask(
        pipeline,
        params=PipelineParams(allow_interruptions=True, enable_metrics=True, enable_usage_metrics=True),
    )

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        # Kick off the conversation.

        messages.append(
            {"role": "system", "content": "You are a female assistant, Please introduce yourself to the user."},
        )
        await task.queue_frames([LLMMessagesFrame(messages)])

    runner = PipelineRunner()

    await runner.run(task)

if __name__ == "__main__":
    asyncio.run(main())

Expected behavior

The audio mixer and the websocket transport should work in parallel.

Actual behavior

The audio mixer blocks the websocket transport and causes RAM usage to grow until the process runs out of memory.

Logs

2024-11-20 21:37:38.730 | DEBUG    | pipecat.audio.vad.silero:__init__:114 - Loading Silero VAD model...
2024-11-20 21:37:38.765 | DEBUG    | pipecat.audio.vad.silero:__init__:136 - Loaded Silero VAD
2024-11-20 21:37:38.777 | DEBUG    | pipecat.processors.frame_processor:link:143 - Linking PipelineSource#0 -> WebsocketServerInputTransport#0
2024-11-20 21:37:38.777 | DEBUG    | pipecat.processors.frame_processor:link:143 - Linking WebsocketServerInputTransport#0 -> DeepgramSTTService#0
2024-11-20 21:37:38.777 | DEBUG    | pipecat.processors.frame_processor:link:143 - Linking DeepgramSTTService#0 -> OpenAIUserContextAggregator#0
2024-11-20 21:37:38.777 | DEBUG    | pipecat.processors.frame_processor:link:143 - Linking OpenAIUserContextAggregator#0 -> OpenAILLMService#0
2024-11-20 21:37:38.777 | DEBUG    | pipecat.processors.frame_processor:link:143 - Linking OpenAILLMService#0 -> ElevenLabsTTSService#0
2024-11-20 21:37:38.778 | DEBUG    | pipecat.processors.frame_processor:link:143 - Linking ElevenLabsTTSService#0 -> WebsocketServerOutputTransport#0
2024-11-20 21:37:38.778 | DEBUG    | pipecat.processors.frame_processor:link:143 - Linking WebsocketServerOutputTransport#0 -> OpenAIAssistantContextAggregator#0
2024-11-20 21:37:38.778 | DEBUG    | pipecat.processors.frame_processor:link:143 - Linking OpenAIAssistantContextAggregator#0 -> PipelineSink#0
2024-11-20 21:37:38.778 | DEBUG    | pipecat.processors.frame_processor:link:143 - Linking Source#0 -> Pipeline#0
2024-11-20 21:37:38.778 | DEBUG    | pipecat.processors.frame_processor:link:143 - Linking Pipeline#0 -> Sink#0
2024-11-20 21:37:38.778 | DEBUG    | pipecat.pipeline.runner:run:27 - Runner PipelineRunner#0 started running PipelineTask#0
2024-11-20 21:37:39.805 | INFO     | pipecat.services.deepgram:_connect:194 - DeepgramSTTService#0: Connected to Deepgram
2024-11-20 21:37:40.230 | DEBUG    | pipecat.audio.mixers.soundfile_mixer:_load_sound_file:106 - Loading background sound from assets/office-ambience.mp3
2024-11-20 21:37:40.313 | DEBUG    | pipecat.audio.mixers.soundfile_mixer:_load_sound_file:111 - Resampling background sound to 8000
^C^C^Z
zsh: suspended  python3 s.py