pipecat-ai / pipecat

Open Source framework for voice and multimodal conversational AI
BSD 2-Clause "Simplified" License

[Question] Example request. LocalAudioTransport + Whisper + llm + tts #197

Closed gaceladri closed 5 months ago

gaceladri commented 6 months ago

Hi 👋

I am having trouble running a local example that integrates LocalAudioTransport, WhisperSTTService, ElevenLabsTTSService, and OpenAILLMService.

I have successfully managed to run Whisper locally for transcription and another script that uses Eleven Labs and OpenAI for TTS and LLM services, respectively. However, I am struggling to combine these components to create a fully functional local conversation system.

To illustrate, here are the two examples I have working independently:

Example 1: Passing an LLM message to the TTS provider:

import asyncio
import os
import sys

import aiohttp
from loguru import logger
from pipecat.frames.frames import EndFrame, LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.local.audio import LocalAudioTransport

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")

async def main():
    async with aiohttp.ClientSession() as session:
        transport = LocalAudioTransport(TransportParams(audio_out_enabled=True))

        tts = ElevenLabsTTSService(
            aiohttp_session=session,
            api_key=os.getenv("ELEVENLABS_API_KEY"),
            voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
        )

        llm = OpenAILLMService(
            api_key=os.getenv("OPENAI_API_KEY"),
            model="gpt-3.5-turbo-0125",
        )

        messages = [
            {
                "role": "system",
                "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
            },
        ]

        pipeline = Pipeline([llm, tts, transport.output()])

        task = PipelineTask(pipeline)

        async def say_something():
            await asyncio.sleep(1)
            await task.queue_frames([LLMMessagesFrame(messages), EndFrame()])

        runner = PipelineRunner()

        await asyncio.gather(runner.run(task), say_something())

if __name__ == "__main__":
    asyncio.run(main())

Example 2: Using Whisper locally:

import asyncio
import sys

from loguru import logger
from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.services.whisper import Model, WhisperSTTService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.local.audio import LocalAudioTransport

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")

class TranscriptionLogger(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        if isinstance(frame, TranscriptionFrame):
            print(f"Transcription: {frame.text}")

async def main():
    transport = LocalAudioTransport(TransportParams(audio_in_enabled=True))

    stt = WhisperSTTService()

    tl = TranscriptionLogger()

    pipeline = Pipeline([transport.input(), stt, tl])

    task = PipelineTask(pipeline)

    runner = PipelineRunner()

    await runner.run(task)

if __name__ == "__main__":
    asyncio.run(main())

Despite these individual successes, I'm unable to connect the transcriptions with the LLM and have a continuous conversation. Could you provide or add an example of a fully working local setup that demonstrates how to achieve this?
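Conceptually, the combination I am aiming for looks roughly like this (just a sketch of how I imagine the two examples fitting together; user_aggregator and assistant_aggregator are placeholder names for whatever processors should turn transcriptions into LLM messages, not working code):

# Sketch only: how I imagine the two examples above combining into one pipeline.
# user_aggregator / assistant_aggregator are placeholders, not a real implementation.
pipeline = Pipeline(
    [
        transport.input(),     # local microphone audio in
        stt,                   # WhisperSTTService -> TranscriptionFrame
        user_aggregator,       # turn user transcriptions into LLM messages
        llm,                   # OpenAILLMService
        tts,                   # ElevenLabsTTSService
        transport.output(),    # local speaker audio out
        assistant_aggregator,  # record the bot's answer back into the context
    ]
)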

Thank you!

ajram23 commented 6 months ago

I got it running, DM me and I will give you the script.

gaceladri commented 6 months ago

@ajram23 Can you paste it here? 🙏 I didn't know you could DM someone on GitHub!

ajram23 commented 6 months ago

Here you go: 07-interruptible-local.py.txt (attached). Enjoy! cc @aconchillo just in case you want to add this to the examples folder.

gaceladri commented 5 months ago

@ajram23 Thank you for your example! It's quite similar to what I have implemented. Were you able to interact with the LLM? In my case, I can see the initial message from the LLM, but I seem to have an issue with the communication between the Whisper service and the LLMUserResponseAggregator.

Here is my current code:

#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#

import asyncio
import os
import sys

import aiohttp
from loguru import logger
from pipecat.frames.frames import Frame, LLMMessagesFrame, TranscriptionFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_response import (
    LLMAssistantResponseAggregator,
    LLMUserResponseAggregator,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.whisper import WhisperSTTService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.local.audio import LocalAudioTransport
from pipecat.vad.silero import SileroVADAnalyzer
from pipecat.vad.vad_analyzer import VADParams

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")

class TranscriptionLogger(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        if isinstance(frame, TranscriptionFrame):
            logger.debug(f"Whisper transcription: {frame.text}")

async def main():
    async with aiohttp.ClientSession() as session:
        transport = LocalAudioTransport(
            TransportParams(
                audio_in_enabled=True,
                audio_out_enabled=True,
                transcription_enabled=True,
                vad_enabled=True,
                vad_analyzer=SileroVADAnalyzer(params=VADParams(min_volume=0.6)),
                vad_audio_passthrough=True,
            )
        )

        stt = WhisperSTTService(no_speech_prob=0.6)

        tts = ElevenLabsTTSService(
            aiohttp_session=session,
            api_key=os.getenv("ELEVENLABS_API_KEY"),
            voice_id="2ovNLFOsfyKPEWV5kqQi",
        )

        llm = OpenAILLMService(
            api_key=os.getenv("OPENAI_API_KEY"),
            model="gpt-3.5-turbo-0125",
        )

        messages = [
            {
                "role": "system",
                "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
            },
        ]

        tma_in = LLMUserResponseAggregator(messages)
        tma_out = LLMAssistantResponseAggregator(messages)

        pipeline = Pipeline(
            [
                transport.input(),  # Transport user input
                stt,  # STT
                tma_in,  # User responses
                llm,  # LLM
                tts,  # TTS
                transport.output(),  # Transport bot output
                tma_out,  # Assistant spoken responses
            ]
        )

        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))

        runner = PipelineRunner()

        async def say_something():
            messages.append(
                {"role": "system", "content": "Please introduce yourself to the user."}
            )
            await task.queue_frames([LLMMessagesFrame(messages)])

        await asyncio.gather(runner.run(task), say_something())

if __name__ == "__main__":
    asyncio.run(main())

Here are my pipeline debug messages:

2024-06-02 09:25:37.876 | DEBUG    | pipecat.services.whisper:_load:67 - Loaded Whisper model
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking PipelineSource#0 -> LocalAudioInputTransport#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking LocalAudioInputTransport#0 -> WhisperSTTService#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking WhisperSTTService#0 -> LLMUserResponseAggregator#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking LLMUserResponseAggregator#0 -> OpenAILLMService#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking OpenAILLMService#0 -> ElevenLabsTTSService#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking ElevenLabsTTSService#0 -> LocalAudioOutputTransport#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking LocalAudioOutputTransport#0 -> LLMAssistantResponseAggregator#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking LLMAssistantResponseAggregator#0 -> PipelineSink#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking Source#0 -> Pipeline#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.pipeline.runner:run:29 - Runner PipelineRunner#0 started running PipelineTask#0
2024-06-02 09:25:37.940 | DEBUG    | pipecat.services.openai:_stream_chat_completions:69 - Generating chat: [{"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.", "role": "system", "name": "system"}, {"content": "Please introduce yourself to the user.", "role": "system", "name": "system"}]
2024-06-02 09:25:38.694 | DEBUG    | pipecat.services.openai:_stream_chat_completions:96 - OpenAI LLM TTFB: 0.7542369365692139
2024-06-02 09:25:38.719 | DEBUG    | pipecat.services.elevenlabs:run_tts:35 - Transcribing text: [Hello!]
2024-06-02 09:25:39.323 | DEBUG    | pipecat.services.elevenlabs:run_tts:35 - Transcribing text: [I am your helpful Legal Language Model here to assist you during this WebRTC call.]
2024-06-02 09:25:40.177 | DEBUG    | pipecat.services.elevenlabs:run_tts:35 - Transcribing text: [How can I help you today?]
2024-06-02 09:25:42.710 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  Okay. 
2024-06-02 09:25:44.907 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  Okay. 
2024-06-02 09:25:47.174 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  you're 
2024-06-02 09:25:48.758 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  Okay. 
2024-06-02 09:25:51.009 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  I'm 
2024-06-02 09:25:53.277 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  Okay. 
2024-06-02 09:25:54.880 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  I did. 
2024-06-02 09:26:13.526 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  Hey.

For some reason, the transcriptions from Whisper are not being passed to the LLMUserResponseAggregator. I've added print statements inside the LLMUserResponseAggregator to check the messages, but nothing is logged after the Whisper model transcribes the speech.
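To narrow down where the frames stop, one thing I can try (a minimal sketch, assuming the FrameLogger helper from pipecat.processors.logger behaves as in the foundational examples, i.e. it logs every frame it sees and passes it downstream) is to wedge loggers around the aggregator:

from pipecat.processors.logger import FrameLogger

# Debugging variant of the pipeline above (sketch only): each FrameLogger should
# print every frame it receives and forward it unchanged, so the point where
# TranscriptionFrames stop flowing becomes visible in the logs.
pipeline = Pipeline(
    [
        transport.input(),
        stt,
        FrameLogger("!!! after STT"),         # expect TranscriptionFrame here
        tma_in,
        FrameLogger("!!! after aggregator"),  # expect LLMMessagesFrame here
        llm,
        tts,
        transport.output(),
        tma_out,
    ]
)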

I'm running this on a Mac M2.

Any insights or suggestions on what might be going wrong would be greatly appreciated!

Thank you for your help!

ajram23 commented 5 months ago

@gaceladri in my case I was able to; not sure what is going on with yours.

gaceladri commented 5 months ago

> @gaceladri in my case I was able to; not sure what is going on with yours.

Ok, thank you for the support and feedback!

alexovai commented 4 months ago

@gaceladri I am in the same boat. I can run the 06-listen-and-respond.py example, but when I change the transport from DailyTransport to LocalAudioTransport I am not able to get it working. I am running on a Mac with pipecat==0.0.36. I also looked at #244 and I believe the issue is the same. There is something funny going on with LocalAudioTransport.
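For reference, the only change I made to the example is roughly this transport swap (a sketch from memory, reusing the same TransportParams that appear earlier in this thread; exact parameters may differ in 0.0.36):

# Sketch of the swap: example 06 creates a DailyTransport, which I replaced
# with a local transport. The rest of the pipeline is unchanged.
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.local.audio import LocalAudioTransport
from pipecat.vad.silero import SileroVADAnalyzer

transport = LocalAudioTransport(
    TransportParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
        vad_audio_passthrough=True,
    )
)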

After the LLM introduces itself and I respond, I get the following messages and then it stops.

:25 - > ### out of the end: LLMMessagesFrame#1
2024-07-10 15:43:01.594 | DEBUG | pipecat.processors.logger:process_frame:25 - > !!! after LLM: UserStartedSpeakingFrame#0
2024-07-10 15:43:01.595 | DEBUG | pipecat.processors.logger:process_frame:25 - > @@@ out of tts: UserStartedSpeakingFrame#0
2024-07-10 15:43:01.596 | DEBUG | pipecat.processors.logger:process_frame:25 - > ### out of the end: UserStartedSpeakingFrame#0
2024-07-10 15:43:03.752 | DEBUG | pipecat.processors.logger:process_frame:25 - > !!! after LLM: UserStoppedSpeakingFrame#0
2024-07-10 15:43:03.752 | DEBUG | pipecat.processors.logger:process_frame:25 - > @@@ out of tts: UserStoppedSpeakingFrame#0
2024-07-10 15:43:03.752 | DEBUG | pipecat.processors.logger:process_frame:25 - > ### out of the end: UserStoppedSpeakingFrame#0