pipecat-ai / pipecat

Open Source framework for voice and multimodal conversational AI
BSD 2-Clause "Simplified" License

say one thing with ElevenLabsTTSService doesn't work anymore #570

Open durandom opened 2 weeks ago

durandom commented 2 weeks ago

I replaced the CartesiaTTSService with ElevenLabsTTSService in the https://github.com/pipecat-ai/pipecat/blob/main/examples/foundational/01-say-one-thing.py example, but that doesn't work anymore with 0.0.43.

Here's some logging output.

❯ pipenv run python ./pipecat/01-say-one-thing.py
Loading .env environment variables...
2024-10-11 16:15:56.950 | DEBUG    | pipecat.processors.frame_processor:link:134 - Linking PipelineSource#0 -> ElevenLabsTTSService#0
2024-10-11 16:15:56.950 | DEBUG    | pipecat.processors.frame_processor:link:134 - Linking ElevenLabsTTSService#0 -> DailyOutputTransport#0
2024-10-11 16:15:56.950 | DEBUG    | pipecat.processors.frame_processor:link:134 - Linking DailyOutputTransport#0 -> PipelineSink#0
2024-10-11 16:15:56.950 | DEBUG    | pipecat.processors.frame_processor:link:134 - Linking Source#0 -> Pipeline#0
2024-10-11 16:15:56.950 | DEBUG    | pipecat.processors.frame_processor:link:134 - Linking Pipeline#0 -> Sink#0
2024-10-11 16:15:56.950 | DEBUG    | pipecat.pipeline.runner:run:27 - Runner PipelineRunner#0 started running PipelineTask#0
2024-10-11 16:15:56.950 | DEBUG    | pipecat.services.elevenlabs:_connect:301 - Language code [en] not applied. Language codes can only be used with the 'eleven_turbo_v2_5' model.
2024-10-11 16:15:57.143 | INFO     | pipecat.transports.services.daily:join:299 - Joining https://b4mad.daily.co/.....
2024-10-11 16:15:58.041 | INFO     | pipecat.transports.services.daily:on_participant_joined:523 - Participant joined f86d7746-c249-447e-9bb3-45192846b589
2024-10-11 16:15:58.786 | INFO     | pipecat.transports.services.daily:join:318 - Joined https://b4mad.daily.co/...
2024-10-11 16:15:58.786 | DEBUG    | pipecat.services.elevenlabs:run_tts:381 - Generating TTS: [Hello there, Marcel!]
2024-10-11 16:15:58.786 | DEBUG    | pipecat.transports.base_output:_bot_started_speaking:333 - Bot started speaking
2024-10-11 16:16:00.790 | DEBUG    | pipecat.transports.base_output:_bot_stopped_speaking:338 - Bot stopped speaking
^C2024-10-11 16:16:07.783 | WARNING  | pipecat.pipeline.runner:_sig_handler:51 - Interruption detected. Canceling runner PipelineRunner#0
2024-10-11 16:16:07.783 | DEBUG    | pipecat.pipeline.runner:cancel:38 - Canceling runner PipelineRunner#0
2024-10-11 16:16:07.783 | DEBUG    | pipecat.pipeline.task:cancel:118 - Canceling pipeline task PipelineTask#0
2024-10-11 16:16:07.905 | INFO     | pipecat.transports.services.daily:leave:406 - Leaving https://b4mad.daily.co/...
2024-10-11 16:16:07.933 | INFO     | pipecat.transports.services.daily:leave:415 - Left https://b4mad.daily.co/...
2024-10-11 16:16:07.934 | DEBUG    | pipecat.pipeline.runner:run:31 - Runner PipelineRunner#0 finished running PipelineTask#0

And here's the modified example:

#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#

import asyncio
import aiohttp
import os
import sys

from pipecat.frames.frames import TextFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask
from pipecat.pipeline.runner import PipelineRunner
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport

from runner import configure

from loguru import logger

from dotenv import load_dotenv
load_dotenv(override=True)

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")

async def main():
    async with aiohttp.ClientSession() as session:
        (room_url, _) = await configure(session)

        transport = DailyTransport(
            room_url, None, "Say One Thing", DailyParams(audio_out_enabled=True))

        # tts = CartesiaTTSService(
        #     api_key=os.getenv("CARTESIA_API_KEY"),
        #     voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
        # )
        tts = ElevenLabsTTSService(
            aiohttp_session=session,
            api_key=os.getenv("ELEVENLABS_API_KEY"),
            voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
            model="eleven_multilingual_v2",
        )

        runner = PipelineRunner()

        task = PipelineTask(Pipeline([tts, transport.output()]))

        # Register an event handler so we can play the audio when the
        # participant joins.
        @transport.event_handler("on_participant_joined")
        async def on_new_participant_joined(transport, participant):
            participant_name = participant["info"]["userName"] or ''
            await task.queue_frame(TextFrame(f"Hello there, {participant_name}!"))

        await runner.run(task)

if __name__ == "__main__":
    asyncio.run(main())

durandom commented 2 weeks ago

If I add a

await task.queue_frame(LLMFullResponseEndFrame())

after the TextFrame, then it works 🤷
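
For reference, applied to the example above the workaround looks roughly like this (a sketch; it assumes LLMFullResponseEndFrame is imported from pipecat.frames.frames alongside TextFrame):

from pipecat.frames.frames import LLMFullResponseEndFrame, TextFrame

@transport.event_handler("on_participant_joined")
async def on_new_participant_joined(transport, participant):
    participant_name = participant["info"]["userName"] or ''
    # Queue the greeting, then mark the "response" as complete so the
    # streaming TTS service flushes and generates audio for it.
    await task.queue_frame(TextFrame(f"Hello there, {participant_name}!"))
    await task.queue_frame(LLMFullResponseEndFrame())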

BrianMwas commented 1 week ago

I am also getting this issue. This is my sample code:

    try:
        async with aiohttp.ClientSession() as session:
            (room_url, token) = await configure(session)
        logger.debug(f"Getting the room and token from the session {room_url} and {token}")
        transport = DailyTransport(
            room_url, token, "ChatBot", DailyParams(
                audio_in_enabled=True,
                audio_out_enabled=True,
                vad_enabled=True,
                vad_analyzer=SileroVADAnalyzer(),
                transcription_enabled=True,
                vad_audio_passthrough=True
            )
        )

        tts = DeepgramSTTService(
            api_key=config("DEEPGRAM_API_KEY"),
            live_options=LiveOptions(
                encoding="linear16",
                model="nova-2-conversationalai",
                sample_rate=16000,
                channels=1,
                interim_results=True,
                smart_format=True,
                punctuate=True,
                profanity_filter=True,
                vad_events=True,
            )
        )

        llm = OpenAILLMService(api_key=config("OPENAI_API_KEY"), model="gpt-4o")
        messages = [
            {
                "role": "system",
                "content": "You are Chatbot, a friendly, helpful robot. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way, but keep your responses brief. Start by introducing yourself."
            }
        ]

        user_response = LLMUserResponseAggregator()
        assistant_response = LLMAssistantResponseAggregator()

        pipeline = Pipeline(
            [
                transport.input(),
                user_response,
                llm,
                tts,
                transport.output(),
                assistant_response
            ]
        )

        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True, enable_metrics=True))

        @transport.event_handler("on_first_participant_joined")
        async def on_first_participant_joined(trans, participant):
            logging.info(f"a participant joined {participant}")
            logging.info(f"we are getting the response {trans}")
            transport.capture_participant_transcription(participant["id"])
            await task.queue_frames([LLMMessagesFrame(messages),])

        @transport.event_handler("on_participant_left")
        async def on_participant_left(trans, participant, reason):
            print(f"Participant left: {participant}")
            logging.info(f"results on leaving the info {trans}")
            await task.queue_frame(EndFrame())

        runner = PipelineRunner()
        await runner.run(task)

    except Exception as e:
        import traceback
        logger.error(f"An error occurred: {str(e)}")
        logger.error(f"we found an issue {traceback.format_exc()}")

if __name__ == '__main__':
    logging.info("we are running")
    # execute only if run as the entry point into the program
    asyncio.run(main()) 

The idea was just to get it to respond, but it never gives a response. It only connects via Daily: I can see the chatbot and the chatbot's details, but there is never a response. Deepgram, however, shows that my credits have been spent; yesterday it went from $0 to $10 in one conversation. So I'm not sure what is consuming so much.

BrianMwas commented 1 week ago

If I add a

await task.queue_frame(LLMFullResponseEndFrame())

after the TextFrame, then it works 🤷

@durandom did you get it to work? I really need to complete this part of the app.

durandom commented 1 week ago

Yes, queueing the LLMFullResponseEndFrame worked for me. See https://github.com/b4mad/mds-moderator/blob/6f4b37453a6e978f4578feec2c28c714430937a1/participant.py#L85

danthegoodman1 commented 1 week ago

Same issue, had to replace EndFrame with LLMFullResponseEndFrame

danthegoodman1 commented 1 week ago

Seems you can also just remove the end frame

aconchillo commented 1 week ago

I replaced the CartesiaTTSService with ElevenLabsTTSService in the https://github.com/pipecat-ai/pipecat/blob/main/examples/foundational/01-say-one-thing.py example, but that doesn't work anymore with 0.0.43.

The reason is that CartesiaHttpTTSService blocks on the HTTP request to get the audio, and no other frames will be pushed before the generated audio frames. That is, you will get a bunch of audio frames and then the EndFrame, which will make things close properly and end the application.

If we replace CartesiaHttpTTSService with something that works asynchronously, like ElevenLabsTTSService, adding an EndFrame will cause the app to stop right away. That's because we have no idea when ElevenLabs will give us audio or when the audio will end.

So for this specific use case, you really need to use a TTS service that uses HTTP.

In normal applications, you would probably send an EndFrame() when the user disconnects, for example.
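
For the say-one-thing case, that would mean queueing the EndFrame from a participant-left handler instead of right after the TextFrame. A minimal sketch (the on_participant_left handler signature follows the one used in the code earlier in this thread):

from pipecat.frames.frames import EndFrame

@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant, reason):
    # End the pipeline only once the user has disconnected, so the
    # asynchronous TTS service has time to deliver its audio first.
    await task.queue_frame(EndFrame())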

aconchillo commented 1 week ago

I am also getting this issue. This is my sample code:

.......
        tts = DeepgramSTTService(
            api_key=config("DEEPGRAM_API_KEY"),
            live_options=LiveOptions(
                encoding="linear16",
                model="nova-2-conversationalai",
                sample_rate=16000,
                channels=1,
                interim_results=True,
                smart_format=True,
                punctuate=True,
                profanity_filter=True,
                vad_events=True,
            )
        )
 .......

The idea was just to get it to respond, but it never gives a response. It only connects via Daily: I can see the chatbot and the chatbot's details, but there is never a response. Deepgram, however, shows that my credits have been spent; yesterday it went from $0 to $10 in one conversation. So I'm not sure what is consuming so much.

The issue in this code is that you are using DeepgramSTTService instead of DeepgramTTSService. Your variable is properly named tts, but you are using the wrong service.
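
In other words, something along these lines (a sketch only; the exact DeepgramTTSService constructor arguments and the voice name depend on the pipecat version and your Deepgram setup):

from pipecat.services.deepgram import DeepgramTTSService

# Use the Deepgram TTS service for the tts step of the pipeline;
# DeepgramSTTService is speech-to-text, so no audio is ever generated
# for the LLM's text output.
tts = DeepgramTTSService(
    api_key=config("DEEPGRAM_API_KEY"),
    voice="aura-helios-en",  # illustrative voice id, check Deepgram's docs
)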

aconchillo commented 1 week ago

So, to recap, a couple of issues were discussed here:

  1. EndFrame doesn't really work well in the 01-say-one-thing.py example because of what I explained here: https://github.com/pipecat-ai/pipecat/issues/570#issuecomment-2421012277
  2. Typo: DeepgramSTTService was being used as the TTS service instead of DeepgramTTSService.
  3. I don't think LLMFullResponseEndFrame should really be needed. Those are kind of internal frames and are generated by the LLM service. But maybe there's an issue... :thinking:

aconchillo commented 1 week ago

This PR changes the first examples a bit so they don't send an EndFrame right away, but only when the user leaves: https://github.com/pipecat-ai/pipecat/pull/613

danthegoodman1 commented 1 week ago

I replaced the CartesiaTTSService with ElevenLabsTTSService in the https://github.com/pipecat-ai/pipecat/blob/main/examples/foundational/01-say-one-thing.py example, but that doesn't work anymore with 0.0.43.

The reason is that CartesiaHttpTTSService blocks on the HTTP request to get the audio, and no other frames will be pushed before the generated audio frames. That is, you will get a bunch of audio frames and then the EndFrame, which will make things close properly and end the application.

If we replace CartesiaHttpTTSService with something that works asynchronously, like ElevenLabsTTSService, adding an EndFrame will cause the app to stop right away. That's because we have no idea when ElevenLabs will give us audio or when the audio will end.

So for this specific use case, you really need to use a TTS service that uses HTTP.

In normal applications, you would probably send an EndFrame() when the user disconnects, for example.

This should be made really clear in the docs

aconchillo commented 1 week ago

I replaced the CartesiaTTSService with ElevenLabsTTSService in the https://github.com/pipecat-ai/pipecat/blob/main/examples/foundational/01-say-one-thing.py example, but that doesn't work anymore with 0.0.43.

The reason is that CartesiaHttpTTSService blocks on the HTTP request to get the audio, and no other frames will be pushed before the generated audio frames. That is, you will get a bunch of audio frames and then the EndFrame, which will make things close properly and end the application. If we replace CartesiaHttpTTSService with something that works asynchronously, like ElevenLabsTTSService, adding an EndFrame will cause the app to stop right away. That's because we have no idea when ElevenLabs will give us audio or when the audio will end. So for this specific use case, you really need to use a TTS service that uses HTTP. In normal applications, you would probably send an EndFrame() when the user disconnects, for example.

This should be made really clear in the docs

Totally agree. 😞 We'll get there! 💪