twilio-labs / call-gpt

Generative AI phone call toolkit using Twilio Media Streams.
MIT License
307 stars 132 forks source link

Call GPT: Generative AI Phone Calling

Wouldn't it be neat if you could build an app that allowed you to chat with ChatGPT on the phone?

Twilio gives you a superpower called Media Streams. Media Streams provides a Websocket connection to both sides of a phone call. You can get audio streamed to you, process it, and send audio back.

This app serves as a demo exploring two services:

These service combine to create a voice application that is remarkably better at transcribing, understanding, and speaking than traditional IVR systems.

Features:

Setting up for Development

Prerequisites

Sign up for the following services and get an API key for each:

If you're hosting the app locally, we also recommend using a tunneling service like ngrok so that Twilio can forward audio to your app.

1. Start Ngrok

Start an ngrok tunnel for port 3000:

ngrok http 3000

Ngrok will give you a unique URL, like abc123.ngrok.io. Copy the URL without http:// or https://. You'll need this URL in the next step.

2. Configure Environment Variables

Copy .env.example to .env and configure the following environment variables:

# Your ngrok or server URL
# E.g. 123.ngrok.io or myserver.fly.dev (exlude https://)
SERVER="yourserverdomain.com"

# Service API Keys
OPENAI_API_KEY="sk-XXXXXX"
DEEPGRAM_API_KEY="YOUR-DEEPGRAM-API-KEY"

# Configure your Twilio credentials if you want
# to make test calls using '$ npm test'.
TWILIO_ACCOUNT_SID="YOUR-ACCOUNT-SID"
TWILIO_AUTH_TOKEN="YOUR-AUTH-TOKEN"
FROM_NUMBER='+12223334444'
TO_NUMBER='+13334445555'

3. Install Dependencies with NPM

Install the necessary packages:

npm install

4. Start Your Server in Development Mode

Run the following command:

npm run dev

This will start your app using nodemon so that any changes to your code automatically refreshes and restarts the server.

5. Configure an Incoming Phone Number

Connect a phone number using the Twilio Console.

You can also use the Twilio CLI:

twilio phone-numbers:update +1[your-twilio-number] --voice-url=https://your-server.ngrok.io/incoming

This configuration tells Twilio to send incoming call audio to your app when someone calls your number. The app responds to the incoming call webhook with a Stream TwiML verb that will connect an audio media stream to your websocket server.

Application Workflow

CallGPT coordinates the data flow between multiple different services including Deepgram, OpenAI, and Twilio Media Streams: Call GPT Flow

Modifying the ChatGPT Context & Prompt

Within gpt-service.js you'll find the settings for the GPT's initial context and prompt. For example:

this.userContext = [
  { "role": "system", "content": "You are an outbound sales representative selling Apple Airpods. You have a youthful and cheery personality. Keep your responses as brief as possible but make every attempt to keep the caller on the phone without being rude. Don't ask more than 1 question at a time. Don't make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous. Speak out all prices to include the currency. Please help them decide between the airpods, airpods pro and airpods max by asking questions like 'Do you prefer headphones that go in your ear or over the ear?'. If they are trying to choose between the airpods and airpods pro try asking them if they need noise canceling. Once you know which model they would like ask them how many they would like to purchase and try to get them to place an order. Add a 'β€’' symbol every 5 to 10 words at natural pauses where your response can be split for text to speech." },
  { "role": "assistant", "content": "Hello! I understand you're looking for a pair of AirPods, is that correct?" },
],

About the system Attribute

The system attribute is background information for the GPT. As you build your use-case, play around with modifying the context. A good starting point would be to imagine training a new employee on their first day and giving them the basics of how to help a customer.

There are some context prompts that will likely be helpful to include by default. For example:

These context items help shape a GPT so that it will act more naturally in a phone conversation.

The β€’ symbol context in particular is helpful for the app to be able to break sentences into natural chunks. This speeds up text-to-speech processing so that users hear audio faster.

About the content Attribute

This attribute is your default conversations starter for the GPT. However, you could consider making it more complex and customized based on personalized user data.

In this case, our bot will start off by saying, "Hello! I understand you're looking for a pair of AirPods, is that correct?"

Using Function Calls with GPT

You can use function calls to interact with external APIs and data sources. For example, your GPT could check live inventory, check an item's price, or place an order.

How Function Calling Works

Function calling is handled within the gpt-service.js file in the following sequence:

  1. gpt-service loads function-manifest.js and requires (imports) all functions defined there from the functions directory. Our app will call these functions later when GPT gives us a function name and parameters.

    tools.forEach((tool) => {
    const functionName = tool.function.name;
    availableFunctions[functionName] = require(`../functions/${functionName}`);
    });
  2. When we call GPT for completions, we also pass in the same function-manifest JSON as the tools parameter. This allows the GPT to "know" what functions are available:

const stream = await this.openai.chat.completions.create({
  model: 'gpt-4',
  messages: this.userContext,
  tools, // <-- function-manifest definition
  stream: true,
});
  1. When the GPT responds, it will send us a stream of chunks for the text completion. The GPT will tell us whether each text chunk is something to say to the user, or if it's a tool call that our app needs to execute. This is indicated by the deltas.tool_calls key:
    if (deltas.tool_calls) {
    // handle function calling
    }
  2. Once we have gathered all of the stream chunks about the tool call, our application can run the actual function code that we imported during the first step. The function name and parameters are provided by GPT:
    const functionToCall = availableFunctions[functionName];
    const functionResponse = functionToCall(functionArgs);
  3. As the final step, we add the function response data into the conversation context like this:
this.userContext.push({
  role: 'function',
  name: functionName,
  content: functionResponse,
});

We then ask the GPT to generate another completion including what it knows from the function call. This allows the GPT to respond to the user with details gathered from the external data source.

Adding Custom Function Calls

You can have your GPT call external data sources by adding functions to the /functions directory. Follow these steps:

  1. Create a function (e.g. checkInventory.js in /functions)
  2. Within checkInventory.js, write a function called checkInventory.
  3. Add information about your function to the function-manifest.js file. This information provides context to GPT about what arguments the function takes.

Important: Your function's name must be the same as the file name that contains the function (excluding the .js extension). For example, our function is called checkInventory so we have named the the file checkInventory.js, and set the name attribute in function-manifest.js to be checkInventory.

Example function manifest entry:

{
  type: "function",
  function: {
    name: "checkInventory",
    say: "Let me check our inventory right now.",
    description: "Check the inventory of airpods, airpods pro or airpods max.",
    parameters: {
      type: "object",
      properties: {
        model: {
          type: "string",
          "enum": ["airpods", "airpods pro", "airpods max"],
          description: "The model of airpods, either the airpods, airpods pro or airpods max",
        },
      },
      required: ["model"],
    },
    returns: {
      type: "object",
      properties: {
        stock: {
          type: "integer",
          description: "An integer containing how many of the model are in currently in stock."
        }
      }
    }
  },
}

Using say in the Function Manifest

The say key in the function manifest allows you to define a sentence for the app to speak to the user before calling a function. For example, if a function will take a long time to call you might say "Give me a few moments to look that up for you..."

Receiving Function Arguments

When ChatGPT calls a function, it will provide an object with multiple attributes as a single argument. The parameters included in the object are based on the definition in your function-manifest.js file.

In the checkInventory example above, model is a required argument, so the data passed to the function will be a single object like this:

{
  model: "airpods pro"
}

For our placeOrder function, the arguments passed will look like this:

{
  model: "airpods pro",
  quantity: 10
}

Returning Arguments to GPT

Your function should always return a value: GPT will get confused when the function returns nothing, and may continue trying to call the function expecting an answer. If your function doesn't have any data to return to the GPT, you should still return a response with an instruction like "Tell the user that their request was processed successfully." This prevents the GPT from calling the function repeatedly and wasting tokens.

Any data that you return to the GPT should match the expected format listed in the returns key of function-manifest.js.

Utility Scripts for Placing Calls

The scripts directory contains two files that allow you to place test calls:

Using Eleven Labs for Text to Speech

Replace the Deepgram API call and array transformation in tts-service.js with the following call to Eleven Labs. Note that sometimes Eleven Labs will hit a rate limit (especially on the free trial) and return 400 errors with no audio (or a clicking sound).

try {
  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM/stream?output_format=ulaw_8000&optimize_streaming_latency=3`,
    {
      method: 'POST',
      headers: {
        'xi-api-key': process.env.XI_API_KEY,
        'Content-Type': 'application/json',
        accept: 'audio/wav',
      },
      body: JSON.stringify({
        model_id: process.env.XI_MODEL_ID,
        text: partialResponse,
      }),
    }
  );

  if (response.status === 200) {
    const audioArrayBuffer = await response.arrayBuffer();
    this.emit('speech', partialResponseIndex, Buffer.from(audioArrayBuffer).toString('base64'), partialResponse, interactionCount);
  } else {
    console.log('Eleven Labs Error:');
    console.log(response);
  }
} catch (err) {
  console.error('Error occurred in XI LabsTextToSpeech service');
  console.error(err);
}

Testing with Jest

Repeatedly calling the app can be a time consuming way to test your tool function calls. This project contains example unit tests that can help you test your functions without relying on the GPT to call them.

Simple example tests are available in the /test directory. To run them, simply run npm run test.

Deploy via Fly.io

Fly.io is a hosting service similar to Heroku that simplifies the deployment process. Given Twilio Media Streams are sent and received from us-east-1, it's recommended to choose Fly's Ashburn, VA (IAD) region.

Deploying to Fly.io is not required to try the app, but can be helpful if your home internet speed is variable.

Modify the app name fly.toml to be a unique value (this must be globally unique).

Deploy the app using the Fly.io CLI:

fly launch

fly deploy

Import your secrets from your .env file to your deployed app:

fly secrets import < .env