openai / openai-realtime-console

React app for inspecting, building and debugging with the Realtime API
MIT License

I think RealtimeRelay should use RealtimeAPI instead of RealtimeClient #462

Open Phodaie opened 1 day ago

Phodaie commented 1 day ago

I have the following version of RealtimeRelay that uses RealtimeAPI instead of RealtimeClient. RealtimeClient has additional functionality (e.g. state management) that is not needed in the relay.


import { WebSocketServer } from 'ws';
import { RealtimeAPI } from '@openai/realtime-api-beta';

export class RealtimeRelay {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.sockets = new WeakMap();
    this.wss = null;
  }

  listen(port) {
    this.wss = new WebSocketServer({ port });
    this.wss.on('connection', this.connectionHandler.bind(this));
    this.log(`Listening on ws://localhost:${port}`);
  }

  async connectionHandler(ws, req) {
    if (!req.url) {
      this.log('No URL provided, closing connection.');
      ws.close();
      return;
    }

    const url = new URL(req.url, `http://${req.headers.host}`);
    const pathname = url.pathname;

    if (pathname !== '/') {
      this.log(`Invalid pathname: "${pathname}"`);
      ws.close();
      return;
    }

    // Instantiate new client
    this.log(`Connecting with key "${this.apiKey.slice(0, 3)}..."`);

    const realtime = new RealtimeAPI({ apiKey: this.apiKey });
    // Relay: OpenAI Realtime API Event -> Browser Event
    realtime.on('server.*', (event) => {
      this.log(`Relaying "${event.type}" to Client`);
      ws.send(JSON.stringify(event));
    });
    realtime.on('close', () => ws.close());

    // Relay: Browser Event -> OpenAI Realtime API Event
    // We need to queue data waiting for the OpenAI connection
    const messageQueue = [];
    const messageHandler = (data) => {
      try {
        const event = JSON.parse(data);
        this.log(`Relaying "${event.type}" to OpenAI`);
        realtime.send(event.type, event);
      } catch (e) {
        console.error(e.message);
        this.log(`Error parsing event from client: ${data}`);
      }
    };
    ws.on('message', (data) => {
      if (!realtime.isConnected()) {
        messageQueue.push(data);
      } else {
        messageHandler(data);
      }
    });
    ws.on('close', () => {
      // Only tear down the upstream connection if it is still open
      if (realtime.isConnected()) {
        realtime.disconnect();
      }
    });

    // Connect to OpenAI Realtime API
    try {
      this.log(`Connecting to OpenAI...`);
      await realtime.connect();
    } catch (e) {
      this.log(`Error connecting to OpenAI: ${e.message}`);
      ws.close();
      return;
    }
    this.log(`Connected to OpenAI successfully!`);
    while (messageQueue.length) {
      messageHandler(messageQueue.shift());
    }
  }

  log(...args) {
    console.log(`[RealtimeRelay --------------------------]`, ...args);
  }
}

radrad commented 16 hours ago

What is it that you are suggesting? Where and what file are you changing?

Phodaie commented 15 hours ago

The RealtimeRelay class in relay.js should be changed so that it uses RealtimeAPI instead of RealtimeClient.

radrad commented 13 hours ago

Below is ChatGPT o1-preview's analysis of the difference between RealtimeAPI and RealtimeClient.

I want to know whether I would need RealtimeClient's interfaces at the relay level, for example to handle function calling on the server side, or to keep conversation state in a database so that the history of multiple chat sessions (the list of messages exchanged between a user and the system) is preserved.

**With this change, would we be limited to browser-only usage of RealtimeClient and whatever it can offer?**

**Can anyone explain what is happening and where, and when (for which use cases) it would be appropriate to have RealtimeClient present on the relay server?**
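On the persistence part of the question, here is a minimal sketch of how history could be recorded at the relay without RealtimeClient. It assumes the relay keeps using the `server.*` wildcard subscription from the code above and that `conversation.item.created` events carry an `item` with `id`, `type`, `role` and `content`. The names `createMemoryStore`, `attachHistoryRecorder` and `sessionId` are hypothetical, and the in-memory store is only a stand-in for a real database.

```javascript
// Hypothetical in-memory store; a real relay would write to a database instead.
function createMemoryStore() {
  const sessions = new Map();
  return {
    append(sessionId, record) {
      if (!sessions.has(sessionId)) sessions.set(sessionId, []);
      sessions.get(sessionId).push(record);
    },
    history(sessionId) {
      return sessions.get(sessionId) ?? [];
    },
  };
}

// Hypothetical helper: records every conversation item the relay sees.
// `realtime` only needs an on(eventName, callback) method, as used in the relay.
function attachHistoryRecorder(realtime, store, sessionId) {
  realtime.on('server.*', (event) => {
    // Assumption: items are announced through "conversation.item.created".
    if (event.type !== 'conversation.item.created') return;
    const { id, type, role, content } = event.item;
    store.append(sessionId, {
      id,
      type,
      role: role ?? null,
      content: content ?? [],
    });
  });
}
```

This only captures what is present when the item is created. Text or transcripts that arrive later through separate events would need additional handlers, which is roughly the bookkeeping RealtimeClient does for you.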

RealtimeAPI

Purpose: Provides a low-level interface for connecting to the OpenAI Realtime API via WebSocket.

Functionality: Handles basic WebSocket operations such as connecting, sending, and receiving messages. Dispatches events but does not manage state or conversation context. Designed for scenarios where you need direct and minimal control over the communication with the Realtime API.

Use Case: Ideal for intermediary services like the RealtimeRelay, which simply forwards messages between a client and the OpenAI Realtime API without the need for additional state management or conversation handling.

RealtimeClient

Purpose: Offers a higher-level abstraction built on top of the RealtimeAPI.

Functionality: The RealtimeClient is a high-level client library designed to simplify and enhance interactions with the OpenAI Realtime API. It builds upon the foundational capabilities of the RealtimeAPI by adding advanced features and abstractions that facilitate the development of complex, stateful, and multi-modal conversational applications.
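To make the layering concrete: a higher-level feature such as session configuration should ultimately reduce to an event sent through the low-level API. The sketch below is an illustration under assumptions, not the library's implementation. It assumes `realtime` exposes `send(eventName, payload)` as in the relay code earlier in this thread, and that the API accepts a `session.update` event with a `session` object. The helper name `applySessionConfig` and the default values are hypothetical.

```javascript
// Hypothetical helper: pushes relay-side session settings over the low-level API.
function applySessionConfig(realtime, overrides = {}) {
  const session = {
    modalities: ['text', 'audio'],                // assumed field name
    instructions: 'You are a helpful assistant.', // placeholder text
    ...overrides,                                 // caller-supplied settings win
  };
  // One plain event; no client-side state is kept here.
  realtime.send('session.update', { session });
  return session;
}
```

A relay could call this once after `realtime.connect()` resolves, which would keep settings such as instructions on the server rather than in the browser.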

Key Features and Functionalities:

Session Management: Allows developers to configure and manage sessions, including setting modalities (e.g., text, audio), instructions, voice options, and temperature settings. Handles session updates and ensures consistent configurations across interactions.

Conversation State Maintenance: Maintains the state of conversations across multiple interactions, enabling context preservation and multi-turn dialogues. Manages conversation history and item tracking, which is crucial for applications that require an understanding of prior exchanges.

Audio Transcription Handling: Integrates audio transcription capabilities, enabling the processing of audio inputs (speech-to-text) and generating audio outputs (text-to-speech). Supports different audio formats and transcription models, facilitating seamless audio interactions.

Turn Detection: Includes mechanisms for detecting when a user has finished speaking, which is essential for natural conversational flow in voice-based applications. Supports server-side Voice Activity Detection (VAD) to manage speech start and stop events.

Tool Integration: Allows the incorporation of custom tools or functions that the assistant can call during a conversation. Manages tool definitions and handlers, enabling dynamic responses and actions based on user input.

Utilities for Content Handling: Provides methods and structures to handle various content types, such as text messages, audio clips, and function calls. Manages input and output content, ensuring proper formatting and delivery.

Comprehensive Event Handling: Includes advanced event handling to manage conversation flow, response generation, and error handling. Dispatches events for conversation updates, item completions, and other significant actions, allowing developers to respond appropriately.

Use Cases:

Complex Conversational Applications: Ideal for applications that require advanced conversation management, such as virtual assistants, chatbots, or customer service agents.

Multi-Modal Interactions: Suitable for applications that need to handle multiple input and output modalities, including text and audio.

Stateful Interactions: Essential for applications where maintaining context across multiple exchanges enhances the user experience.

Tool-Enabled Conversations: Beneficial for applications that require dynamic functionality, where the assistant can perform actions or provide information by invoking custom tools or functions.

Conclusion:

The RealtimeClient serves as a comprehensive solution for developers aiming to build sophisticated conversational interfaces using the OpenAI Realtime API. By abstracting lower-level details and providing robust features for session and conversation management, it empowers developers to focus on crafting engaging and effective user experiences without worrying about the complexities of direct API interactions.
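For the server-side function calling raised earlier in the thread, a rough sketch of what a relay built on the low-level API might do is shown below. It assumes completed function calls arrive as `response.output_item.done` events whose `item` has `type: 'function_call'` plus `name`, `call_id` and `arguments`, and that results go back as a `conversation.item.create` event followed by `response.create`. The registry `exampleTools`, its `get_epoch` tool and the helper `handleServerEvent` are hypothetical names used only for illustration.

```javascript
// Hypothetical registry of server-side tools, keyed by function name.
const exampleTools = {
  get_epoch: async () => ({ iso: new Date(0).toISOString() }),
};

// Hypothetical helper: runs a tool on the relay when the model requests it.
// Resolves to true if the event was a function call handled here.
async function handleServerEvent(realtime, tools, event) {
  if (event.type !== 'response.output_item.done') return false;
  const item = event.item;
  if (!item || item.type !== 'function_call' || !tools[item.name]) return false;

  let output;
  try {
    // Arguments are assumed to arrive as a JSON string.
    const args = JSON.parse(item.arguments || '{}');
    output = await tools[item.name](args);
  } catch (e) {
    // Report failures to the model instead of crashing the relay.
    output = { error: e.message };
  }

  realtime.send('conversation.item.create', {
    item: {
      type: 'function_call_output',
      call_id: item.call_id,
      output: JSON.stringify(output),
    },
  });
  // Ask the model to continue now that the tool result is available.
  realtime.send('response.create', {});
  return true;
}
```

In the relay this could be wired up with `realtime.on('server.*', (event) => handleServerEvent(realtime, exampleTools, event))`. If the browser also registers a handler for the same tool, the call would be answered twice, so only one side should own each tool.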