
Wyoming Protocol

A peer-to-peer protocol for voice assistants (basically JSONL + PCM audio)

{ "type": "...", "data": { ... }, "data_length": ..., "payload_length": ... }\n
<data_length bytes (optional)>
<payload_length bytes (optional)>

Used in Rhasspy and Home Assistant for communication with voice services.

This is an open standard of the Open Home Foundation.


Format

  1. A JSON object header on a single line, ending with \n (UTF-8, required)
    • type - event type (string, required)
    • data - event data (object, optional)
    • data_length - bytes of additional data (int, optional)
    • payload_length - bytes of binary payload (int, optional)
  2. Additional data (UTF-8, optional)
    • JSON object with additional event-specific data
    • Merged on top of header data
    • Exactly data_length bytes long
    • Immediately follows header \n
  3. Payload
    • Typically PCM audio but can be any binary data
    • Exactly payload_length bytes long
    • Immediately follows the additional data, or the header's \n if there is no additional data
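The framing above can be sketched in a few lines of Python. The helpers below are a minimal sketch of the wire format (not the official library); the `rate`/`width`/`channels` fields in the example chunk are illustrative:

```python
import io
import json

def write_event(stream, event_type, data=None, payload=b""):
    """Write one Wyoming event: a JSON header line, then optional payload bytes."""
    header = {"type": event_type}
    if data:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    stream.write(payload)

def read_event(stream):
    """Read one event; merge any additional data on top of the header's data."""
    header = json.loads(stream.readline().decode("utf-8"))
    data = header.get("data", {})
    if header.get("data_length"):
        extra = json.loads(stream.read(header["data_length"]).decode("utf-8"))
        data = {**data, **extra}  # merged on top of header data
    payload = stream.read(header.get("payload_length", 0))
    return header["type"], data, payload

# Round-trip an audio chunk through an in-memory stream
buf = io.BytesIO()
write_event(buf, "audio-chunk",
            {"rate": 16000, "width": 2, "channels": 1}, b"\x00\x01\x00\x01")
buf.seek(0)
event_type, data, payload = read_event(buf)
```

The same two functions work unchanged over a TCP socket's file object, which is how the protocol is used in practice.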

Event Types

Available events with type and fields.

Audio

Send raw audio and indicate begin/end of audio streams.

Info

Describe available services.

Speech Recognition

Transcribe audio into text.

Text to Speech

Synthesize audio from text.

Wake Word

Detect wake words in an audio stream.

Voice Activity Detection

Detect speech and silence in an audio stream.

Intent Recognition

Recognize intents from text.

Intent Handling

Handle structured intents or text directly.

Audio Output

Play audio stream.

Voice Satellite

Control of one or more remote voice satellites connected to a central server.

Pipelines run on the server, but can also be triggered remotely from a satellite.

Timers

Event Flow

Service Description

  1. describe (required)
  2. info (required)
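Since `describe` carries no data or payload, the request is a single JSON line. A sketch of both sides of the exchange (the `info` data schema shown, an `asr` list of named services, is illustrative rather than normative):

```python
import json

# "describe" has no data or payload: the request is just a one-line header
describe_line = json.dumps({"type": "describe"}).encode("utf-8") + b"\n"

# The service answers with a single "info" event describing what it offers;
# this example data layout is an assumption for illustration only
info_line = b'{"type": "info", "data": {"asr": [{"name": "example-stt"}]}}\n'
info = json.loads(info_line)
```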

Speech to Text

  1. transcribe event with the name of the model or language to use (optional)
  2. audio-start (required)
  3. audio-chunk (required)
    • Send audio chunks until silence is detected
  4. audio-stop (required)
  5. transcript
    • Contains text transcription of spoken audio
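The client side of this flow can be sketched as a generator that yields (header, payload) pairs; the chunk size and the `rate`/`width`/`channels` field names are assumptions for illustration:

```python
def stt_events(pcm, rate=16000, width=2, channels=1, samples_per_chunk=1024):
    """Yield the speech-to-text event sequence as (header, payload) pairs."""
    audio_format = {"rate": rate, "width": width, "channels": channels}
    yield {"type": "transcribe"}, b""           # optionally names a model/language
    yield {"type": "audio-start", "data": audio_format}, b""
    step = samples_per_chunk * width * channels
    for i in range(0, len(pcm), step):
        chunk = pcm[i:i + step]
        yield {"type": "audio-chunk", "data": audio_format,
               "payload_length": len(chunk)}, chunk
    yield {"type": "audio-stop"}, b""
    # The service then replies with a "transcript" event containing the text

# 4096 bytes of silence -> two 2048-byte chunks at the default settings
events = list(stt_events(b"\x00" * 4096))
```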

Text to Speech

  1. synthesize event with text (required)
  2. audio-start
  3. audio-chunk
    • One or more audio chunks
  4. audio-stop
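A sketch of the client side: build the one-line `synthesize` request, then gather the streamed audio back into a single PCM buffer (the nested `voice` field layout is an assumption):

```python
import json

def synthesize_request(text, voice=None):
    """Build the one-line "synthesize" request (voice field layout assumed)."""
    data = {"text": text}
    if voice:
        data["voice"] = {"name": voice}
    return json.dumps({"type": "synthesize", "data": data}).encode("utf-8") + b"\n"

def collect_audio(events):
    """Concatenate the audio-chunk payloads streamed back by the service."""
    return b"".join(payload for header, payload in events
                    if header["type"] == "audio-chunk")

req = synthesize_request("Hello world")
pcm = collect_audio([({"type": "audio-start"}, b""),
                     ({"type": "audio-chunk"}, b"ab"),
                     ({"type": "audio-chunk"}, b"cd"),
                     ({"type": "audio-stop"}, b"")])
```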

Wake Word Detection

  1. detect event with names of wake words to detect (optional)
  2. audio-start (required)
  3. audio-chunk (required)
    • Keep sending audio chunks until a detection is received
  4. detection
    • Sent for each wake word detection
  5. audio-stop (optional)
    • Manually end audio stream
  6. not-detected
    • Sent after audio-stop if no detections occurred
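The client loop for this flow, sketched with a stand-in for the service round-trip (`detect` takes one chunk and returns a wake word name or None; the `ok_nabu` name below is purely illustrative):

```python
def run_wake_detection(chunks, detect):
    """Stream audio chunks until a detection arrives, per the flow above.

    `detect` stands in for the service round-trip; it is an illustrative
    stand-in, not part of the protocol.
    """
    for chunk in chunks:
        name = detect(chunk)
        if name is not None:
            return {"type": "detection", "data": {"name": name}}
    # audio-stop reached with no detections: the service reports not-detected
    return {"type": "not-detected"}

hit = run_wake_detection([b"a", b"b"],
                         lambda c: "ok_nabu" if c == b"b" else None)
miss = run_wake_detection([b"a"], lambda c: None)
```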

Voice Activity Detection

  1. audio-chunk (required)
    • Send audio chunks until silence is detected
  2. voice-started
    • When speech starts
  3. voice-stopped
    • When speech stops
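A toy illustration of the event sequence a VAD service might emit for 16-bit mono chunks; the peak-amplitude test and threshold are illustrative, not part of the protocol:

```python
import struct

def vad_events(chunks, threshold=500):
    """Emit voice-started/voice-stopped for 16-bit mono PCM chunks (toy VAD)."""
    events = []
    in_speech = False
    for chunk in chunks:
        samples = struct.unpack("<%dh" % (len(chunk) // 2), chunk)
        peak = max((abs(s) for s in samples), default=0)
        if not in_speech and peak >= threshold:
            events.append({"type": "voice-started"})
            in_speech = True
        elif in_speech and peak < threshold:
            events.append({"type": "voice-stopped"})
            in_speech = False
    return events

quiet = struct.pack("<4h", 0, 1, -2, 3)
loud = struct.pack("<4h", 0, 2000, -2000, 0)
result = vad_events([quiet, loud, quiet])
```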

Intent Recognition

  1. recognize (required)
  2. intent if successful
  3. not-recognized if not successful

Intent Handling

For structured intents:

  1. intent (required)
  2. handled if successful
  3. not-handled if not successful

For text only:

  1. transcript with text to handle (required)
  2. handled if successful
  3. not-handled if not successful
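Both flows can be served by one handler that dispatches on the incoming event type. A toy sketch (the `name`/`text` data layouts and the example intent are assumptions for illustration):

```python
def handle_event(event):
    """Toy handler covering both intent-handling flows above."""
    if event["type"] == "intent":
        # Structured intent: dispatch on the intent name (layout assumed)
        if event["data"].get("name") == "GetTime":
            return {"type": "handled", "data": {"text": "It is 12 o'clock."}}
        return {"type": "not-handled"}
    if event["type"] == "transcript":
        # Text only: the handler interprets the raw text itself
        if "time" in event["data"].get("text", "").lower():
            return {"type": "handled", "data": {"text": "It is 12 o'clock."}}
        return {"type": "not-handled"}
    raise ValueError("expected an intent or transcript event")
```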

Audio Output

  1. audio-start (required)
  2. audio-chunk (required)
    • One or more audio chunks
  3. audio-stop (required)
  4. played
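A sketch of the playback side: consume the sequence, write PCM to an output sink, and reply with `played` once the stream ends (the sink here is an in-memory buffer standing in for a sound device):

```python
import io

def play_stream(events, sink):
    """Consume audio-start/chunk/stop, write PCM to sink, then report played."""
    started = False
    for header, payload in events:
        if header["type"] == "audio-start":
            started = True
        elif header["type"] == "audio-chunk" and started:
            sink.write(payload)
        elif header["type"] == "audio-stop" and started:
            return {"type": "played"}  # sent once playback has finished
    return None

sink = io.BytesIO()
reply = play_stream([({"type": "audio-start"}, b""),
                     ({"type": "audio-chunk"}, b"\x01\x02"),
                     ({"type": "audio-stop"}, b"")], sink)
```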