A peer-to-peer protocol for voice assistants (basically JSONL + PCM audio)
```
{ "type": "...", "data": { ... }, "data_length": ..., "payload_length": ... }\n
<data_length bytes (optional)>
<payload_length bytes (optional)>
```
Used in Rhasspy and Home Assistant for communication with voice services.
This is an open standard of the Open Home Foundation.
## Format

1. A JSON object header as a single line ending with `\n` (UTF-8, required)
    * `type` - event type (string, required)
    * `data` - event data (object, optional)
    * `data_length` - bytes of additional data (int, optional)
    * `payload_length` - bytes of binary payload (int, optional)
2. Additional `data` (UTF-8, optional)
    * Exactly `data_length` bytes long
    * Omitted if no additional data
3. Payload (optional)
    * Exactly `payload_length` bytes long

## Event Types

Available events with `type` and `data` fields.
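All events share the framing above. A minimal Python sketch of reading and writing that framing over a binary stream (this is an illustration under the stated format, not the official `wyoming` library; the merge of additional data over the header's `data` is an assumption):

```python
import io
import json


def write_event(fp, event_type, data=None, payload=b""):
    """Write one event: a JSON header line, then the optional binary payload."""
    header = {"type": event_type}
    if data:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    fp.write(json.dumps(header).encode("utf-8") + b"\n")
    if payload:
        fp.write(payload)


def read_event(fp):
    """Read one event; returns (type, data, payload) or None at end of stream."""
    line = fp.readline()
    if not line:
        return None
    header = json.loads(line)
    data = header.get("data") or {}
    data_length = header.get("data_length")
    if data_length:
        # Additional data: a JSON object of exactly data_length bytes,
        # merged here over the header's "data" (an assumption about precedence).
        data = {**data, **json.loads(fp.read(data_length))}
    payload = fp.read(header.get("payload_length", 0))
    return header["type"], data, payload


# Round trip over an in-memory stream:
buf = io.BytesIO()
write_event(buf, "audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, b"\x00\x01" * 160)
buf.seek(0)
event_type, data, payload = read_event(buf)
```

In practice `fp` would be a socket's file object rather than a `BytesIO` buffer.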
### Audio

Send raw audio and indicate begin/end of audio streams.

* `audio-chunk` - chunk of raw PCM audio
    * `rate` - sample rate in hertz (int, required)
    * `width` - sample width in bytes (int, required)
    * `channels` - number of channels (int, required)
    * `timestamp` - timestamp of audio chunk in milliseconds (int, optional)
* `audio-start` - start of an audio stream
    * `rate` - sample rate in hertz (int, required)
    * `width` - sample width in bytes (int, required)
    * `channels` - number of channels (int, required)
    * `timestamp` - timestamp in milliseconds (int, optional)
* `audio-stop` - end of an audio stream
    * `timestamp` - timestamp in milliseconds (int, optional)

### Info

Describe available services.
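For example, a server exposing a single speech-recognition model might answer a `describe` request with an `info` event like the following sketch (the service, model, and author names are made up; the fields are detailed below):

```python
import json

# Hypothetical `info` response advertising one installed ASR model.
info_event = {
    "type": "info",
    "data": {
        "asr": [
            {
                "models": [
                    {
                        "name": "example-en-small",   # unique name (required)
                        "languages": ["en"],          # supported languages (required)
                        "attribution": {              # required
                            "name": "Example Author",
                            "url": "https://example.com",
                        },
                        "installed": True,                       # required
                        "description": "A small English model",  # optional
                        "version": "1.0",                        # optional
                    }
                ]
            }
        ]
    },
}

# On the wire this is a single JSON line terminated by a newline:
header_line = json.dumps(info_event).encode("utf-8") + b"\n"
```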
* `describe` - request for available voice services
* `info` - response describing available voice services
    * `asr` - list speech recognition services (optional)
        * `models` - list of available models (required)
            * `name` - unique name (required)
            * `languages` - supported languages by model (list of string, required)
            * `attribution` (required)
                * `name` - name of creator (required)
                * `url` - URL of creator (required)
            * `installed` - true if currently installed (bool, required)
            * `description` - human-readable description (string, optional)
            * `version` - version of the model (string, optional)
    * `tts` - list text to speech services (optional)
        * `models` - list of available models
            * `name` - unique name (required)
            * `languages` - supported languages by model (list of string, required)
            * `speakers` - list of speakers (optional)
                * `name` - unique name of speaker (required)
            * `attribution` (required)
                * `name` - name of creator (required)
                * `url` - URL of creator (required)
            * `installed` - true if currently installed (bool, required)
            * `description` - human-readable description (string, optional)
            * `version` - version of the model (string, optional)
    * `wake` - list wake word detection services (optional)
        * `models` - list of available models (required)
            * `name` - unique name (required)
            * `languages` - supported languages by model (list of string, required)
            * `attribution` (required)
                * `name` - name of creator (required)
                * `url` - URL of creator (required)
            * `installed` - true if currently installed (bool, required)
            * `description` - human-readable description (string, optional)
            * `version` - version of the model (string, optional)
    * `handle` - list intent handling services (optional)
        * `models` - list of available models (required)
            * `name` - unique name (required)
            * `languages` - supported languages by model (list of string, required)
            * `attribution` (required)
                * `name` - name of creator (required)
                * `url` - URL of creator (required)
            * `installed` - true if currently installed (bool, required)
            * `description` - human-readable description (string, optional)
            * `version` - version of the model (string, optional)
    * `intent` - list intent recognition services (optional)
        * `models` - list of available models (required)
            * `name` - unique name (required)
            * `languages` - supported languages by model (list of string, required)
            * `attribution` (required)
                * `name` - name of creator (required)
                * `url` - URL of creator (required)
            * `installed` - true if currently installed (bool, required)
            * `description` - human-readable description (string, optional)
            * `version` - version of the model (string, optional)
    * `satellite` - information about voice satellite (optional)
        * `area` - name of area where satellite is located (string, optional)
        * `has_vad` - true if the end of voice commands will be detected locally (boolean, optional)
        * `active_wake_words` - list of wake words that are actively being listened for (list of string, optional)
        * `max_active_wake_words` - maximum number of local wake words that can be run simultaneously (number, optional)
        * `supports_trigger` - true if satellite supports remotely-triggered pipelines (bool, optional)
    * `mic` - list of audio input services (optional)
        * `mic_format` - audio input format (required)
            * `rate` - sample rate in hertz (int, required)
            * `width` - sample width in bytes (int, required)
            * `channels` - number of channels (int, required)
    * `snd` - list of audio output services (optional)
        * `snd_format` - audio output format (required)
            * `rate` - sample rate in hertz (int, required)
            * `width` - sample width in bytes (int, required)
            * `channels` - number of channels (int, required)

### Speech Recognition

Transcribe audio into text.
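Transcription requests stream audio as `audio-chunk` events. A sketch of splitting raw PCM into chunk events with running timestamps (the chunk size here is an arbitrary choice, not part of the protocol):

```python
import json


def pcm_to_chunk_events(pcm, rate=16000, width=2, channels=1, samples_per_chunk=1024):
    """Yield (header_bytes, payload) pairs framing raw PCM as audio-chunk events."""
    step = samples_per_chunk * width * channels
    timestamp_ms = 0
    for start in range(0, len(pcm), step):
        payload = pcm[start:start + step]
        header = {
            "type": "audio-chunk",
            "data": {
                "rate": rate,
                "width": width,
                "channels": channels,
                "timestamp": timestamp_ms,  # milliseconds since stream start
            },
            "payload_length": len(payload),
        }
        yield json.dumps(header).encode("utf-8") + b"\n", payload
        # Advance the timestamp by the duration of this chunk.
        timestamp_ms += (len(payload) // (width * channels)) * 1000 // rate


# One second of 16 kHz, 16-bit mono silence -> 32000 bytes of PCM:
events = list(pcm_to_chunk_events(b"\x00" * 32000))
```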
* `transcribe` - request to transcribe an audio stream
    * `name` - name of model to use (string, optional)
    * `language` - language of spoken audio (string, optional)
    * `context` - context from previous interactions (object, optional)
* `transcript` - response with transcription
    * `text` - text transcription of spoken audio (string, required)
    * `context` - context for next interaction (object, optional)

### Text to Speech

Synthesize audio from text.
* `synthesize` - request to generate audio from text
    * `text` - text to speak (string, required)
    * `voice` - use a specific voice (optional)
        * `name` - name of voice (string, optional)
        * `language` - language of voice (string, optional)
        * `speaker` - speaker of voice (string, optional)

### Wake Word

Detect wake words in an audio stream.
* `detect` - request detection of specific wake word(s)
    * `names` - wake word names to detect (list of string, optional)
* `detection` - response when detection occurs
    * `name` - name of wake word that was detected (string, optional)
    * `timestamp` - timestamp of audio chunk in milliseconds when detection occurred (int, optional)
* `not-detected` - response when audio stream ends without a detection

### Voice Activity Detection

Detects speech and silence in an audio stream.
* `voice-started` - user has started speaking
    * `timestamp` - timestamp of audio chunk when speaking started in milliseconds (int, optional)
* `voice-stopped` - user has stopped speaking
    * `timestamp` - timestamp of audio chunk when speaking stopped in milliseconds (int, optional)

### Intent Recognition

Recognizes intents from text.
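A sketch of a recognition exchange: the request, and one plausible response (the intent name, entity name, and user-facing text are invented for illustration):

```python
import json

# Request: recognize an intent from text.
recognize_event = {
    "type": "recognize",
    "data": {"text": "turn on the kitchen lights"},
}

# A plausible successful response (hypothetical intent/entity names):
intent_event = {
    "type": "intent",
    "data": {
        "name": "TurnOnLights",
        "entities": [{"name": "area", "value": "kitchen"}],
        "text": "Turned on the kitchen lights.",  # response for user
    },
}

request_line = json.dumps(recognize_event).encode("utf-8") + b"\n"
entity_values = {e["name"]: e.get("value") for e in intent_event["data"]["entities"]}
```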
* `recognize` - request to recognize an intent from text
    * `text` - text to recognize (string, required)
    * `context` - context from previous interactions (object, optional)
* `intent` - response with recognized intent
    * `name` - name of intent (string, required)
    * `entities` - list of entities (optional)
        * `name` - name of entity (string, required)
        * `value` - value of entity (any, optional)
    * `text` - response for user (string, optional)
    * `context` - context for next interactions (object, optional)
* `not-recognized` - response indicating no intent was recognized
    * `text` - response for user (string, optional)
    * `context` - context for next interactions (object, optional)

### Intent Handling

Handle structured intents or text directly.
* `handled` - response when intent was successfully handled
    * `text` - response for user (string, optional)
    * `context` - context for next interactions (object, optional)
* `not-handled` - response when intent was not handled
    * `text` - response for user (string, optional)
    * `context` - context for next interactions (object, optional)

### Audio Output

Play audio stream.
* `played` - response when audio finishes playing

### Voice Satellite

Control of one or more remote voice satellites connected to a central server.

* `run-satellite` - informs satellite that server is ready to run pipelines
* `pause-satellite` - informs satellite that server is not ready anymore to run pipelines
* `satellite-connected` - satellite has connected to the server
* `satellite-disconnected` - satellite has been disconnected from the server
* `streaming-started` - satellite has started streaming audio to the server
* `streaming-stopped` - satellite has stopped streaming audio to the server

Pipelines are run on the server, but can be triggered remotely from the server as well.
* `run-pipeline` - runs a pipeline on the server or asks the satellite to run it when possible
    * `start_stage` - pipeline stage to start at (string, required)
    * `end_stage` - pipeline stage to end at (string, required)
    * `wake_word_name` - name of detected wake word that started this pipeline (string, optional)
    * `wake_word_names` - names of wake words to listen for (list of string, optional)
        * `start_stage` must be "wake"
    * `announce_text` - text to speak on the satellite
        * `start_stage` must be "tts"
    * `restart_on_end` - true if the server should re-run the pipeline after it ends (boolean, default is false)
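For instance, a server could ask a satellite to run a full pipeline from wake-word detection through text-to-speech; a sketch of such an event (the wake word name is illustrative):

```python
import json

# Hypothetical run-pipeline event from server to satellite.
run_pipeline_event = {
    "type": "run-pipeline",
    "data": {
        "start_stage": "wake",           # must be "wake" to use wake_word_names
        "end_stage": "tts",
        "wake_word_names": ["ok_nabu"],  # illustrative wake word name
        "restart_on_end": True,          # re-run the pipeline after it ends
    },
}
line = json.dumps(run_pipeline_event).encode("utf-8") + b"\n"
```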
### Timers

* `timer-started` - a new timer has started
    * `id` - unique id of timer (string, required)
    * `total_seconds` - number of seconds the timer should run for (int, required)
    * `name` - user-provided name for timer (string, optional)
    * `start_hours` - hours the timer should run for as spoken by user (int, optional)
    * `start_minutes` - minutes the timer should run for as spoken by user (int, optional)
    * `start_seconds` - seconds the timer should run for as spoken by user (int, optional)
    * `command` - optional command that the server will execute when the timer is finished
        * `text` - text of command to execute (string, required)
        * `language` - language of the command (string, optional)
* `timer-updated` - timer has been paused/resumed or time has been added/removed
    * `id` - unique id of timer (string, required)
    * `is_active` - true if timer is running, false if paused (bool, required)
    * `total_seconds` - number of seconds that the timer should run for now (int, required)
* `timer-cancelled` - timer was cancelled
    * `id` - unique id of timer (string, required)
* `timer-finished` - timer finished without being cancelled
    * `id` - unique id of timer (string, required)

## Event Flow

* `→` is an event from client to server
* `←` is an event from server to client

### Service Description

1. `→ describe` (required)
2. `← info` (required)

### Speech to Text

1. `→ transcribe` event with `name` of model to use or `language` (optional)
2. `→ audio-start` (required)
3. `→ audio-chunk` (required)
4. `→ audio-stop` (required)
5. `← transcript`

### Text to Speech

1. `→ synthesize` event with `text` (required)
2. `← audio-start`
3. `← audio-chunk`
4. `← audio-stop`

### Wake Word Detection

1. `→ detect` event with `names` of wake words to detect (optional)
2. `→ audio-start` (required)
3. `→ audio-chunk` (required)
    * Keep sending audio chunks until a `detection` is received
4. `← detection`
5. `→ audio-stop` (optional)
6. `← not-detected`
    * Sent after `audio-stop` if no detections occurred

### Voice Activity Detection

1. `→ audio-chunk` (required)
2. `← voice-started`
3. `← voice-stopped`

### Intent Recognition

1. `→ recognize` (required)
2. `← intent` if successful
3. `← not-recognized` if not successful

### Intent Handling

For structured intents:

1. `→ intent` (required)
2. `← handled` if successful
3. `← not-handled` if not successful

For text only:

1. `→ transcript` with `text` to handle (required)
2. `← handled` if successful
3. `← not-handled` if not successful

### Audio Output

1. `→ audio-start` (required)
2. `→ audio-chunk` (required)
3. `→ audio-stop` (required)
4. `← played`
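As a worked example of these flows, the client side of a speech-to-text exchange can be sketched as the event sequence it sends; in practice these bytes go over a socket and the client then reads events back until a `transcript` arrives (helper names here are illustrative, not the official library):

```python
import json


def encode_event(event_type, data=None, payload=b""):
    """Encode one event as bytes: JSON header line plus optional payload."""
    header = {"type": event_type}
    if data is not None:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + payload


def stt_session_events(pcm, rate=16000, width=2, channels=1, samples_per_chunk=1024):
    """Build the client-side event sequence for the speech-to-text flow:
    transcribe, audio-start, audio-chunk(s), audio-stop."""
    fmt = {"rate": rate, "width": width, "channels": channels}
    events = [encode_event("transcribe"), encode_event("audio-start", fmt)]
    step = samples_per_chunk * width * channels
    for start in range(0, len(pcm), step):
        events.append(encode_event("audio-chunk", fmt, pcm[start:start + step]))
    events.append(encode_event("audio-stop"))
    return events


# 4096 samples of 16-bit mono silence -> 8192 bytes -> 4 chunks of 2048 bytes.
session = stt_session_events(b"\x00" * 8192)
```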