rhasspy / wyoming

Peer-to-peer protocol for voice assistants
MIT License

Text streaming support #5

Open Shulyaka opened 8 months ago

Shulyaka commented 8 months ago

It would be good to support chunked text the same way we support chunked audio. The reason is that LLMs produce text token by token, and when the text is long, we would like to start producing audio via TTS right away instead of waiting for the full response.
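To make the idea concrete, here is a minimal sketch of the buffering a client might do between an LLM and a TTS service: tokens are accumulated and handed on one complete sentence at a time. The function name `sentences_from_tokens` and the sentence-boundary rule are assumptions for illustration only; nothing here is part of the Wyoming protocol or library.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: '.', '!' or '?' followed by whitespace ends a sentence.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream contains them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_tokens = ["Hello", " world", ".", " How", " are", " you", "?", " Fine"]
    for sentence in sentences_from_tokens(llm_tokens):
        print("send to TTS:", sentence)
```

Each yielded sentence could be sent to the TTS service immediately, so audio for the first sentence can start while the LLM is still generating the rest.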

sdetweil commented 8 months ago

Well, I fiddled with mine to do that..

And it needs a redesign. The handler can't wait for the response to each chunk, so it has to spin off an async task to handle the waits and, if there is text, send it somewhere to be consolidated with the prior text, and maybe signal done. On the send side, I don't know what happens if the handler blocks while transcribing. Does it hold up the next block arriving? Is there buffering? So one would have to spin off another async task with a queue to handle the transcribes and sends, and figure out how to align the audio data with all the interim transcribes.
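A rough sketch of that queue-based redesign, using only `asyncio` from the standard library. The names `transcribe_worker`, `run_pipeline`, and `fake_transcribe` are hypothetical, and `fake_transcribe` merely stands in for a real ASR call; this is not how any existing Wyoming server is structured.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

Transcriber = Callable[[bytes], Awaitable[str]]


async def transcribe_worker(
    queue: "asyncio.Queue[Optional[bytes]]",
    transcribe: Transcriber,
    results: List[str],
) -> None:
    """Drain the queue and transcribe chunks without blocking the receiver."""
    while True:
        chunk = await queue.get()
        if chunk is None:  # sentinel: no more audio
            break
        text = await transcribe(chunk)
        if text:  # skip empty interim results
            results.append(text)


async def run_pipeline(chunks: List[bytes], transcribe: Transcriber) -> List[str]:
    queue: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()
    results: List[str] = []
    worker = asyncio.create_task(transcribe_worker(queue, transcribe, results))
    for chunk in chunks:
        # Receiving a chunk only enqueues it, so it never waits on the ASR.
        queue.put_nowait(chunk)
    queue.put_nowait(None)
    await worker
    return results


async def fake_transcribe(chunk: bytes) -> str:
    """Placeholder ASR (assumption): decodes the bytes instead of transcribing."""
    await asyncio.sleep(0)
    return chunk.decode("utf-8")


if __name__ == "__main__":
    print(asyncio.run(run_pipeline([b"test", b"", b"testing"], fake_transcribe)))
```

The receive side and the transcribe side only share the queue, which answers the blocking question for this sketch: a slow transcription delays later results but does not hold up incoming audio. Aligning audio with interim results is not addressed here.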

sdetweil commented 8 months ago

So I modified my ASR to produce interim results on the fly, but as suspected it will take some work to figure out what to do with the audio data.

Currently, for testing, if the transcriber returns text (not '') then I send that back and drop the audio input saved up to that point, effectively starting over. BUT this truncates some of the text response.

The result should be: testing testing testing testing

but I got: test test Washington testing test, with a lot of no-text responses in between:

returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=test
returned text=
returned text=
returned text=
returned text=test
returned text=
returned text=
returned text=Washington
returned text=
returned text=
returned text=testing
returned text=
returned text=
returned text=test
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=

I don't know what my transcriber does under the covers..

synesthesiam commented 7 months ago

I think this could be done with the appropriate start/stop/chunk events. So for ASR/STT response, it could be TranscriptStart, TranscriptStop, TranscriptChunk. This way, the server would be able to differentiate it well from the original Transcript which is the whole thing at once.
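As a sketch of how a receiver might tell the two styles apart, the following uses a plain dataclass rather than the Wyoming library. The event type strings (`transcript-start`, `transcript-chunk`, `transcript-stop`) and the `text` field are assumptions made for illustration; the proposal above only names the events.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical wire names for the proposed events.
TRANSCRIPT = "transcript"
TRANSCRIPT_START = "transcript-start"
TRANSCRIPT_CHUNK = "transcript-chunk"
TRANSCRIPT_STOP = "transcript-stop"


@dataclass
class Event:
    type: str
    data: Dict[str, str] = field(default_factory=dict)


def collect_transcript(
    events: Iterable[Event],
    on_partial: Optional[Callable[[str], None]] = None,
) -> Optional[str]:
    """Return the final text from either a whole or a streamed transcript."""
    parts: List[str] = []
    streaming = False
    for event in events:
        if event.type == TRANSCRIPT:
            # Original behaviour: the whole text arrives at once.
            return event.data.get("text", "")
        if event.type == TRANSCRIPT_START:
            streaming = True
            parts = []
        elif event.type == TRANSCRIPT_CHUNK and streaming:
            text = event.data.get("text", "")
            parts.append(text)
            if on_partial is not None:
                on_partial(text)  # e.g. update a display with interim text
        elif event.type == TRANSCRIPT_STOP and streaming:
            return "".join(parts)
    return None  # stream ended without a complete transcript


if __name__ == "__main__":
    streamed = [
        Event(TRANSCRIPT_START),
        Event(TRANSCRIPT_CHUNK, {"text": "testing "}),
        Event(TRANSCRIPT_CHUNK, {"text": "testing"}),
        Event(TRANSCRIPT_STOP),
    ]
    print(collect_transcript(streamed, on_partial=print))
```

A receiver written this way keeps working with senders that only ever emit the single Transcript event.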

sdetweil commented 7 months ago

Transcript is at the end; Transcribe is at the start.

I think another parameter on Transcribe would indicate that the client is enabled for interim results.

TranscriptChunk implies the client is processing the chunks somehow, but it's streaming from the mic non-stop.

It's unlikely that every client would change.

The current whisper sends the results on audio stop, not transcript, anyhow.

sdetweil commented 7 months ago

But that doesn't tell the client whether the server will send interim results. Currently Transcribe doesn't have a response.

sdetweil commented 7 months ago

Maybe we could use the Info response to Describe to indicate whether the ASR supports intermediate responses.

Then I suppose a new TranscriptChunk event sent out from the ASR would inform the client.
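A small sketch of that negotiation, with the Info payload modelled as a plain dictionary. The `asr` list layout, the `supports_interim_results` key, and the `interim_results` request parameter are all hypothetical names chosen for this example, not fields confirmed by the protocol.

```python
from typing import Any, Dict


def asr_supports_interim(info: Dict[str, Any]) -> bool:
    """Check a (hypothetical) capability flag on any advertised ASR program."""
    for program in info.get("asr", []):
        if program.get("supports_interim_results", False):
            return True
    return False


def build_transcribe_request(info: Dict[str, Any], want_interim: bool) -> Dict[str, Any]:
    """Ask for interim results only when both sides are able to handle them."""
    data: Dict[str, Any] = {}
    if want_interim and asr_supports_interim(info):
        data["interim_results"] = True  # hypothetical Transcribe parameter
    return {"type": "transcribe", "data": data}


if __name__ == "__main__":
    new_server = {"asr": [{"name": "example-asr", "supports_interim_results": True}]}
    old_server = {"asr": [{"name": "example-asr"}]}
    print(build_transcribe_request(new_server, want_interim=True))
    print(build_transcribe_request(old_server, want_interim=True))
```

With this arrangement an unchanged client never sets the parameter and an unchanged server never advertises the flag, so either side can be upgraded on its own.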