Shulyaka opened 10 months ago
Well, I fiddled with mine to do that, and it needs a redesign. You can't wait for the response from a chunk, so you have to spin off an async task to handle the waits; if there is text, send it somewhere to consolidate with the prior text and maybe signal done. And on the send side, I don't know what happens if the handler blocks while transcribing. Does it hold up the next block arriving? Buffering? So one would have to spin off another async task with a queue to handle the transcribes and sends, and figure out how to align the audio data with all the interim transcribes.
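The queue-based redesign described above can be sketched roughly like this. Everything here is hypothetical scaffolding: `fake_transcribe` stands in for a real interim-capable ASR call, and the chunk sizes are made up for illustration.

```python
import asyncio

async def audio_receiver(queue, chunks):
    # Enqueue incoming audio chunks without blocking on transcription,
    # so the next block is never held up by a slow transcriber.
    for chunk in chunks:
        await queue.put(chunk)
    await queue.put(None)  # sentinel: audio stream has ended

async def transcribe_worker(queue, transcribe, results):
    # Consume chunks off the receive path and run the (possibly slow)
    # transcriber on the audio accumulated so far.
    buffered = b""
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        buffered += chunk
        text = transcribe(buffered)  # interim result for audio so far
        if text:
            results.append(text)

async def run():
    queue = asyncio.Queue()
    results = []
    # Stand-in for a real ASR call (hypothetical): pretend we get text
    # once enough audio has accumulated.
    fake_transcribe = lambda audio: "test" if len(audio) >= 4 else ""
    await asyncio.gather(
        audio_receiver(queue, [b"ab", b"cd", b"ef"]),
        transcribe_worker(queue, fake_transcribe, results),
    )
    return results

print(asyncio.run(run()))  # interim results collected by the worker
```

The point of the sketch is only the shape: the receiver never waits on the transcriber, and all the audio-alignment questions live in one place (the worker's `buffered` state).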
So I modified my ASR to do interim results on the fly, but as suspected it will take some work to figure out what to do with the audio data.
Currently, for testing, if the transcriber returns text (not '') then I send that back and drop the audio input saved so far, effectively starting over... BUT this truncates some of the text response.
It should be `testing testing testing testing`, but I got `test test Washington testing test` with a lot of empty responses in between:
```
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=test
returned text=
returned text=
returned text=
returned text=test
returned text=
returned text=
returned text=Washington
returned text=
returned text=
returned text=testing
returned text=
returned text=
returned text=test
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
```
I don't know what my transcriber does under the covers..
I think this could be done with the appropriate start/stop/chunk events. So for the ASR/STT response, it could be `TranscriptStart`, `TranscriptStop`, `TranscriptChunk`. This way, the server would be able to differentiate it well from the original `Transcript`, which is the whole thing at once.
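To make the proposal concrete, here is what the wire format might look like, assuming the usual one-JSON-header-per-line style. The `transcript-start`/`transcript-chunk`/`transcript-stop` type names are the proposal under discussion, not part of the current protocol.

```python
import json

def make_event(event_type, data=None):
    # One JSON "header" line per event (jsonl style).
    return json.dumps({"type": event_type, "data": data or {}})

stream = [
    make_event("transcript-start"),                      # proposed
    make_event("transcript-chunk", {"text": "testing "}),  # proposed
    make_event("transcript-chunk", {"text": "testing"}),   # proposed
    make_event("transcript-stop"),                       # proposed
    # For comparison, the existing all-at-once event:
    make_event("transcript", {"text": "testing testing"}),
]
for line in stream:
    print(line)
```

A client that doesn't understand the new chunk events could simply ignore them and wait for the final `transcript`.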
`Transcript` is at the end; `Transcribe` is at the start.
I think another parameter on the `Transcribe` message would indicate that the client is enabled for interim results.
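Such an opt-in flag might look like this; the field name `interim_results` is purely an illustration, and a server that doesn't know the field could just ignore it.

```python
import json

# Hypothetical: extend the existing Transcribe message with an opt-in
# flag telling the server this client can handle interim results.
transcribe = {
    "type": "transcribe",
    "data": {"language": "en", "interim_results": True},  # proposed field
}
line = json.dumps(transcribe)
print(line)
```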
`TranscriptChunk` implies the client is processing the chunks somehow, but it's streaming from the mic non-stop, and it's unlikely that every client would change. The current whisper sends the results on `AudioStop`, not `Transcript`, anyhow.
But that doesn't tell the client whether the server will send interim results; currently `Transcribe` doesn't have a response.
Maybe we could use the `Describe`/`Info` response to indicate whether the ASR supports intermediate responses. Then, I suppose, a new `TranscriptChunk` event out from the ASR would inform the client.
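A capability flag in the `Info` response might look like the sketch below. The `supports_interim_results` field and the service name are made up for illustration; only `type: info` with per-service metadata mirrors the existing shape.

```python
import json

# Hypothetical: advertise interim-result support in the Info response
# to Describe, alongside the service's existing metadata.
info = {
    "type": "info",
    "data": {
        "asr": [
            {
                "name": "my-asr",                    # illustrative name
                "supports_interim_results": True,    # proposed field
            }
        ]
    },
}
print(json.dumps(info))
```

With this, the client knows at connect time whether to expect chunked transcripts at all.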
It would be good to support chunked text the same way we support chunked audio. The reason is that LLMs produce text token by token, and when the text is big, we would like to start producing the audio via TTS right away instead of waiting.
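One simple way to bridge a token-by-token LLM stream into TTS-sized pieces is to group tokens into sentences and hand each sentence off as soon as it completes. This is only a sketch; the punctuation-based boundary is an assumption, and the token list stands in for a real LLM stream.

```python
def sentence_chunks(tokens):
    # Group a token-by-token stream into sentence-sized chunks so TTS
    # can start speaking before the full response has arrived.
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "!", "?")):
            yield buf
            buf = ""
    if buf:  # flush any trailing partial sentence
        yield buf

# Fake LLM token stream for illustration:
chunks = list(sentence_chunks(["Hello", " world", ".", " How", " are", " you", "?"]))
print(chunks)  # → ['Hello world.', ' How are you?']
```

Each yielded chunk could then be sent as one synthesis request (or one text-chunk event, if the protocol grows one), overlapping TTS with the rest of the LLM generation.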