Closed: JustinBeckwith closed this issue 5 years ago.
Hi all! In our conversational system we use the Google Speech API in streaming mode, as in this example. Partial transcription results arrive very quickly, and we receive many responses (confidence 0). The final speech result usually arrives after about 2 seconds. After that we use Dialogflow to get the user's intent and perform other operations on Datastore, so in total we need 4 or 5 seconds to answer the user. In a phone-based conversation, 4 or 5 seconds is medium-to-high latency, and we risk the user hanging up. As a workaround we set a timeout: if we don't receive a final speech result within 1 second of the last partial transcript, we treat that last partial transcript as the final transcription. Any other ideas for optimizing this?
Thanks Onofrio
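The timeout workaround described above can be sketched as follows. This is a minimal illustration, not part of any API: the class name `FinalResultFallback`, the callback shape, and the 1000 ms window are all assumptions for the sketch.

```javascript
// Sketch of the workaround above: if no final result arrives within a
// window after the last interim transcript, promote that interim
// transcript to a "good enough" final. All names here are illustrative.
class FinalResultFallback {
  constructor(windowMs, onFinal) {
    this.windowMs = windowMs;
    this.onFinal = onFinal;   // called once with the transcript we commit to
    this.timer = null;
    this.done = false;
  }

  // Feed every interim (is_final === false) transcript here.
  onInterim(transcript) {
    if (this.done) return;
    clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      this.done = true;
      this.onFinal(transcript); // promote the last interim after silence
    }, this.windowMs);
  }

  // Feed the real final result here; it wins if it arrives in time.
  onFinalResult(transcript) {
    if (this.done) return;
    clearTimeout(this.timer);
    this.done = true;
    this.onFinal(transcript);
  }
}
```

In the flow described above, `onInterim` would be called from the stream's `data` handler for non-final results, and `onFinalResult` when a result with `isFinal` arrives.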
Using single_utterance and interim_results from https://cloud.google.com/speech-to-text/docs/reference/rpc/google.cloud.speech.v1#streamingrecognitionconfig might help improve this.
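In the Node.js client those options map to camelCase fields on the streaming request. A minimal sketch (the encoding, sample rate, and language values are just example placeholders):

```javascript
// Sketch of a streaming request enabling interim results and
// single-utterance endpointing (Node.js client field names).
const request = {
  config: {
    encoding: 'LINEAR16',     // example value
    sampleRateHertz: 16000,   // example value
    languageCode: 'en-US',    // example value
  },
  // Return tentative hypotheses early (is_final === false):
  interimResults: true,
  // End recognition automatically once speech stops:
  singleUtterance: true,
};

// The request is then passed to the client's streaming call,
// e.g. client.streamingRecognize(request) with @google-cloud/speech.
```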
The confidence is only returned for final results, not interim results. Per: https://cloud.google.com/speech-to-text/docs/reference/rpc/google.cloud.speech.v1#speechrecognitionalternative
Closing now that additional information has been provided!
I don't think this issue should be closed. I've just started working with the streamingRecognize client, and the "isFinal" result arrives after a LONG delay, and only when additional speech is received.
Here is a recent test. I spoke "one two three four ... ... five ... ... bye". Notice the 3-second delay between receiving "three" and "four [final]". The "four [final]" only arrived when I said "five", and the "five [final]" only arrived when I said "bye".
Tue Jan 22 2019 17:40:18 GMT-0500 (Eastern Standard Time) recStream.data: one
Tue Jan 22 2019 17:40:18 GMT-0500 (Eastern Standard Time) recStream.data: one
Tue Jan 22 2019 17:40:19 GMT-0500 (Eastern Standard Time) recStream.data: one to
Tue Jan 22 2019 17:40:19 GMT-0500 (Eastern Standard Time) recStream.data: one
Tue Jan 22 2019 17:40:19 GMT-0500 (Eastern Standard Time) recStream.data: one two three
Tue Jan 22 2019 17:40:20 GMT-0500 (Eastern Standard Time) recStream.data: one two three
Tue Jan 22 2019 17:40:23 GMT-0500 (Eastern Standard Time) recStream.data: one two three four[final]
Tue Jan 22 2019 17:40:24 GMT-0500 (Eastern Standard Time) recStream.data: five
Tue Jan 22 2019 17:40:27 GMT-0500 (Eastern Standard Time) recStream.data: five
Tue Jan 22 2019 17:40:27 GMT-0500 (Eastern Standard Time) recStream.data: five[final]
Tue Jan 22 2019 17:40:27 GMT-0500 (Eastern Standard Time) recStream.data: bye
Looks like adding singleUtterance: true will fix my issue.
I guess I was thinking each "word" was an utterance... but really the whole phrase "1 2 3 4 " is an utterance.
Good news to anyone who was following this before - the long streaming recognize mode (not singleUtterance) is now returning the "isFinal" result MUCH MUCH faster - in fact it seems almost equivalent to the single utterance mode.
When using German ("de-DE") with streaming recognition, it takes about a MINUTE to get an "is_final" result! I have been experimenting with singleUtterance, but it didn't help. Any suggestions? This is not usable for me right now :-(
Switching to "en-US" works perfectly (with singleUtterance=false).
Greetings @dsunjka! Could we trouble you to submit a new issue?
Is it possible to make the silence threshold configurable? For example: if I want Google to trigger isFinal: true after one second of silence, I would just set a single parameter in the config object before initializing streaming recognition. It would look like this:
const request = {
  config: {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode.traditional,
    profanityFilter: false,
    enableWordTimeOffsets: true,
    enableAutomaticPunctuation: false,
    maxAlternatives: 10,
    model: 'command_and_search',
  },
  interimResults: true, // interim results (tentative hypotheses) may be returned as they become available (indicated with the is_final=false flag)
  singleUtterance: true, // whether this request should automatically end after speech is no longer detected; if set, Speech-to-Text detects pauses, silence, or non-speech audio to determine when to end recognition
  silenceThreshold: 2000, // ms -- proposed parameter, does not exist in the current API
};
Hi @MadyAkira, did you manage to solve it?
Hi, I am also facing this same issue. When I set the language to Finnish ("fi-FI"), isFinal comes way too late, often minutes after the speech has ended, but when I switch to en-US, isFinal arrives when it is supposed to. This leads me to believe the issue is related to the engine not being trained enough on European languages and how they end, or something similar.
@amahlaka, finally, I set a timeout to manually end the detect stream after 1500ms.
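That kind of manual cutoff can be sketched like this. The helper name `endAfterIdle` and the timings are illustrative; `stream` stands for whatever duplex stream `streamingRecognize()` returned, and only its `.end()` and `data` event are assumed.

```javascript
// Sketch of manually ending a streaming-recognize stream after a period
// of inactivity, as described above. `stream` is anything with .end()
// and a 'data' event, e.g. the duplex stream from streamingRecognize().
function endAfterIdle(stream, idleMs) {
  let timer = null;
  const reset = () => {
    clearTimeout(timer);
    timer = setTimeout(() => stream.end(), idleMs);
  };
  reset();                          // arm the watchdog immediately
  stream.on('data', reset);         // any result postpones the cutoff
  return () => clearTimeout(timer); // call this to cancel the watchdog
}
```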
Hi all, I am facing this issue in our system, and I found that it depends on the mic library on the client. I tried to recognize Vietnamese (vi-VN) in the following environments:
Only the first one works well, with the following configuration:
recorder.record({
  sampleRate: 16000,
  threshold: 0.5,
  endOnSilence: true,
  silence: '5.0',
}).stream()
Same here when streaming de-DE. I receive some results with stability=0.9 due to the config InterimResults=true. However, even after sending complete silence to the API, I don't receive any result with IsFinal. Setting SingleUtterance=true additionally, I receive a response of type EndOfSingleUtterance. Unfortunately this doesn't include any transcriptions, nor is it useful to use a transcription from the interim results, as they're not complete: words are still missing even though end of speech was detected.
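One way to cope with that situation is to track the latest interim transcript so there is something to fall back on when the end-of-utterance event fires. The sketch below uses simplified response shapes; the factory name `makeUtteranceTracker` and the return convention are assumptions, not part of the API.

```javascript
// Sketch: remember the latest interim transcript so that when an
// END_OF_SINGLE_UTTERANCE event arrives without a final result,
// there is still something to fall back on. Response shapes are
// simplified from the streaming API's responses.
function makeUtteranceTracker() {
  let lastInterim = '';
  let finalTranscript = null;
  return {
    // Returns a transcript once the utterance has ended, else null.
    onResponse(response) {
      if (response.speechEventType === 'END_OF_SINGLE_UTTERANCE') {
        // Endpointer fired; prefer a real final, else the last interim.
        return finalTranscript !== null ? finalTranscript : lastInterim;
      }
      for (const result of response.results || []) {
        const text = (result.alternatives && result.alternatives[0] &&
                      result.alternatives[0].transcript) || '';
        if (result.isFinal) finalTranscript = text;
        else lastInterim = text;
      }
      return null; // keep listening
    },
  };
}
```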
Also same issue here. In my case I use Unity. Like @sibbl said, I added the parameter SingleUtterance=true on StreamingRecognitionConfig, and as he said it has a minor issue: it stops listening after the final flag is recognized. But as you know, that is not an error; it is the documented behavior, per the Google Cloud API documentation for StreamingRecognitionConfig.
I've been able to solve this by manually selecting the model on StreamingRecognitionConfig. Setting model: "latest_long" did the trick for me, even with SingleUtterance: false.
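For reference, the fix described above amounts to setting the model field in the recognition config. A minimal sketch in Node.js client field names; the encoding, sample rate, and language values are just example placeholders:

```javascript
// Sketch of the config change described above: explicitly selecting
// the "latest_long" model in the streaming recognition config.
const request = {
  config: {
    encoding: 'LINEAR16',     // example value
    sampleRateHertz: 16000,   // example value
    languageCode: 'de-DE',    // example value
    model: 'latest_long',     // the change that fixed the delayed isFinal
  },
  interimResults: true,
  singleUtterance: false,     // reportedly works even without single-utterance mode
};
```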
From @wassizafar786 on September 7, 2018 5:39
Hi, this is Wassi. I am facing an issue: I am using a websocket to send the stream to a Node server and receive results, but the Google Cloud Speech API sends back the isFinal result very slowly. Below is my client-side code:
and this is my server-side code:
Please tell me the solution.
Copied from original issue: GoogleCloudPlatform/google-cloud-node#2860