googleapis / nodejs-speech

This repository is deprecated. All of its content and history has been moved to googleapis/google-cloud-node.
https://cloud.google.com/speech/
Apache License 2.0

Google Speech API isFinal response is too slow #163

Closed JustinBeckwith closed 5 years ago

JustinBeckwith commented 5 years ago

From @wassizafar786 on September 7, 2018 5:39

Hi, this is Wassi.

I am facing an issue: I am using a WebSocket to send an audio stream to a Node server and receive results back, but the Google Cloud Speech API sends me back the isFinal result very slowly. Below is my client-side code:

this.speechServerClient = new BinaryClient(environment.speechServerUrl)
    .on('error', this.onerror.bind(this))
    .on('open', () => {
        // Pass the sample rate as a parameter to the server and get a
        // reference to the communication stream.
        this.speechServerStream = this.speechServerClient.createStream({
            type: 'speech',
            sampleRate: this.audioContext.sampleRate
        });
    })
    .on('stream', (serverStream) => {
        serverStream
            .on('data', this.onresult.bind(this))
            .on('error', this.onerror.bind(this))
            .on('close', this.onerror.bind(this));
    });

And this is my server-side code:

var options = {
    config: {
        encoding: 'LINEAR16',
        languageCode: 'en-IN',
        sampleRateHertz: 16000,
    },
    singleUtterance: false,
    interimResults: true,
    verbose: true,
};
var speechClient = new Speech.SpeechClient({
    projectId: environment_1.environment.gCloudProjectId,
    keyFilename: 'myfile.json'
});

var server = new binaryjs.BinaryServer({
    server: httpsServer,
});
server
    .on('error', function (error) { console.log('Server error:' + error); })
    .on('close', function () { console.log('Server closed'); })
    .on('connection', function (client) {
        client
            .on('error', function (error) { console.log('Client error: ' + error); })
            .on('close', function () { console.log('Client closed.'); })
            .on('stream', function (clientStream, meta) {
                console.log('New Client: ' + JSON.stringify(meta));
                if (meta.type === 'speech') {
                    handleSpeechRequest(client, clientStream, meta);
                } else {
                    handleRandomUtteranceRequest(client);
                }
            });
    });
async function handleSpeechRequest(client, clientStream, meta) {
    options.config.sampleRateHertz = meta.sampleRate;
    var speechStream = speechClient.streamingRecognize(options)
        .on('error', function (data) { handleGCSMessage(data, client, speechStream); })
        .on('data', function (data) {
            try {
                handleGCSMessage(data, client, speechStream);
                console.log("Transcription: " + data.results[0].alternatives[0].transcript);
            }
            catch (ex) {
                console.log(ex);
            }
        })
        .on('close', function () { client.close(); });
    clientStream.pipe(speechStream);
}

Please, please tell me the solution.

Copied from original issue: GoogleCloudPlatform/google-cloud-node#2860

onofrioP89 commented 5 years ago

Hi all! In our conversational system we use the Google Speech API in streaming mode, as in this example. Partial transcription results arrive very quickly and we receive many responses (confidence 0), but the final speech result usually arrives about 2 seconds later. After that we use Dialogflow to get the user's intent and perform other operations on Datastore, so in total we need 4 or 5 seconds to answer the user. In a phone-based conversation, 4 or 5 seconds is medium-to-high latency, and we risk the user hanging up the call. As a workaround we set a timeout: if we don't receive a final speech result within 1 second of the last partial transcript, we use the last partial transcript as the final transcription. Does anyone have other ideas for optimizing this?
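
A minimal sketch of that timeout workaround, assuming a Node streaming setup like the ones elsewhere in this thread (recognizeStream and handleFinalTranscript are illustrative names, not from the original post):

let lastTranscript = '';
let finalizeTimer = null;

recognizeStream.on('data', (data) => {
    const result = data.results && data.results[0];
    if (!result || !result.alternatives || !result.alternatives[0]) return;
    lastTranscript = result.alternatives[0].transcript;

    clearTimeout(finalizeTimer);
    if (result.isFinal) {
        handleFinalTranscript(lastTranscript);
    } else {
        // If nothing further arrives within 1 second of this interim
        // result, promote the last interim transcript to the final one.
        finalizeTimer = setTimeout(() => handleFinalTranscript(lastTranscript), 1000);
    }
});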

Thanks Onofrio

nnegrey commented 5 years ago

Using single_utterance and interim_results from https://cloud.google.com/speech-to-text/docs/reference/rpc/google.cloud.speech.v1#streamingrecognitionconfig might help improve this.

The confidence is only returned for final results, not interim results. Per: https://cloud.google.com/speech-to-text/docs/reference/rpc/google.cloud.speech.v1#speechrecognitionalternative
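
For reference, a hedged sketch of how those two flags map onto a Node.js streaming request (the config values are illustrative, not prescribed by this thread):

const speech = require('@google-cloud/speech');
const speechClient = new speech.SpeechClient();

const request = {
    config: {
        encoding: 'LINEAR16',
        sampleRateHertz: 16000,
        languageCode: 'en-US',
    },
    // End recognition automatically once a pause in speech is detected.
    singleUtterance: true,
    // Emit tentative hypotheses as they arrive (flagged is_final = false).
    interimResults: true,
};
const recognizeStream = speechClient.streamingRecognize(request);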

beccasaurus commented 5 years ago

Closing now that additional information has been provided!

kidplug commented 5 years ago

I don't think this issue should be closed. I've just started working with the streamingRecognize client, and the "isFinal" result arrives after a LONG delay, and only when additional speech is received.

Here is a recent test. I spoke "one two three four ... ... five ... ... bye". Notice the 3-second delay between receiving "three" and "four [final]". The "four [final]" only arrived when I said "five", and the "five [final]" arrived when I said "bye".

Tue Jan 22 2019 17:40:18 GMT-0500 (Eastern Standard Time) recStream.data: one
Tue Jan 22 2019 17:40:18 GMT-0500 (Eastern Standard Time) recStream.data: one
Tue Jan 22 2019 17:40:19 GMT-0500 (Eastern Standard Time) recStream.data: one to
Tue Jan 22 2019 17:40:19 GMT-0500 (Eastern Standard Time) recStream.data: one
Tue Jan 22 2019 17:40:19 GMT-0500 (Eastern Standard Time) recStream.data: one two three
Tue Jan 22 2019 17:40:20 GMT-0500 (Eastern Standard Time) recStream.data: one two three
Tue Jan 22 2019 17:40:23 GMT-0500 (Eastern Standard Time) recStream.data: one two three four[final]
Tue Jan 22 2019 17:40:24 GMT-0500 (Eastern Standard Time) recStream.data: five
Tue Jan 22 2019 17:40:27 GMT-0500 (Eastern Standard Time) recStream.data: five
Tue Jan 22 2019 17:40:27 GMT-0500 (Eastern Standard Time) recStream.data: five[final]
Tue Jan 22 2019 17:40:27 GMT-0500 (Eastern Standard Time) recStream.data: bye

kidplug commented 5 years ago

Looks like adding this will fix my issue:

singleUtterance: true,

I guess I was thinking each "word" was an utterance... but really the whole phrase "1 2 3 4 " is an utterance.

kidplug commented 4 years ago

Good news for anyone who was following this before: the long streaming-recognize mode (not singleUtterance) is now returning the "isFinal" result MUCH MUCH faster; in fact, it seems almost equivalent to the single-utterance mode.

dsunjka commented 4 years ago

When using German ("de-DE") with streaming recognition, it takes about a MINUTE to get an "is_final" result! I have been experimenting with singleUtterance, but it didn't help. Any suggestions? This is not usable for me right now :-(

Switching to "en-US" works perfectly (with singleUtterance=false).

JustinBeckwith commented 4 years ago

Greetings @dsunjka! Could we trouble you to submit a new issue?

Merwan1010 commented 3 years ago

Is it possible to make the silence threshold configurable? For example, if I wanted Google to trigger isFinal: true after one second of silence, I would just write a single parameter in the config object before initializing streaming recognition. It would look like this (silenceThreshold being the proposed, not-yet-existing parameter):

const request = {
    config: {
        encoding: encoding,
        sampleRateHertz: sampleRateHertz,
        languageCode: languageCode.traditional,
        profanityFilter: false,
        enableWordTimeOffsets: true,
        enableAutomaticPunctuation: false,
        maxAlternatives: 10,
        model: 'command_and_search',
    },
    // Interim results (tentative hypotheses) are returned as they become
    // available, flagged with is_final = false.
    interimResults: true,
    // End the request automatically once speech is no longer detected;
    // Speech-to-Text detects pauses, silence, or non-speech audio.
    singleUtterance: true,
    // Proposed parameter (does not exist in the API today).
    silenceThreshold: 2000, // ms
};

staffik commented 3 years ago

Hi @MadyAkira, did you manage to solve it?

amahlaka commented 3 years ago

Hi, I am also facing this same issue. When I set the language to Finnish (fi-FI), the isFinal comes way too late, often minutes after the speech has ended, but when I switch to en-US, the isFinal arrives when it is supposed to. This leads me to believe the issue is related to the engine not being trained enough on European languages and how they end, or something similar.

staffik commented 3 years ago

@amahlaka, in the end I set a timeout to manually end the detection stream after 1500 ms.
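
A rough sketch of one way to implement that, assuming a Node streaming setup (recognizeStream is an illustrative name): end the stream 1500 ms after the most recent interim result, which makes the API flush whatever final transcript it has.

let endTimer = null;

recognizeStream.on('data', (data) => {
    clearTimeout(endTimer);
    const isFinal = data.results && data.results[0] && data.results[0].isFinal;
    if (!isFinal) {
        // No further activity for 1500 ms after the last interim result:
        // half-close the stream so the API returns a final result.
        endTimer = setTimeout(() => recognizeStream.end(), 1500);
    }
});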

mdifferent commented 3 years ago

Hi all, I am facing this issue in our system, and I found that it depends on the mic library on the client. I tried to recognize Vietnamese (vi-VN) in the following environments:

Only the first one works well, with the following configuration:

recorder.record({
    sampleRate: 16000,
    threshold: 0.5,
    endOnSilence: true,
    silence: '5.0',
}).stream()

sibbl commented 2 years ago

Same here when streaming de-DE. I receive some results with stability=0.9 due to the config InterimResults=true. However, even after sending complete silence to the API, I don't receive any result with IsFinal.

When I additionally set SingleUtterance=true, I receive a response of type EndOfSingleUtterance. Unfortunately this doesn't include any transcription, nor is it useful to take a transcription from the interim results, as they're not complete: words are still missing even though the end of speech was detected.

xoxwgys56 commented 2 years ago

Same issue here. In my case I use Unity. Like @sibbl said, I added the parameter SingleUtterance=true on StreamingRecognitionConfig, and like he said it has a minor issue: it stops listening after it recognizes the final flag.

But as you guys know, it is not an error; this behavior is documented in the Google Cloud API reference for StreamingRecognitionConfig.
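
Since singleUtterance=true is documented to end the stream after one utterance, continuous listening means reopening a stream each time. A hedged sketch in Node terms (restartStream is a hypothetical helper, not something from this thread):

recognizeStream.on('data', (data) => {
    if (data.speechEventType === 'END_OF_SINGLE_UTTERANCE') {
        // The API sends no further results on this stream; open a fresh
        // streamingRecognize request to keep listening.
        restartStream();
    }
});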

amolerca commented 2 years ago

I've been able to solve this by manually selecting the model on StreamingRecognitionConfig. Setting model: "latest_long" did the trick for me, even with SingleUtterance: false.
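
A sketch of what that might look like in a Node streaming request, using a client set up as earlier in the thread (the other config values are illustrative):

const request = {
    config: {
        encoding: 'LINEAR16',
        sampleRateHertz: 16000,
        languageCode: 'de-DE',
        // Pin the model explicitly instead of relying on the default;
        // 'latest_long' targets long-form audio.
        model: 'latest_long',
    },
    interimResults: true,
    singleUtterance: false,
};
const recognizeStream = speechClient.streamingRecognize(request);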