microsoft / cognitive-services-speech-sdk-js

Microsoft Azure Cognitive Services Speech SDK for JavaScript

Too few chunks sent to speech server in throttled / minimized browser #74

Closed LMIT-Dev closed 4 years ago

LMIT-Dev commented 5 years ago

Hi,

we've encountered an issue where a user of a speech-to-text app would not see any recognized data. This happens when the browser is minimized (in throttled mode). As soon as the browser is brought up again, the microphone data is sent to the speech server.

We've tracked the issue down to the class ServiceRecognizerBase and its method "sendAudio". This method creates a function "readAndUploadCycle" that is then scheduled via setTimeout. The timeout interval is calculated to provide a good send pace. However, in a throttled JS execution environment this timeout interval is ignored, and only approximately one chunk per second is sent, resulting in a huge and fast-growing delay.

We believe this method should actually send multiple chunks per cycle to account for throttled JS execution.
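A minimal sketch of what we mean (not SDK code; `makeSender`, `pendingChunks`, and `send` are hypothetical stand-ins for the SDK's internal buffer and `connection.send`): each timer tick drains the whole backlog instead of sending a single chunk, so even a once-per-second throttled tick keeps up.

```javascript
// Hypothetical sketch: drain every buffered chunk per timer tick, so a
// throttled timer (~1 tick/sec) still transmits all pending audio.
function makeSender(send) {
  const pendingChunks = [];
  function drain() {
    // Send everything that accumulated since the last tick, not just one chunk.
    while (pendingChunks.length > 0) {
      send(pendingChunks.shift());
    }
  }
  return {
    push(chunk) { pendingChunks.push(chunk); },
    drain,
  };
}
```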

We hope this information is sufficient and would of course like to see an improvement of that logic soon.

For now we've asked our users to keep the browser window open in the background and not minimize it.

Anyway, we would also like to take the opportunity to thank you for the great work and offer our assistance if needed / desired. We are considering creating a fork to implement logic suited for a throttled environment if you feel that this issue isn't worthy of improvement or are too busy for a near-future investigation.

Thank you in advance for taking the time to consider this issue. If possible, please respond as soon as you can.

LMIT-Dev commented 5 years ago

To resolve this:

                                    setTimeout(function () {
                                        lastSendTime = Date.now();
                                        readAndUploadCycle();
                                    }, delay_1);

=>

                                    if (audioStreamNode.privBuffers.length == 1) {
                                        setTimeout(function () {
                                            lastSendTime = Date.now();
                                            readAndUploadCycle();
                                        }, delay_1);
                                    } else {
                                        new Promise(readAndUploadCycle);
                                    }
rhurey commented 5 years ago

Well, that's not good. Thanks a lot for narrowing down the problem.

I was thinking about solutions over the last day, and there are some complications in most of them.

If we just looked at the buffers to see if there's pending data and always sent it, the transmission rate would be excessively high for all faster-than-real-time sources. We throttle transmission to ~2x real-time because the service side has maximum processing rates to manage resource usage. If a user created a PushStream and dropped 1 MB of data into it from a file, we slow transmission and try to ensure that the data is sent over the next ~15 seconds (16-bit, 16 kHz, mono...). (This isn't an uncommon scenario for users on node.js.)
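As a back-of-the-envelope check of the "~15 seconds" figure (assumed numbers, not SDK code): 16-bit, 16 kHz, mono audio is 2 * 16000 = 32000 bytes per second of speech, and transmission is throttled to ~2x real-time.

```javascript
// Rough arithmetic only; assumes 16-bit, 16 kHz, mono PCM.
const avgBytesPerSec = 2 * 16000;               // 32000 bytes/sec of audio
const pushedBytes = 1024 * 1024;                // 1 MB dropped into a PushStream
// At ~2x real-time, the send rate is twice avgBytesPerSec:
const secondsToSend = pushedBytes / (2 * avgBytesPerSec);
console.log(secondsToSend.toFixed(1));          // "16.4" -- on the order of ~15 s
```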

I was hoping to come up with a smaller fix where response latency from speech would not be affected, but if it's a minimized browser I assume a higher latency would be acceptable.

One option would be to look at the audio source and determine if it's a real-time source. If it is, push all data available every cycle. privAudioSource.deviceInfo.type contains this information as a Promise result.

Another option would be to check each interval whether we've fallen behind sending audio and, if so, send unthrottled to catch up. The service would see bursty audio, but (in this case) at one-second intervals, which should be OK. This would also help insulate against developers who do expensive things in event handlers and slow down processing that way.
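That second option could look roughly like this (a sketch under assumed names; `computeSendDelay` is not an SDK function): track how many bytes should have been sent at real-time pace, and drop the delay to zero whenever we are behind.

```javascript
// Hypothetical sketch of the catch-up idea, not SDK code.
function computeSendDelay(startTime, bytesSent, chunkBytes, avgBytesPerSec, now) {
  const elapsedSec = (now - startTime) / 1000;
  const expectedBytes = elapsedSec * avgBytesPerSec;  // real-time pace
  if (bytesSent < expectedBytes) {
    return 0;                                         // behind: send unthrottled to catch up
  }
  // Caught up: pace the next chunk at ~2x real-time.
  return ((chunkBytes / avgBytesPerSec) / 2) * 1000;
}
```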

The other ideas I had all seemed like major surgery.

We'd gladly accept some help with fixing this if you have the bandwidth to do so. Due to other commitments and schedules, our timeline for a fix would likely not be fast.

InDieTasten commented 5 years ago

Well, sending one second's worth of chunks every second shouldn't exceed rate limits on the server side. As far as I can tell from the initial description, the current implementation will spam the servers with almost all chunks recorded since the window was last maximized, which is more likely to cause load issues on the server side.

rhurey commented 5 years ago

I agree one-second bursts should be fine.

Since setTimeout is used to schedule the next audio frame send, and if one fires every second, I'd assumed that the audio was backing up in the IAudioSource buffer and transmitted at 2x real-time when the browser throttle was removed, but admittedly I didn't experiment.

LMIT-Dev commented 5 years ago

Thanks for getting back to us.

For now, I did the following to make sure not to stream directly:

In the StreamReader ctor I introduced thresholds to determine if the buffer should be emptied. The chunk returned by the read method carries the request as a flag:

    function StreamReader(streamId, readerQueue, onClose) {
        var _this = this;
        this.stopEmptyingBufferThreshold = 16;
        this.requestEmptyingBufferThreshold = 64;
        this.requestEmptyingBuffer = false;
        this.privIsClosed = false;
        this.read = function () {
            if (_this.isClosed) {
                throw new Error_1.InvalidOperationError("StreamReader closed");
            }
            return _this.privReaderQueue
                .dequeue()
                .onSuccessContinueWith(function (streamChunk) {
                if (streamChunk.isEnd) {
                    _this.privReaderQueue.dispose("End of stream reached");
                }

                if (_this.privReaderQueue != null && _this.privReaderQueue.length() > _this.requestEmptyingBufferThreshold) {
                    _this.requestEmptyingBuffer = true;
                }
                if (_this.requestEmptyingBuffer && (_this.privReaderQueue == null || _this.privReaderQueue.length() <= _this.stopEmptyingBufferThreshold)) {
                    _this.requestEmptyingBuffer = false;
                }

                streamChunk.emptyBufferRequested = _this.requestEmptyingBuffer;
                return streamChunk;
            });
        };
        this.close = function () {
            if (!_this.privIsClosed) {
                _this.privIsClosed = true;
                _this.privReaderQueue.dispose("StreamReader closed");
                _this.privOnClose();
            }
        };
        this.privReaderQueue = readerQueue;
        this.privOnClose = onClose;
        this.privStreamId = streamId;
    }

In ServiceRecognizerBase.sendAudio I rewrote the readAndUploadCycle to react to the emptyBufferRequested property on the chunk:

                // If speech is done, stop sending audio.
                if (!_this.privIsDisposed && !_this.privRequestSession.isSpeechEnded && _this.privRequestSession.isRecognizing) {
                    _this.fetchConnection().on(function (connection) {
                        audioStreamNode.read().on(function (audioStreamChunk) {
                            // we have a new audio chunk to upload.
                            if (_this.privRequestSession.isSpeechEnded) {
                                // If service already recognized audio end then dont send any more audio
                                deferred.resolve(true);
                                return;
                            }

                            var payload = (audioStreamChunk.isEnd) ? null : audioStreamChunk.buffer;
                            var uploaded = connection.send(new SpeechConnectionMessage_Internal_1.SpeechConnectionMessage(Exports_2.MessageType.Binary, "audio", _this.privRequestSession.requestId, null, payload));
                            if (audioStreamChunk.isEnd) {
                                // the audio stream has been closed, no need to schedule next
                                // read-upload cycle.
                                _this.privRequestSession.onSpeechEnded();
                                deferred.resolve(true);
                            }
                            else {
                                if (audioStreamChunk.emptyBufferRequested === true) {
                                    uploaded.continueWith(function (_) {
                                        lastSendTime = Date.now();
                                        new Promise(function (resolve) {
                                            readAndUploadCycle();
                                            resolve();
                                        });
                                    });
                                } else {
                                    uploaded.continueWith(function (_) {
                                        var minSendTime = ((payload.byteLength / audioFormat.avgBytesPerSec) / 2) * 1000;
                                        var delay_1 = Math.max(64, (lastSendTime - Date.now() + minSendTime));

                                        setTimeout(function () {
                                            lastSendTime = Date.now();
                                            readAndUploadCycle();
                                        }, delay_1);
                                    });
                                }
                            }
                        }, function (error) {
                            if (_this.privRequestSession.isSpeechEnded) {
                                // For whatever reason, Reject is used to remove queue subscribers inside
                                // the Queue.DrainAndDispose invoked from DetachAudioNode down below, which
                                // means that sometimes things can be rejected in normal circumstances, without
                                // any errors.
                            deferred.resolve(true); // TODO: remove the argument, it's completely meaningless.
                            }
                            else {
                                // Only reject, if there was a proper error.
                                deferred.reject(error);
                            }
                        });
                    }, function (error) {
                        deferred.reject(error);
                    });
                }

So when the reader has more than requestEmptyingBufferThreshold chunks available, it will set the flag on the next chunk, and the read-and-upload cycle will trigger without a timeout.

Of course the thresholds should be configurable.
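The two thresholds form a hysteresis band: draining starts when the queue grows past the upper threshold and stops only once it falls back to the lower one, so the flag doesn't flap on every chunk. A minimal stand-alone sketch of that logic (illustrative only; `makeDrainFlag` is not part of the SDK):

```javascript
// Hysteresis sketch mirroring stopEmptyingBufferThreshold (low water) and
// requestEmptyingBufferThreshold (high water); not SDK code.
function makeDrainFlag(lowWater, highWater) {
  let draining = false;
  return function update(queueLength) {
    if (queueLength > highWater) {
      draining = true;             // far behind: request emptying the buffer
    } else if (draining && queueLength <= lowWater) {
      draining = false;            // caught up: resume throttled sending
    }
    return draining;
  };
}
```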

What do you think?

rhurey commented 5 years ago

How does that work if the buffer has 3MB of data in it? How long does it take to transmit?

If I'm reading it right, it'll try to transmit most of the data immediately. Which would be great for a microphone that's gotten behind, but would be incorrect for a file.

I'd thought of something like

// The promise should have been evaluated already when telemetry was sent, so it should just return the cached result.
IsRealTimeAudioSource(this.privAudioSource).onSuccessContinueWith((realTime: boolean) => {
  if (realTime) {
    uploaded.continueWith(function (_) {
      lastSendTime = Date.now();
      new Promise(function (resolve) {
        readAndUploadCycle();
        resolve();
      });
    });
  } else {
    uploaded.continueWith(function (_) {
      var minSendTime = ((payload.byteLength / audioFormat.avgBytesPerSec) / 2) * 1000;
      var delay_1 = Math.max(64, (lastSendTime - Date.now() + minSendTime));

      setTimeout(function () {
        lastSendTime = Date.now();
        readAndUploadCycle();
      }, delay_1);
    });
  }
});

IsRealTimeAudioSource(audioSource: IAudioSource): Promise<boolean> {
  return audioSource.deviceInfo.onSuccessContinueWith((deviceInfo: IAudioDeviceInfo) => {
    // Right now the only real-time device we have
    return deviceInfo.type === microphone;
  });
}
LMIT-Dev commented 5 years ago

Yes, I see your point, I wasn't thinking about files at all.

Just to clarify, it IS okay to send data immediately IF the source is a real-time device and not prerecorded / faster than real-time? Because your posts kind of made me think that there could be issues server-side, which is why I thought transmitting the data in packages of chunks would be a good idea.

If packaging the chunks is desired, I think we could merge your changes and mine. To be honest though, I don't really like how the stream reader would sort of control the data flow. Otherwise, I guess your solution is all that's needed.

I am just unclear on one thing: will StreamReader.read only return chunks that are filled completely, or could your code (and my first) lead to sending data of less than a chunk's size? Would that be an issue?

rhurey commented 5 years ago

The server-side concerns aren't a yes / no problem, more like a gradient where the extremes' bad outcomes can be predicted and the midpoint is a bit more difficult to define. At the rate of 1/8th real-time (what the browser throttling is yielding) the service stalls and spends a lot of time (with allocated resources) waiting on data. At the rate of unlimited transmission, the service has to queue and store information that's arriving faster than it can be processed.

Being bursty in the 1s range would have more negative effects on user perception of responsiveness than anything else. ("I said something over a second ago and it still isn't showing up")

There's already a chunking capability in the ChunkedArrayBufferStream that segments the audio data into chunks of the preferred size. Both the MicAudioSource and the PushStream use this. (PullStream has the same effect by continuing to query until it gets a filled buffer.) So transmission wouldn't be exactly real-time so much as "as soon as a transmit buffer is filled".

We collect the audio data in chunks (4 KB right now), which tries to strike a balance between network efficiency and responsiveness to user speech. If everything went out as received, at least one microphone implementation was handing off 88-byte audio frames; sending audio at that size would result in a bad header / overhead to data ratio. Sending data every 1 s (~32 KB) would be more efficient, but the user would be frustrated with the response latency. Every 1/10th to 1/8th of a second is the target transmission frequency to balance those things.
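Those targets line up with the chunk size (a quick check with assumed 16-bit, 16 kHz, mono numbers, not SDK code):

```javascript
// 16-bit, 16 kHz, mono PCM: 32000 bytes per second of audio.
const avgBytesPerSec = 2 * 16000;
const chunkMs = (4096 / avgBytesPerSec) * 1000;  // 128 ms per 4 KB chunk (~1/8 s)
const frameMs = (88 / avgBytesPerSec) * 1000;    // 2.75 ms per 88-byte mic frame
console.log(chunkMs, frameMs);                   // 128 2.75
```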

LMIT-Dev commented 5 years ago

Hey,

I wanted to let you know that the proposed solution works. We had issues with the connection breaking after approximately an hour; that's why we weren't sure if the solution works. But it seems we are having a different issue.

(the WebSocket connection is lost although the token is updated on the SpeechRecognizer and SpeechConfig every 9 minutes, but on the fifth update, the SDK tries to create a new WebSocket and there the token is rejected with 'HTTP Authentication failed; no valid credentials available')

This is why it took so long to confirm the solution.

Thx for the help

This is the readAndUploadCycle function we are using:

            var readAndUploadCycle = function () {
                // If speech is done, stop sending audio.
                if (!_this.privIsDisposed && !_this.privRequestSession.isSpeechEnded && _this.privRequestSession.isRecognizing) {
                    _this.fetchConnection().on(function (connection) {
                        audioStreamNode.read().on(function (audioStreamChunk) {
                            // we have a new audio chunk to upload.
                            if (_this.privRequestSession.isSpeechEnded) {
                                // If service already recognized audio end then don't send any more audio
                                deferred.resolve(true);
                                return;
                            }
                            var payload;
                            var sendDelay;
                            if (audioStreamChunk.isEnd) {
                                payload = null;
                                sendDelay = 0;
                            }
                            else {
                                payload = audioStreamChunk.buffer;
                                _this.privRequestSession.onAudioSent(payload.byteLength);
                                if (maxSendUnthrottledBytes >= _this.privRequestSession.bytesSent) {
                                    sendDelay = 0;
                                }
                                else {
                                    sendDelay = Math.max(0, nextSendTime - Date.now());
                                }
                            }

                            _this.IsRealTimeAudioSource(_this.privAudioSource).onSuccessContinueWith(function (isRealTimeAudioSource) {
                                if (payload !== null) {
                                    nextSendTime = Date.now() + (payload.byteLength * 1000 / (audioFormat.avgBytesPerSec * 2));
                                }
                                var uploaded = connection.send(new SpeechConnectionMessage_Internal_1.SpeechConnectionMessage(Exports_2.MessageType.Binary, "audio", _this.privRequestSession.requestId, null, payload));

                                if (!audioStreamChunk.isEnd) {
                                    if (isRealTimeAudioSource) {
                                        uploaded.continueWith(function (_) {
                                            new Promise(function (resolve, reject) {
                                                readAndUploadCycle();
                                                resolve();
                                            });
                                        });
                                    } else {
                                        setTimeout(function () {
                                            uploaded.continueWith(function (_) {
                                                // Regardless of success or failure, schedule the next upload.
                                                // If the underlying connection was broken, the next cycle will
                                                // get a new connection and re-transmit missing audio automatically.
                                                readAndUploadCycle();
                                            });
                                        }, sendDelay);
                                    }
                                }
                                else {
                                    // the audio stream has been closed, no need to schedule next
                                    // read-upload cycle.
                                    _this.privRequestSession.onSpeechEnded();
                                    deferred.resolve(true);
                                }
                            });
                        }, function (error) {
                            if (_this.privRequestSession.isSpeechEnded) {
                                // For whatever reason, Reject is used to remove queue subscribers inside
                                // the Queue.DrainAndDispose invoked from DetachAudioNode down below, which
                                // means that sometimes things can be rejected in normal circumstances, without
                                // any errors.
                                deferred.resolve(true); // TODO: remove the argument, it's completely meaningless.
                            }
                            else {
                                // Only reject, if there was a proper error.
                                deferred.reject(error);
                            }
                        });
                    }, function (error) {
                        deferred.reject(error);
                    });
                }
            };

Best Regards

rhurey commented 5 years ago

Great, glad to hear it's working.

Any interest in sending a PR we can take?

LMIT-Dev commented 5 years ago

I'll gladly do that when I find the time to do it in TS... it's kinda busy here right now.

oscholz commented 4 years ago

@LMIT-Dev We are going to close this item. We have entered a bug in our backlog, and will take your code and check it in, and we'll make sure all test cases pass. We apologize this took so long and we'll do better in the future. Once the code is in, we will update this issue to let you know what release will contain the fix. Thank you for using the JS Speech SDK, and THANK YOU for your contribution to make it better!

glharper commented 4 years ago

@LMIT-Dev We've checked in a fix for this issue. Thank you for bringing it to our attention, and writing code to address it. :-)