skit-ai / kaldi-serve

Server framework for Kaldi ASR Toolkit
Apache License 2.0

Bidi streaming proposal end of utterance detection #13

Open seyuf opened 4 years ago

seyuf commented 4 years ago

Hi,

Many thanks for this awesome work! I have a use case derived from my use of the project, and I thought it was worth raising here, as I believe it can be implemented directly on the main branch.

I've already implemented some kind of PoC or v1 here.

The idea would be to add silence/end-of-utterance detection to the server. Today, what I observe is that in bidirectional streaming the server transcribes the stream of messages sent from the client indefinitely, appending the results at each iteration. So if one wants to reset the result, one is forced to kill the connection from the client.

What I made in the above link is kind of similar: I just send an end_of_utterance value from the client side in the audio config message, which tells the server "I'm done, send me the last result and close the connection". I also set an is_final value in the last result message, signalling to the client that this is the last result from the server and that the connection has been closed. Although this works, it is not very satisfying; to me the right thing would be to keep the connection alive and just reset the results when an utterance has ended. I also believe that the server could do the end-of-utterance detection itself, using silence detection.
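To make the PoC behaviour concrete, here is a minimal Python sketch of the client-signalled flow. Only end_of_utterance and is_final come from the description above; the Request and Response classes, handle_stream, and the decode callback are hypothetical stand-ins, not the actual kaldi-serve messages or handlers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Request:
    """Hypothetical stand-in for one streamed client message."""
    audio: bytes = b""
    end_of_utterance: bool = False  # client says: "I'm done"


@dataclass
class Response:
    """Hypothetical stand-in for one streamed server result."""
    transcript: str
    is_final: bool = False  # True only on the last result before closing


def handle_stream(
    requests: Iterable[Request], decode: Callable[[bytes], str]
) -> Iterator[Response]:
    """Append results per chunk; stop as soon as the client signals the end."""
    parts: List[str] = []
    for req in requests:
        if req.audio:
            parts.append(decode(req.audio))
        if req.end_of_utterance:
            # Send the last result, then end the stream (closes the connection).
            yield Response(" ".join(parts), is_final=True)
            return
        yield Response(" ".join(parts))


# Usage with a fake decoder that just turns bytes into text.
requests = [
    Request(b"hi"),
    Request(b"there"),
    Request(end_of_utterance=True),
    Request(b"ignored"),  # never processed: the stream is already closed
]
responses = list(handle_stream(requests, lambda audio: audio.decode()))
```

The limitation discussed above is visible in the `return`: once the final result is sent, the client has to open a new connection for the next utterance.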

The idea would be to consider that we are at the end of an utterance if we receive silent audio for some amount of time or number of iterations (the code seems to be already in place here). So:

  1. The client specifies in the audio config message whether it would like the server to detect the end of utterances (if not, we keep the current behaviour).
  2. The client sends a stream of messages.
  3. After multiple consecutive empty audio decodings, the server decides we're at the end of an utterance.
  4. The server sends back the result (with is_final set to true in the response message).
  5. The server resets its data but keeps the connection alive (or maybe kills it? This could be optional), waiting for new input from the client.
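The steps above could be sketched roughly as follows in Python. This is only an illustration of the proposed logic, not kaldi-serve code: UtteranceSession, detect_end_of_utterance, max_empty_chunks, and stream are hypothetical names, the threshold value is an assumption, and an "empty decoding" is modelled as a chunk whose decoded text is empty.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass
class UtteranceSession:
    """Hypothetical per-connection state for the proposed behaviour."""
    detect_end_of_utterance: bool = False  # step 1: opt-in flag from the config
    max_empty_chunks: int = 3              # assumed threshold of empty decodings
    words: List[str] = field(default_factory=list)
    empty_run: int = 0                     # consecutive empty decodings so far

    def feed(self, chunk_text: str) -> Tuple[str, bool]:
        """Consume one chunk's decoded text; return (transcript, is_final)."""
        text = chunk_text.strip()
        if text:
            self.words.append(text)
            self.empty_run = 0
        else:
            self.empty_run += 1

        transcript = " ".join(self.words)
        # Step 3: enough consecutive empty decodings after some speech.
        is_final = (
            self.detect_end_of_utterance
            and bool(self.words)
            and self.empty_run >= self.max_empty_chunks
        )
        if is_final:
            # Step 5: reset the data but keep the session (connection) alive.
            self.words = []
            self.empty_run = 0
        return transcript, is_final  # step 4: is_final travels with the result


def stream(
    session: UtteranceSession, decoded_chunks: Iterable[str]
) -> Iterator[Tuple[str, bool]]:
    """Step 2: the client keeps sending chunks; one result per chunk."""
    for text in decoded_chunks:
        yield session.feed(text)


chunks = ["hello", "world", "", "", "next", "", ""]
session = UtteranceSession(detect_end_of_utterance=True, max_empty_chunks=2)
results = list(stream(session, chunks))
finals = [transcript for transcript, is_final in results if is_final]
```

With the flag left at its default of False, the same stream never produces a final result and the transcript keeps growing, which matches the current behaviour.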

I hope this is understandable enough. If so, I would like some feedback, if possible.

Regards

lepisma commented 4 years ago

Hey @seyuf, can we reopen this? The feature is something we haven't considered yet, but we would like to have some discussion before closing.

Not guaranteeing a discussion now, but let's keep this open :)

seyuf commented 4 years ago

Hi, sure np.