k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

Support for VAD / endpointing #96

Open shaynemei opened 2 years ago

shaynemei commented 2 years ago

Does the sherpa server support VAD or endpointing?

csukuangfj commented 2 years ago

For the endpointing stuff, can we implement it by counting the number of contiguous trailing frames that are decoded to blank?

shaynemei commented 2 years ago

Counting silence frames sounds like a good idea. Maybe we can make it more robust by adding a state machine with two states (one for silence, one for non-silence). Where in the streaming decoding pipeline should we add this? e.g. Where in the code can I access the blank frames?

csukuangfj commented 2 years ago

> Maybe we can make it more robust by adding a state machine with two states (one for silence, one for non-silence).

I think we can add an attribute num_trailing_silence_frames to Stream:
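A minimal sketch of what that might look like (the real Stream class in sherpa carries much more state, such as the feature extractor and decoder state; this only illustrates the proposed attribute):

```python
class Stream:
    """Illustrative sketch only, not sherpa's actual Stream class."""

    def __init__(self, context_size: int, blank_id: int = 0):
        # The hypothesis starts with context_size blank tokens
        # (see the discussion about context_size below)
        self.hyp = [blank_id] * context_size

        # Number of trailing frames decoded to blank; reset to 0
        # whenever a non-blank token is emitted
        self.num_trailing_silence_frames = 0
```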


> Where in the streaming decoding pipeline should we add this? e.g. Where in the code can I access the blank frames?

There are two places where we can do this.

(1) In Python, which is more straightforward, but the computed count is only approximate

We can change https://github.com/k2-fsa/sherpa/blob/d958043f1680bdef666a97699d2aaa9fcd4c91fa/sherpa/bin/streaming_pruned_transducer_statelessX/beam_search.py#L187

to

if len(stream.hyp) != len(next_hyp_list[i]):
  # At least one new non-blank token was decoded in this chunk
  stream.num_trailing_silence_frames = 0
else:
  # All frames in this chunk were decoded to blank
  stream.num_trailing_silence_frames += int(encoder_out_lens[i])

stream.hyp = next_hyp_list[i]

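Once the counter is maintained, an endpoint can be declared as soon as the trailing silence is long enough. A hypothetical sketch (the function name, the default frame shift, and the threshold are illustrative, not part of sherpa):

```python
def detect_endpoint(num_trailing_silence_frames: int,
                    frame_shift_ms: float = 10.0,
                    min_trailing_silence_ms: float = 500.0) -> bool:
    """Return True if the trailing silence exceeds the threshold.

    A sketch assuming each encoder output frame covers frame_shift_ms
    milliseconds of audio.
    """
    trailing_silence_ms = num_trailing_silence_frames * frame_shift_ms
    return trailing_silence_ms >= min_trailing_silence_ms
```

For example, with 10 ms frames, 60 trailing blank frames correspond to 600 ms of silence and would trigger an endpoint at a 500 ms threshold.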
(2) In C++, it requires more effort but it is accurate. Taking streaming greedy search as an example:

https://github.com/k2-fsa/sherpa/blob/d958043f1680bdef666a97699d2aaa9fcd4c91fa/sherpa/csrc/rnnt_beam_search.cc#L240-L241

Whenever a new non-blank token is decoded, we reset the trailing-silence frame counter for the corresponding stream. We then need to return the counter from C++ to Python.
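In Python terms, the per-frame logic inside greedy search would look roughly like this (a sketch of the idea, not the actual C++ code in rnnt_beam_search.cc):

```python
def greedy_search_step(frame_tokens, hyp, num_trailing_silence, blank_id=0):
    """Process one chunk of per-frame decoding results.

    frame_tokens: the token decoded at each frame of this chunk.
    Returns the updated hypothesis and trailing-silence counter.
    """
    for token in frame_tokens:
        if token == blank_id:
            # Blank frame: extend the trailing silence
            num_trailing_silence += 1
        else:
            # Non-blank token: emit it and reset the counter
            hyp.append(token)
            num_trailing_silence = 0
    return hyp, num_trailing_silence
```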

Note: If you use fast_beam_search, changing the code in Python may be the only feasible way.

danpovey commented 2 years ago

The blank idea seems good to me. We might also want to wait longer if no non-blank symbols have been decoded so far (on the best path). In Kaldi, the endpointing decision combines a few inputs, such as the amount of trailing silence, the total utterance length, and whether anything has been decoded yet.
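A Kaldi-style rule along those lines might be sketched as follows (a hedged illustration; the names and thresholds are made up, not Kaldi's actual configuration values):

```python
def should_endpoint(trailing_silence_frames: int,
                    decoded_something: bool,
                    frame_shift_ms: float = 10.0) -> bool:
    """Combine trailing silence with whether anything was decoded yet."""
    trailing_silence_ms = trailing_silence_frames * frame_shift_ms
    if not decoded_something:
        # Nothing decoded so far: wait much longer before endpointing
        return trailing_silence_ms >= 5000.0
    # Something was decoded: a shorter trailing silence suffices
    return trailing_silence_ms >= 500.0
```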

Incidentally, in the longer term, I think it would be good to train some kind of neural network output that corresponds to an end-of-utterance probability, e.g. something like Google's version of RNN-T, which learns EOS as an extra symbol.

shaynemei commented 2 years ago

Thanks for the suggestion. I have two other questions regarding using the endpoint information:

  1. What is the plan for sending the endpoint to the client? e.g. how will it be incorporated into the hyp info?
  2. Should we have an option to not send out previous results at each endpoint? For example, resetting stream.hyp after each VAD segment is sent out as a final result?

https://github.com/k2-fsa/sherpa/blob/d958043f1680bdef666a97699d2aaa9fcd4c91fa/sherpa/bin/streaming_pruned_transducer_statelessX/beam_search.py#L354

This would be desirable for very long audio input, where we don't want the final result to include everything decoded so far, but instead want to send out each VAD segment as a final result.
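The reset in question 2 might be handled at each endpoint roughly like this (a sketch; finalize_segment is a hypothetical helper, and sp stands for the sentencepiece-style processor used to map token ids to text):

```python
def finalize_segment(stream, sp, context_size: int, blank_id: int = 0):
    """Send the current segment as a final result, then reset the stream.

    Illustrative only: stream is assumed to have .hyp and
    .num_trailing_silence_frames attributes as discussed above.
    """
    # Skip the initial context_size blank tokens (see below)
    final_text = sp.decode(stream.hyp[context_size:])

    # Reset so the next VAD segment starts from a fresh hypothesis
    stream.hyp = [blank_id] * context_size
    stream.num_trailing_silence_frames = 0
    return final_text
```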

Also, I'm confused about that line of code above. I read in the icefall recipes that the context_size means:

> Number of previous words to use to predict the next word. 1 means bigram; 2 means trigram. n means (n+1)-gram.

https://github.com/k2-fsa/icefall/blob/48a6a9a54918874736b9f5b0235c76b63e266807/egs/librispeech/ASR/transducer_stateless/decoder.py#L50-L52

But the stream.hyp slicing in that line seems to get all previous words starting from the context_size position?

csukuangfj commented 2 years ago

> But the stream.hyp slicing in that line seems to get all previous words starting from the context_size position?

We are getting the final results in that line. [context_size:] means to skip the first context_size blank tokens.

csukuangfj commented 2 years ago

https://github.com/k2-fsa/sherpa/blob/d958043f1680bdef666a97699d2aaa9fcd4c91fa/sherpa/csrc/rnnt_beam_search.cc#L73-L77

We are indeed only using the last context_size tokens to predict the next token.

But consider the very beginning: we have to add context_size blank tokens at the start so the decoder always has enough context. That is why we remove them when we get the final results.
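Concretely, here is a toy illustration of the padding and the final slicing (blank_id is assumed to be 0, and the emitted token ids are made up):

```python
context_size = 2
blank_id = 0

# At decoding start, the hypothesis is padded with context_size blanks
# so the stateless decoder always sees context_size tokens of context.
hyp = [blank_id] * context_size

# Suppose greedy search then emits two non-blank tokens
hyp += [13, 42]

# The final result skips the initial blank padding
result = hyp[context_size:]
```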