k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

Support for VAD / endpointing #96

Open shaynemei opened 2 years ago

shaynemei commented 2 years ago

Does the sherpa server support VAD or endpointing?

csukuangfj commented 2 years ago

For the endpointing stuff, can we implement it by counting the number of contiguous trailing frames that are decoded to blank?

shaynemei commented 2 years ago

Counting silence frames sounds like a good idea. Maybe we can make it more robust by adding a state machine with two states (one for silence, one for non-silence). Where in the streaming decoding pipeline should we add this? e.g. Where in the code can I access the blank frames?

csukuangfj commented 2 years ago

> Maybe we can make it more robust by adding a state machine with two states (one for silence, one for non-silence).

I think we can add an attribute num_trailing_silence_frames to Stream:
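A minimal sketch of what that might look like (the real Stream class in sherpa carries much more state, such as the feature extractor and decoder state; this only illustrates the proposed attribute):

```python
class Stream:
    """Illustrative sketch only, not sherpa's actual Stream class."""

    def __init__(self, context_size: int, blank_id: int = 0):
        # The hypothesis starts with context_size blank tokens
        # (see the discussion about context_size below)
        self.hyp = [blank_id] * context_size

        # Number of trailing frames decoded to blank; reset to 0
        # whenever a non-blank token is emitted
        self.num_trailing_silence_frames = 0
```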


> Where in the streaming decoding pipeline should we add this? e.g. Where in the code can I access the blank frames?

There are two places where we can do this.

(1) In Python, which is more straightforward, but the computed count is only approximate

We can change https://github.com/k2-fsa/sherpa/blob/d958043f1680bdef666a97699d2aaa9fcd4c91fa/sherpa/bin/streaming_pruned_transducer_statelessX/beam_search.py#L187

to

if len(stream.hyp) != len(next_hyp_list[i]):
  # At least one new non-blank token was decoded in this chunk
  stream.num_trailing_silence_frames = 0
else:
  # All frames in this chunk were decoded to blank
  stream.num_trailing_silence_frames += int(encoder_out_lens[i])

stream.hyp = next_hyp_list[i]

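Once the counter is maintained, an endpoint can be declared as soon as the trailing silence is long enough. A hypothetical sketch (the function name, the default frame shift, and the threshold are illustrative, not part of sherpa):

```python
def detect_endpoint(num_trailing_silence_frames: int,
                    frame_shift_ms: float = 10.0,
                    min_trailing_silence_ms: float = 500.0) -> bool:
    """Return True if the trailing silence exceeds the threshold.

    A sketch assuming each encoder output frame covers frame_shift_ms
    milliseconds of audio.
    """
    trailing_silence_ms = num_trailing_silence_frames * frame_shift_ms
    return trailing_silence_ms >= min_trailing_silence_ms
```

For example, with 10 ms frames, 60 trailing blank frames correspond to 600 ms of silence and would trigger an endpoint at a 500 ms threshold.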
(2) In C++, it requires more effort but it is accurate. Taking streaming greedy search as an example:

https://github.com/k2-fsa/sherpa/blob/d958043f1680bdef666a97699d2aaa9fcd4c91fa/sherpa/csrc/rnnt_beam_search.cc#L240-L241

Whenever a new non-blank token is decoded, we reset the trailing-silence frame counter for the corresponding stream. We then need to return the counter from C++ to Python.
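In Python terms, the per-frame logic inside greedy search would look roughly like this (a sketch of the idea, not the actual C++ code in rnnt_beam_search.cc):

```python
def greedy_search_step(frame_tokens, hyp, num_trailing_silence, blank_id=0):
    """Process one chunk of per-frame decoding results.

    frame_tokens: the token decoded at each frame of this chunk.
    Returns the updated hypothesis and trailing-silence counter.
    """
    for token in frame_tokens:
        if token == blank_id:
            # Blank frame: extend the trailing silence
            num_trailing_silence += 1
        else:
            # Non-blank token: emit it and reset the counter
            hyp.append(token)
            num_trailing_silence = 0
    return hyp, num_trailing_silence
```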

Note: If you use fast_beam_search, changing the code in Python may be the only feasible way.

danpovey commented 2 years ago

The blank idea seems good to me. We might also want to wait longer if no non-blank symbols have been decoded so far (on the best path). In Kaldi, the endpointing decision combines a few inputs, such as the amount of trailing silence, the total utterance length, and whether anything has been decoded yet.
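A Kaldi-style rule along those lines might be sketched as follows (a hedged illustration; the names and thresholds are made up, not Kaldi's actual configuration values):

```python
def should_endpoint(trailing_silence_frames: int,
                    decoded_something: bool,
                    frame_shift_ms: float = 10.0) -> bool:
    """Combine trailing silence with whether anything was decoded yet."""
    trailing_silence_ms = trailing_silence_frames * frame_shift_ms
    if not decoded_something:
        # Nothing decoded so far: wait much longer before endpointing
        return trailing_silence_ms >= 5000.0
    # Something was decoded: a shorter trailing silence suffices
    return trailing_silence_ms >= 500.0
```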

Incidentally, in the longer term, I think it would be good to train some kind of neural network output that corresponds to an end-of-utterance probability, e.g. something like Google's version of RNN-T, which learns EOS as an extra symbol.

shaynemei commented 2 years ago

Thanks for the suggestion. I have two other questions regarding using the endpoint information:

  1. What is the plan for sending the endpoint to the client? e.g. how will it be incorporated into the hyp info?
  2. Should we have an option to not send out previous results at each endpoint? For example, resetting stream.hyp after each VAD segment is sent out as a final result?

https://github.com/k2-fsa/sherpa/blob/d958043f1680bdef666a97699d2aaa9fcd4c91fa/sherpa/bin/streaming_pruned_transducer_statelessX/beam_search.py#L354

This would be desirable for very long audio input, where we don't want the final result to include everything decoded so far, but instead want to send out each VAD segment as a final result.
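The reset in question 2 might be handled at each endpoint roughly like this (a sketch; finalize_segment is a hypothetical helper, and sp stands for the sentencepiece-style processor used to map token ids to text):

```python
def finalize_segment(stream, sp, context_size: int, blank_id: int = 0):
    """Send the current segment as a final result, then reset the stream.

    Illustrative only: stream is assumed to have .hyp and
    .num_trailing_silence_frames attributes as discussed above.
    """
    # Skip the initial context_size blank tokens (see below)
    final_text = sp.decode(stream.hyp[context_size:])

    # Reset so the next VAD segment starts from a fresh hypothesis
    stream.hyp = [blank_id] * context_size
    stream.num_trailing_silence_frames = 0
    return final_text
```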

Also, I'm confused about that line of code above. I read in the icefall recipes that the context_size means:

> Number of previous words to use to predict the next word. 1 means bigram; 2 means trigram. n means (n+1)-gram.

https://github.com/k2-fsa/icefall/blob/48a6a9a54918874736b9f5b0235c76b63e266807/egs/librispeech/ASR/transducer_stateless/decoder.py#L50-L52

But the stream.hyp slicing in that line seems to get all previous words starting from the context_size position?

csukuangfj commented 2 years ago

> But the stream.hyp slicing in that line seems to get all previous words starting from the context_size position?

We are getting the final results in that line. [context_size:] means to skip the first context_size blank tokens.

csukuangfj commented 2 years ago

https://github.com/k2-fsa/sherpa/blob/d958043f1680bdef666a97699d2aaa9fcd4c91fa/sherpa/csrc/rnnt_beam_search.cc#L73-L77

We are indeed only using the last context_size tokens to predict the next token.

But consider the very beginning: we have to add context_size blank tokens at the start so the decoder always has enough context. That is why we remove them when we get the final results.
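Concretely, here is a toy illustration of the padding and the final slicing (blank_id is assumed to be 0, and the emitted token ids are made up):

```python
context_size = 2
blank_id = 0

# At decoding start, the hypothesis is padded with context_size blanks
# so the stateless decoder always sees context_size tokens of context.
hyp = [blank_id] * context_size

# Suppose greedy search then emits two non-blank tokens
hyp += [13, 42]

# The final result skips the initial blank padding
result = hyp[context_size:]
```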