k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

What is the correct way to know if all frames of a stream have been decoded in python code? #411

Closed: trunglebka closed this 1 year ago

trunglebka commented 1 year ago

Currently, to fully decode a stream we need to add tail padding to make the last "user" chunk ready. But the number used here (0.3) is a magic value that may become incorrect if the chunk size is larger than 0.3 s plus the remaining chunk.

In my use case, I have largely decoupled the decoding process from the communication/app logic, and there is no clean way to detect whether a stream has completed (e.g., the client code is unable to call self.compute_and_decode), so it would be helpful to provide a way to know about such an event.

https://github.com/k2-fsa/sherpa/blob/da0d391638fcdd38e528f145ac79ebfda41e0dd8/sherpa/bin/streaming_server.py#L673-L683
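
For reference, those lines implement roughly this pattern (a paraphrased sketch, not the verbatim code; `stream`, `recognizer`, and `compute_and_decode` stand in for the server's own objects):

```python
import torch

sample_rate = 16000  # assumption: the server's sampling rate

# Paraphrased sketch of the linked streaming_server.py logic: append a fixed
# 0.3 s of zero samples so the final partial chunk becomes decodable.
tail_padding = torch.zeros(int(0.3 * sample_rate), dtype=torch.float32)
stream.accept_waveform(sampling_rate=sample_rate, waveform=tail_padding)
while recognizer.is_ready(stream):
    compute_and_decode(stream)  # run the model on each chunk that is ready
```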

csukuangfj commented 1 year ago

I decoupled decoding process and communication/app logic pretty much and there is no clean way to detect if a stream has been completed (E.g: client code is unable to call self.compute_and_decode) so it would be helpful to provide a way to know such event.

Sorry, I don't quite understand it. Could you please explain it a bit?

trunglebka commented 1 year ago

Well, that is my code-design choice, so you may not need to care (but if you want, I can send it via email). The main problem I want to point out is that there is no way to know whether a stream has finished decoding without padding tail frames (and how many frames are needed is a question too).

I've tested my code with 0.3 s of tail padding, and in some cases I'm unable to finish a stream (it seems to get stuck checking recognizer.is_ready), but if I increase it to 1 s it is OK.

csukuangfj commented 1 year ago

> and how many frames are needed is a question too

https://github.com/k2-fsa/sherpa/blob/da0d391638fcdd38e528f145ac79ebfda41e0dd8/sherpa/csrc/online-transducer-model.h#L102 You can pad as many frames as the return value of the above function. (The padded samples should produce that number of feature frames.)
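
For example, assuming the default 10 ms fbank frame shift and a 16 kHz model, you could convert that frame count into padding samples like this (a rough sketch; `num_pad_frames` is a placeholder):

```python
import torch

sample_rate = 16000    # assumption: model trained on 16 kHz audio
frame_shift_s = 0.01   # assumption: default 10 ms fbank frame shift
num_pad_frames = 30    # placeholder: substitute the return value of the function above

# Convert the required number of feature frames into raw samples to pad.
pad_samples = int(num_pad_frames * frame_shift_s * sample_rate)
tail_padding = torch.zeros(pad_samples, dtype=torch.float32)
```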


> The main problem I want to point out is that there is no way to know whether a stream has finished decoding

If the server knows that the client has sent all the data and is_ready() returns false, we know it has reached the end.

(The client can send a message to the server saying that it won't send any more samples.)
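
A rough sketch of that protocol on the server side (the "Done" message and the helper names are illustrative, not part of sherpa's API):

```python
import torch

async def handle_connection(stream, recognizer, messages, compute_and_decode,
                            sample_rate=16000):
    """Illustrative end-of-stream handling. `messages` yields either raw
    sample tensors or the literal string "Done" sent by the client."""
    async for msg in messages:
        if msg == "Done":  # client promises not to send more samples
            break
        stream.accept_waveform(sampling_rate=sample_rate, waveform=msg)
        while recognizer.is_ready(stream):
            await compute_and_decode(stream)

    # All client data received: pad the tail so the final partial chunk is
    # decodable, then drain. Once is_ready() returns False here, decoding
    # of this stream is complete.
    tail = torch.zeros(int(0.3 * sample_rate), dtype=torch.float32)
    stream.accept_waveform(sampling_rate=sample_rate, waveform=tail)
    while recognizer.is_ready(stream):
        await compute_and_decode(stream)
```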

trunglebka commented 1 year ago

Maybe I have some misunderstanding. Please help me with the next question: does stream.accept_waveform() process frames immediately, or does it hand them off to some background thread that makes the processed features available later?

csukuangfj commented 1 year ago

> Does stream.accept_waveform() process frames immediately

accept_waveform() does not return until it has processed all the input samples. After it returns, if there are enough feature frames, is_ready() will return true; otherwise, it will return false.

csukuangfj commented 1 year ago

If is_ready() returns false and you never invoke accept_waveform() again, then is_ready() keeps returning false.
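
In other words, the contract looks like this (a small sketch reusing the names from this thread; `samples` is illustrative):

```python
import torch

samples = torch.zeros(1600, dtype=torch.float32)  # 0.1 s of illustrative audio

# accept_waveform() is synchronous: it returns only after it has buffered
# features for all of the given samples.
stream.accept_waveform(sampling_rate=16000, waveform=samples)

# is_ready() is a plain state query: it stays true while at least one full
# chunk of features is buffered, and only another accept_waveform() call can
# flip it back from false to true.
while recognizer.is_ready(stream):
    compute_and_decode(stream)
```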

trunglebka commented 1 year ago

Well, so I made a wrong assumption, since the Python decoding process can go above 100% CPU, and after accepting the tail-padding waveform the recognizer complains that the stream is not ready yet.

Is there any way we can get the model chunk size in Python?

csukuangfj commented 1 year ago

> Is there any way we can get the model chunk size in Python?

Yes, there are at least two ways to get that information.

(1) Without code changes.

Follow https://github.com/k2-fsa/sherpa/blob/c7f88048241a2d7d16be4fedeeccdc6b7525fabe/sherpa/csrc/online-zipformer-transducer-model.cc#L31 to see how to read that information from a model loaded with torch.jit.load() in Python.
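
A hedged sketch of approach (1); the attribute names below are illustrative, so check the linked C++ file for the ones sherpa actually reads:

```python
import torch

# The filename is an example; use your exported torchscript checkpoint.
model = torch.jit.load("cpu_jit.pt", map_location="cpu")

# Probe the exported encoder for chunk-related attributes. The exact names
# depend on how the model was exported from icefall.
encoder = getattr(model, "encoder", model)
for name in ("decode_chunk_size", "chunk_length", "pad_length"):
    if hasattr(encoder, name):
        print(name, "=", getattr(encoder, name))
```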

(2) With code changes. Wrap https://github.com/k2-fsa/sherpa/blob/c7f88048241a2d7d16be4fedeeccdc6b7525fabe/sherpa/csrc/online-transducer-model.h#L102 so that it is exposed to Python. This requires a basic knowledge of Pybind11.

trunglebka commented 1 year ago

Thank you. However, I feel that hard-coding that number may cause a conflict at some point later if icefall changes the subsampling factor.