k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

Question Clarification on streaming decoding for HLG #1242

Open kbramhendra opened 11 months ago

kbramhendra commented 11 months ago

Hi, I am using Conformer CTC with HLG decoding for streaming, following the implementation mentioned in #1218 (online_decode.py). When decoding longer calls I see higher latency. My question: does the current_state_info object for OnlineIntersect hold the history of all previous chunks, or only a few recent ones? It appears to keep the full history, and I get OOM for long calls (> 20 min). Do I have to implement endpointing (or something similar) for the online intersector, or does it take care of that automatically? It seems it does not. Could you please shed some light on this? Thank you.

pkufool commented 11 months ago

Yes, both of our GPU online decoding implementations (RNN-T and online CTC) keep the full history of previous chunks, so they suffer from higher latency and OOM on long utterances. I think you need an endpointer for long audio.
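The endpointing idea above can be sketched in plain Python. This is a hedged illustration, not the k2 API: `stream_with_endpointing`, the blank-counting rule, and the list standing in for a fresh state_info are all stand-ins. The point is only the control flow: when trailing blanks exceed a threshold, emit the segment and reset the decoder state so history (and memory) stays bounded.

```python
# Hedged sketch (NOT the k2 API): endpointing for long streaming audio.
# A list of token ids stands in for the intersector's state_info; the
# blank-run counter stands in for a real endpoint detector.

TRAILING_BLANK_LIMIT = 20  # consecutive blank frames that trigger an endpoint


def stream_with_endpointing(chunks, blank_id=0):
    state = []            # stand-in for a fresh DecodeStateInfo
    trailing_blanks = 0
    results = []
    for chunk in chunks:  # each chunk: list of per-frame best token ids
        state.extend(chunk)   # stand-in for running the online intersector
        for tok in chunk:
            trailing_blanks = trailing_blanks + 1 if tok == blank_id else 0
        if trailing_blanks >= TRAILING_BLANK_LIMIT:
            # Endpoint reached: finalize this segment, then start fresh so
            # state (and memory) does not grow with the call length.
            results.append([t for t in state if t != blank_id])
            state = []
            trailing_blanks = 0
    if state:  # flush whatever is left at the end of the audio
        results.append([t for t in state if t != blank_id])
    return results
```

With real k2 decoding, "start fresh" would mean constructing a new state_info for the stream rather than clearing a list.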

kbramhendra commented 11 months ago

Thank you for the answer, really helpful.

kbramhendra commented 9 months ago

Hi, in this online decoding, how can I keep only the previous chunk's history? That is, how do I update the decoder states so that only the most recent history is retained? Can I use the pop function? Also, how do I find the length and size of the decoder states? The sizeof and len functions always return the constants 48 and 1.

pkufool commented 9 months ago

@kbramhendra Which decoder states? Could you point me to the code? Sorry, I don't understand what you mean by "only keeps previous history"; could you explain further? An example would help. Thanks!

kbramhendra commented 9 months ago

@pkufool Apologies for the lack of clarity. I am using Conformer CTC with HLG decoding for streaming, following the implementation mentioned in https://github.com/k2-fsa/k2/pull/1218 (online_decode.py, lines 175 to 179). From your earlier explanation I understood that the current_state_infos object carries the history of all previous chunks, which is why I was getting OOM and increased latency on long calls. I tried endpointing, and it worked for me.

Now I am trying to explore the effect of the previous-chunk history: I want to keep only the immediately preceding chunk's history instead of everything since the start (or since the last endpoint). My question is: how do I dynamically update current_state_infos so that it holds only the previous chunk's history? How can I achieve this?

The current_state_infos object has pop and delitem methods, among others. I tried pop but got some errors. Is this the right approach?

pkufool commented 9 months ago

@kbramhendra Sorry for the late reply.

I tried endpointing, and it worked for me.

So when you meet an endpoint, you initialize a new state_info, right?

How do I dynamically update current_state_infos so that it holds only the previous chunk's history?

To my knowledge, it is hard to keep only the previous chunk's history. The state_info is actually a RaggedTensor indexed as [frame][state] and [frame][state][arc]; see https://github.com/k2-fsa/k2/blob/45450bff1fe3e2d4a4654ee7698b04c41740e872/k2/csrc/intersect_dense_pruned.h#L118-L138

To keep only the previous chunk, you would have to slice along the frame dimension. That is doable, but I am afraid FormatOutput (which generates the lattice) requires the first state of state_info to be the start state of the decoding graph, so such slicing might cause a failure when generating the lattice.
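The invariant described above can be illustrated with a pure-Python stand-in (this is not k2 code; the nested lists only mimic the [frame][state] axes of the RaggedTensor, and state id 0 stands for the graph's start state). Slicing off early frames discards frame 0, so the structure no longer begins at the start state that lattice generation expects.

```python
# Hedged pure-Python illustration (NOT k2 code) of why slicing the frame
# axis of state_info is risky. Lattice generation is assumed to require
# that the first frame holds the decoding graph's start state (id 0 here).

state_info = [
    [0],          # frame 0: start state of the decoding graph
    [0, 3, 7],    # frame 1: states reached after one chunk
    [3, 7, 9],    # frame 2
    [9, 12],      # frame 3
]

# Keeping only the most recent chunk (say frames 2..3) drops frame 0,
# so the sliced structure no longer begins at the start state:
sliced = state_info[2:]
assert sliced[0][0] != 0  # the invariant the lattice builder relies on is broken
```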

From my point of view, the history of previous chunks makes little difference to the final result; the CTC system does not depend on previous frames. I think you can simply initialize a new state_info when you meet an endpoint.

One easy way to explore the effect of previous chunks is to initialize the state_info with the state_info of the previous segment (I mean the segments split by endpoints). This way you can keep several previous chunks of history, and I guess it won't raise any errors.
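The seeding idea just described can be sketched as follows. This is a hedged stand-in, not the k2 API: `start_new_segment`, the string "states", and the deque-based cap are all illustrative. The point is that each new segment starts from a bounded window of previous segments' states rather than from scratch or from the full call history.

```python
# Hedged sketch (NOT the k2 API): at each endpoint, seed the new segment's
# initial state from the previous segments' final states, keeping a bounded
# window of history instead of the whole call.
from collections import deque

MAX_SEGMENTS = 2  # retain history from at most this many previous segments

history = deque(maxlen=MAX_SEGMENTS)  # old entries are evicted automatically


def start_new_segment(prev_segment_final_state):
    """Record the finished segment's state and build the next initial state."""
    history.append(prev_segment_final_state)
    # The new segment's initial state is seeded from the retained window:
    return [s for segment in history for s in segment]


seg1_final = ["s1", "s2"]
seg2_final = ["s3"]
init2 = start_new_segment(seg1_final)  # seeded with segment 1 only
init3 = start_new_segment(seg2_final)  # seeded with segments 1 and 2
```

In real k2 code the seeding step would construct the new segment's state_info from the previous segment's state_info object rather than concatenating lists.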

The current_state_infos object has pop and delitem methods, among others. I tried pop but got some errors. Is this the right approach?

current_state_infos is a list of state_info objects, one per decoding stream, so pop and del operate at the sequence (stream) level, not the frame level. If you want to pop or delete at the frame level, you would have to add some C++ code for that. But I don't think it makes sense to do so; see my comments above.
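To make the distinction concrete: since current_state_infos is a plain Python list with one entry per stream, pop removes an entire stream's state, never individual frames inside one state. The string entries below are illustrative stand-ins for state_info objects.

```python
# Hedged sketch: current_state_infos is a Python list with one entry per
# decoding stream, so pop/del act at the stream level. The strings here
# are stand-ins for real state_info objects.

current_state_infos = ["state_A", "state_B", "state_C"]  # one per stream

# When stream 1 finishes (e.g. its audio ended), remove its whole state:
finished = current_state_infos.pop(1)

# There is no per-frame pop at this level: trimming frames *inside* one
# state_info would require changes to the underlying C++ RaggedTensor code.
```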

kbramhendra commented 9 months ago

@pkufool Thank you for the detailed explanation. I highly appreciate it.

So when you meet an endpoint, you initialize a new state_info, right?

Yes, I am initializing a new state.

My actual goal here is to increase the number of streams I can process; right now I can only handle 100 to 150 streams. I suspect these current_state_infos could be the blocker to scaling further.

From my point of view, the history of previous chunks makes little difference to the final result; the CTC system does not depend on previous frames. I think you can simply initialize a new state_info when you meet an endpoint.

As you mention, since CTC does not depend on previous history, I am trying to keep only the previous chunk's history. I could reset the state after every chunk, but that gives poor results.

Nevertheless, thanks for the reply. I will see what I can do to increase the number of streams.