k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

Big gap in WER between online and offline CTC decoding #1194

Open chiendb97 opened 1 year ago

chiendb97 commented 1 year ago

I tried offline decoding using hlg_decode.cu and online decoding using online_decode.cu. Here are the results:

Could you please tell me the difference between offline decoding and online decoding? In addition, could you tell us what results the two kinds of decoding should give? Thanks!

danpovey commented 1 year ago

There are examples in Sherpa of real-time/streaming/online decoding, I think that might be a better starting point? Normally you need to use a model that has been trained with streaming in mind.

chiendb97 commented 1 year ago

> There are examples in Sherpa of real-time/streaming/online decoding

Can you please specify which example it is? I looked into the sherpa repo but did not find any examples of CTC-based streaming.

> Normally you need to use a model that has been trained with streaming in mind.

I used the same AM output for both offline and streaming decoding. I don't think the gap can be that big.

pkufool commented 1 year ago

> Can you please specify which example it is? I looked into the sherpa repo but did not find any examples of CTC-based streaming.

Sorry, there is no CTC HLG streaming decoding example in Sherpa; the only one is in k2/torch/bin (I think it is the online_decode.cu you used).

> I used the same AM output for both offline and streaming decoding. I don't think the gap can be that big.

We normally test the streaming decoding method with a streaming model; maybe you can try online_decode.cu with a streaming model. An offline model is not suitable for a streaming decoding method.

danpovey commented 1 year ago

But @pkufool I think that binary just evaluates the nnet for the entire file and simulates streaming, so surely it should in principle give the same results as the offline decoding if it was given a non-streaming model? (Even though this would not be useful in practice).
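
A minimal sketch of the simulation Dan describes, assuming a hypothetical decoder interface (`accept_chunk` and `get_best_path` are illustrative stand-ins, not the actual k2 API): the acoustic model sees the whole file at once, and only the decoder consumes its output chunk by chunk, so with a non-streaming model any WER gap must come from the decoder itself.

```python
import torch

def simulate_streaming_decode(online_decoder, nnet_output: torch.Tensor,
                              chunk_size: int = 30):
    """Simulate streaming: nnet_output is (num_frames, vocab_size) log-probs
    computed over the entire file; only the decoder works chunk by chunk."""
    num_frames = nnet_output.size(0)
    for start in range(0, num_frames, chunk_size):
        chunk = nnet_output[start:start + chunk_size]
        is_final = start + chunk_size >= num_frames
        online_decoder.accept_chunk(chunk, is_final=is_final)  # hypothetical
    return online_decoder.get_best_path()                      # hypothetical
```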

chiendb97 commented 1 year ago

@pkufool @danpovey Here is how I tested: I read the audio file and evaluated the nnet output for the entire audio, then used that output to simulate streaming as in online_decode.cu, computing the WER from the final text result. I ran the test twice, using the conformer CTC model from icefall and my own conformer CTC model (trained with wenet). In both cases the results were not as good as offline decoding. I also printed out the lattice (lattice.fsa.values) of the online decoder and noticed that the first few lattices are much the same as the offline decoder's, but then they start to differ.

danpovey commented 1 year ago

hm, how did it differ? @pkufool do you think there is possibly a bug that is affecting him? @chiendb97 what version of k2 are you using? see if a newer version helps.

chiendb97 commented 1 year ago

> what version of k2 are you using? see if a newer version helps.

I am using the latest version of k2.

pkufool commented 1 year ago

> @pkufool do you think there is possibly a bug that is affecting him?

Yes, I think there could be some bugs. I will look into the code.

svandiekendialpad commented 1 year ago

I am currently experiencing the same issue. Offline decoding is fine but any form of streaming using OnlineDenseIntersecter increases WER by an unreasonable amount with almost all new errors coming from deletions.

pkufool commented 1 year ago

> I am currently experiencing the same issue. Offline decoding is fine but any form of streaming using OnlineDenseIntersecter increases WER by an unreasonable amount with almost all new errors coming from deletions.

OK, I am debugging it.

svandiekendialpad commented 1 year ago

Any updates @pkufool?

pkufool commented 1 year ago

> Any updates @pkufool?

Sorry, I did not fix it that day and then forgot about it; I will return to it.

pkufool commented 1 year ago

@svandiekendialpad @chiendb97 Does the difference only happen when using --use_ctc_decoding=false (i.e., decoding with an n-gram)?
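
For context, the two modes differ only in the decoding graph. A sketch of how the two graphs are typically built (the vocabulary size and the HLG.pt path are illustrative; k2.ctc_topo and k2.Fsa.from_dict are the k2 Python API):

```python
import torch
import k2

# --use_ctc_decoding=true: decode with a plain CTC topology, no n-gram LM.
ctc_topo = k2.ctc_topo(max_token=500, modified=False,
                       device=torch.device("cpu"))

# --use_ctc_decoding=false: decode with an HLG graph, i.e. the CTC topology H
# composed with a lexicon L and an n-gram LM G (path is just an example).
HLG = k2.Fsa.from_dict(torch.load("data/lang/HLG.pt", map_location="cpu"))
```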

binhtranmcs commented 1 year ago

Hi @pkufool, I just ran the tests again using the librispeech conformer ctc model; here are the results:

So I think there is still a significant difference between the online and offline implementations regardless of whether an n-gram is used (though the gap is smaller).

svandiekendialpad commented 1 year ago

I can confirm what @binhtranmcs said. It all points to a bug in the online decoding code.

pkufool commented 1 year ago

@binhtranmcs I think https://github.com/k2-fsa/k2/pull/1218 solves some problems, but there are still differences between the lattices generated in online and offline mode. I now know it relates to the pruning; I am trying to fix it.

pkufool commented 1 year ago

@danpovey I think one issue is: in offline mode the forward pass always runs before the backward pass (i.e. when we expand the arcs at step t, frames[t] has not been pruned by the backward pass), but in the current online implementation, when we expand at step t (where t is the last frame of the previous chunk), frames[t] has already been pruned by the backward pass of the previous chunk. This is the only difference I found after reading the code carefully.

danpovey commented 1 year ago

Does the backward pass start with -(forward score) on all active states? That's how it is supposed to work.

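A toy illustration (not k2 code) of the invariant Dan describes: at an intermediate pruning boundary, each active state's backward score should be seeded with minus its forward score, so that forward + backward is exactly 0 for every active state and no partial path is pruned merely because its future has not been seen yet:

```python
import torch

forward_scores = torch.tensor([-1.2, -3.4, -7.8])  # one score per active state
backward_scores = -forward_scores                   # the intended seeding

totals = forward_scores + backward_scores           # all zeros by construction
best = totals.max()
output_beam = 8.0
keep = totals >= best - output_beam                 # nothing gets pruned here
print(keep)  # tensor([True, True, True])
```

If the backward scores were instead left over from the previous chunk's backward pass, frontier states could fall outside the beam and be pruned too early, which would match the deletion errors reported above.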

binhtranmcs commented 1 year ago

Hi @danpovey, as I want to understand the code, could you please point me to some references for the online/offline decoding algorithm implemented here? Since I am pretty new to this, it would really help a lot. Thanks in advance.

danpovey commented 1 year ago

I think it is described in my paper about exact lattices, or at least mentioned there: pruned Viterbi beam search with some extensions to store a lattice. The guys have discovered the problem, but IDK if they have made the fix public yet.

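For readers wanting the background: the paper Dan refers to is presumably "Generating exact lattices in the WFST framework" (Povey et al., ICASSP 2012). A toy sketch of beam-pruned Viterbi search, the core idea being extended here (this illustration assumes an acceptor-style graph where every arc consumes exactly one frame, and it keeps only the best backpointer per state; a real lattice keeps all surviving backpointers):

```python
def viterbi_beam(arcs, start_state, log_probs, beam=10.0):
    """arcs: dict mapping state -> list of (next_state, label, weight).
    log_probs: T x num_labels acoustic log-probabilities.
    Returns (score, backpointers) of the best surviving path, or None."""
    active = {start_state: (0.0, [])}  # state -> (score, backpointers)
    for t in range(len(log_probs)):
        expanded = {}
        for state, (score, path) in active.items():
            for dst, label, weight in arcs.get(state, []):
                new_score = score + weight + log_probs[t][label]
                if dst not in expanded or new_score > expanded[dst][0]:
                    expanded[dst] = (new_score, path + [(state, dst, label)])
        if not expanded:
            return None
        best = max(s for s, _ in expanded.values())
        # Beam pruning: drop states too far below the current best score.
        active = {st: v for st, v in expanded.items() if v[0] >= best - beam}
    return max(active.values(), key=lambda v: v[0])
```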

pkufool commented 1 year ago

@binhtranmcs @svandiekendialpad @chiendb97 I think https://github.com/k2-fsa/k2/pull/1218 can fix this issue, you can try it on your dataset.

binhtranmcs commented 1 year ago

@pkufool, I just tested again with librispeech conformer ctc, using online_decode.cu:

WER for online HLG decoding did decrease (from 18% down to 12%), but it is not as good as offline decoding (3.49%). I think there are still problems here.

svandiekendialpad commented 1 year ago

For me it went up from 33% to 45%, whereas 14% would be normal. Should I have used allow_partial anywhere? I just left it at its default (true in OnlineDenseIntersecter).

pkufool commented 1 year ago

@binhtranmcs @svandiekendialpad OK, so far I have only tested some bad cases; I will test the full test datasets.

binhtranmcs commented 1 year ago

Hi @pkufool, are there any updates on this?

danpovey commented 1 year ago

I think #1218 may be relevant to this. Not merged yet but says it is ready.

pkufool commented 1 year ago

> I think #1218 may be relevant to this. Not merged yet but says it is ready.

It's a pity that the fixes in #1218 cannot fix all the issues; I am still debugging it.

pkufool commented 1 year ago

I did some experiments on librispeech test-clean; here are the results. For ctc-decoding (decoding with a CTC topology), after applying the fixes in #1218 I get almost the same WERs for online and offline:

|              | Offline | Online (chunk=10) |
|--------------|---------|-------------------|
| Ctc-decoding | 2.99    | 2.92              |

For HLG decoding (decoding with an HLG), there is still a big difference between online and offline, mainly deletions at the ends of sentences:

|              | Offline | Online (chunk=10) | Online (chunk=30) | Online (chunk=50) | Online (chunk=30), decoding_graph.scores = 0.0 |
|--------------|---------|-------------------|-------------------|-------------------|------------------------------------------------|
| Hlg decoding | 2.77    | 19.06             | 6.93              | 5.13              | 3.02                                           |

I believe this is the issue of pruning at the boundary frames (as I mentioned above). When I set the output_beam (used in backward pruning) equal to the search_beam (used in forward pruning), I get the same results:

|              | Offline | Online (chunk=10) | Online (chunk=10), output_beam = search_beam |
|--------------|---------|-------------------|----------------------------------------------|
| Hlg decoding | 2.77    | 19.06             | 2.73                                         |

I need to revisit the implementation carefully to figure out the proper fix for this issue; for now I think you can try using the same output_beam and search_beam.
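
A sketch of that workaround using k2's OnlineDenseIntersecter (argument names follow the Python wrapper as of this thread; the beam and active-state values are illustrative, and HLG is assumed to be loaded already):

```python
import k2

search_beam = 20.0
intersecter = k2.OnlineDenseIntersecter(
    decoding_graph=HLG,       # assumed loaded, e.g. via k2.Fsa.from_dict
    num_streams=1,
    search_beam=search_beam,
    output_beam=search_beam,  # the workaround: match the two beams
    min_active_states=30,
    max_active_states=10000,
)
```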

[edit:] BTW, I added Python test code in #1218 (online_decode.py and hlg_decode.py) which accepts a wav scp; you can then use simple-wer to calculate the WERs.

danpovey commented 1 year ago

@pkufool this makes me think that the backward scores have not been initialized correctly. They are supposed to be set to -(the forward score) when we do "intermediate" pruning (i.e. pruning not at the end of the file). If that is done, it should be OK to prune using "output_beam". I suspect that something is not working right in this respect: for example, they are not being set to that value, or they are being overwritten somehow, or something to do with a final-state is not correct.

pkufool commented 1 year ago

@binhtranmcs @svandiekendialpad @chiendb97 I updated #1218; I think this time it should fix your issue.

svandiekendialpad commented 1 year ago

@pkufool I'm trying to replicate your results; for now I still have a very high error rate due to deletions, so I am investigating whether my custom decoder implementation has a bug.

However, could you send me a short code snippet showing how you set the decoding graph scores to 0.0? I just set HLG.scores = torch.zeros(HLG.scores.shape) and it leads to an AssertionError in parse_timestamps_and_texts, where I end up with fewer index_pairs than words/tokens. This doesn't happen when the scores aren't zero.

desh2608 commented 1 year ago

I think you can simply do HLG.scores *= 0. I guess HLG.scores is a RaggedTensor and so its shape attribute actually refers to an underlying RaggedShape (and not a torch Tensor).

svandiekendialpad commented 1 year ago

For me HLG.scores is a torch.Tensor.
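
Given that, a minimal version of desh2608's suggestion, zeroing the scores in place (svandiekendialpad reported above that assigning a brand-new tensor triggered an AssertionError in his setup, so the in-place form seems the safer bet):

```python
# HLG.scores is a torch.Tensor on a k2.Fsa; scale it to zero in place.
HLG.scores *= 0
```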

pkufool commented 1 year ago

@svandiekendialpad I did test the fixes on test-clean with the librispeech conformer ctc model and got 2.73% for online decoding (with online_decode.py). Can you try your test set with my script (i.e. online_decode.py in #1218)? Let me know if you run into any trouble, thanks!

videodanchik commented 11 months ago

Hi @pkufool, thanks for your effort on resolving this issue. I've downloaded the librispeech conformer ctc model and the latest librispeech zipformer (trained with both CTC and RNN-T losses). I decoded test-clean and test-other with both models, online (chunk = 15) and offline, before and after the fix from https://github.com/k2-fsa/k2/pull/1218.

Results before the fix (HLG decoding is shown for several acoustic model scales, where acoustic_model_weight = 1 / lm_model_weight):

| decoding type          | test_clean (conformer / zipformer) | test_other (conformer / zipformer) |
|------------------------|------------------------------------|------------------------------------|
| H online               | 2.86 / 2.35                        | 7.46 / 5.67                        |
| HLG online am scale 1  | 20.87 / 20.90                      | 21.09 / 20.62                      |
| HLG online am scale 2  | 23.14 / 23.89                      | 23.17 / 23.16                      |
| HLG online am scale 3  | 23.70 / 24.69                      | 23.68 / 23.83                      |
| HLG online am scale 4  | 23.97 / 25.13                      | 23.89 / 24.18                      |
| H offline              | 2.86 / 2.35                        | 7.46 / 5.67                        |
| HLG offline am scale 1 | 2.68 / 2.60                        | 6.43 / 5.42                        |
| HLG offline am scale 2 | 2.70 / 2.39                        | 6.36 / 5.12                        |
| HLG offline am scale 3 | 2.71 / 2.39                        | 6.47 / 5.14                        |
| HLG offline am scale 4 | 2.73 / 2.40                        | 6.54 / 5.16                        |

Results after the fix:

| decoding type          | test_clean (conformer / zipformer) | test_other (conformer / zipformer) |
|------------------------|------------------------------------|------------------------------------|
| H online               | 2.86 / 2.35                        | 7.46 / 5.67                        |
| HLG online am scale 1  | 2.68 / 2.59                        | 6.43 / 5.42                        |
| HLG online am scale 2  | 2.70 / 2.39                        | 6.36 / 5.12                        |
| HLG online am scale 3  | 2.72 / 2.39                        | 6.47 / 5.15                        |
| HLG online am scale 4  | 2.74 / 2.40                        | 6.56 / 5.17                        |
| HLG online am scale 5  | 2.74 / 2.40                        | 6.61 / 5.21                        |
| H offline              | 2.86 / 2.35                        | 7.46 / 5.67                        |
| HLG offline am scale 1 | 2.68 / 2.59                        | 6.43 / 5.42                        |
| HLG offline am scale 2 | 2.70 / 2.39                        | 6.36 / 5.12                        |
| HLG offline am scale 3 | 2.71 / 2.39                        | 6.47 / 5.14                        |
| HLG offline am scale 4 | 2.73 / 2.40                        | 6.55 / 5.16                        |
| HLG offline am scale 5 | 2.74 / 2.40                        | 6.58 / 5.19                        |

So, online decoding works well now. I also went through the code with @svandiekendialpad and we sorted things out; everything works as expected. @pkufool can we consider merging https://github.com/k2-fsa/k2/pull/1218 to master, as this is a really important fix? I see you were asked in https://github.com/k2-fsa/k2/pull/1218 to add the allow-partial option for k2.intersect and k2.intersect_device; is it possible to elaborate on this or merge it as is?

pkufool commented 11 months ago

@videodanchik Thanks very much for the testing! Yes, I will have a look at the failed CI tests and merge it.

> I see you were asked in https://github.com/k2-fsa/k2/pull/1218 to add the allow-partial option for k2.intersect and k2.intersect_device; is it possible to elaborate on this or merge it as is?

Actually, I have not started this work yet; I will make a separate PR later.