Closed: zhu-han closed this issue 3 years ago.
That's interesting. It's possible that it could be a bug in k2, but there are many places it could be. I checked the documentation for torch.nn.CTCLoss but it is a little vague so it's hard to know whether they are attempting to implement the same thing as us. One thing you could do which would be helpful to us is to try to evaluate k2's version of the loss given the model trained with PyTorch. If it looks similar to PyTorch's loss, it would likely indicate a bug in computing derivatives.
Also it would be nice if someone could compute the sum of the derivative (.grad) of our CTC loss and make sure the sum on each frame of each sequence is close to 1.0. [if we can somehow access the .grad w.r.t. the nnet output].
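A minimal PyTorch sketch of this kind of per-frame check (toy sizes and random data, all names illustrative), using `torch.nn.CTCLoss` as a stand-in since the k2 wrapper isn't shown here. Note that, as comes up later in the thread, PyTorch fuses the log-softmax derivative into its CTC backward, so its per-frame sums come out near zero, while a loss that differentiates the log-probs directly gives sums of -1 per frame:

```python
import torch

# Back-propagate a CTC loss and sum the gradient w.r.t. the log-softmax
# output over the class axis, per frame.
T, N, C = 50, 4, 20                      # frames, batch size, classes (toy)
torch.manual_seed(0)
activation = torch.randn(T, N, C, requires_grad=True)
log_probs = activation.log_softmax(dim=-1)
log_probs.retain_grad()                  # keep .grad of this non-leaf tensor

targets = torch.randint(1, C, (N, 10))   # labels 1..C-1 (0 is the blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

loss = torch.nn.CTCLoss(blank=0, reduction="sum")(
    log_probs, targets, input_lengths, target_lengths)
loss.backward()

# Per-frame sums of the gradient w.r.t. log_probs, shape (T, N).
# For torch.nn.CTCLoss these come out near 0 (normalization is fused).
frame_sums = log_probs.grad.sum(dim=-1)
print(frame_sums.abs().max())
```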
Using a randomly initialized Transformer model, the losses for the first 10 iterations computed with K2CTCLoss and torch.nn.CTCLoss are:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 3379.74 | 3385.51 |
| 2 | 2644.70 | 2643.39 |
| 3 | 2760.64 | 2765.41 |
| 4 | 2593.41 | 2595.71 |
| 5 | 2360.99 | 2363.25 |
| 6 | 2351.16 | 2346.32 |
| 7 | 3471.35 | 3478.80 |
| 8 | 2540.69 | 2540.17 |
| 9 | 2953.67 | 2955.29 |
| 10 | 2190.49 | 2189.26 |
Using a Transformer model trained with torch.nn.CTCLoss for 10 epochs, the losses for the first 10 iterations computed with K2CTCLoss and torch.nn.CTCLoss are:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 862.52 | 35.42 |
| 2 | 782.07 | 35.21 |
| 3 | 870.66 | 39.65 |
| 4 | 804.01 | 27.49 |
| 5 | 821.23 | 37.25 |
| 6 | 806.56 | 36.18 |
| 7 | 717.92 | 31.09 |
| 8 | 705.32 | 32.66 |
| 9 | 829.99 | 28.65 |
| 10 | 749.98 | 27.37 |
It seems that we can get similar loss values with a randomly initialized model but not with a pretrained model.
Thanks a lot!! For the transformer model, can you clarify how you were training it? Was it with one of the two CTC losses?
And in the 2nd table, can you clarify if you were training with the same loss functions you were evaluating? What I want is for you to train with one loss and also evaluate the objectives with the other. To see if the actual loss calculation is the same (might be bug in derivative computation)
In the 2nd table, the pretrained transformer model is trained with torch.nn.CTCLoss only. And then the training and loss calculation used the same loss function.
OK, but what I want is for you to train with the torch loss and evaluate with k2 CTC loss, with the same model. So same code will evaluate 2 objectives.
With the random transformer model, what are the iterations? That is, what objective are you training with?
Sorry for the misunderstanding. When training with torch.nn.CTCLoss and also evaluating K2CTCLoss, the two loss values are the same, whether the model is randomly initialized or pretrained.
In the 1st table above (random Transformer model results), the two columns are from training with K2CTCLoss and torch.nn.CTCLoss as the objective, respectively.
So if you train with the PyTorch loss and evaluate also with the k2 one, you'll get the same value? Because in iteration 1 of your 2nd table, they're very different... if you showed iteration 0, would the k2 one be the same?
I checked the code and found a bug which accidentally made the two loss functions give the same value.
The real results are: with a randomly initialized model, the two losses are similar; with a pretrained model, the two losses are very different. I will paste the results below.
The training objective is torch.nn.CTCLoss, and the evaluation is performed with both K2CTCLoss and torch.nn.CTCLoss.
Randomly initialized model:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 3379.74 | 3385.51 |
| 2 | 2644.70 | 2643.39 |
| 3 | 2760.65 | 2765.41 |
| 4 | 2593.45 | 2595.71 |
| 5 | 2361.09 | 2363.25 |
| 6 | 2351.38 | 2346.32 |
| 7 | 3471.62 | 3478.80 |
| 8 | 2540.69 | 2540.17 |
| 9 | 2953.91 | 2955.29 |
| 10 | 2191.66 | 2189.26 |
Pretrained model:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 862.52 | 35.42 |
| 2 | 782.09 | 35.21 |
| 3 | 870.74 | 39.65 |
| 4 | 804.16 | 27.49 |
| 5 | 821.47 | 37.25 |
| 6 | 806.93 | 36.18 |
| 7 | 718.35 | 31.09 |
| 8 | 705.85 | 32.66 |
| 9 | 830.93 | 28.65 |
| 10 | 750.99 | 27.37 |
OK. Without seeing the code it will be hard to comment much further or help debug. Fanjun says he will try to debug the derivatives of the k2 loss over the weekend.
Thanks a lot for your help! If anyone is interested, my K2CTCLoss implementation is in https://github.com/zhu-han/espnet-k2/blob/main/espnet/nets/pytorch_backend/ctc_graph.py.
[Re-posting directly, mail is unreliable.] You are not using `indices` to sort the FSAs in the graphs. I'm not sure if our Fsa object has an `operator []` that can take a Tensor, but it might. Basically, your graphs are in the wrong order. You could also possibly reorder `targets` and `target_lengths` before compiling the graph.
Possibly `decoding_graph = k2.index(decoding_graph, indices)` would work (not sure though).
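The suggested fix can be sketched in plain PyTorch (variable names and values are illustrative; the real code lives in the espnet-k2 repository): compute the length-sort permutation and apply the same permutation to the per-utterance targets before compiling the CTC graphs.

```python
import torch

# Hypothetical sketch: `indices` is the permutation that sorts utterances
# by decreasing input length (as the dense-FSA supervision requires).
# Apply the SAME permutation to the targets before compiling the CTC
# graphs, so graph i really matches row i of the sorted network output.
targets = [[3, 1, 4], [1, 5], [9, 2, 6, 5]]       # per-utterance label ids
input_lengths = torch.tensor([7, 12, 9])

indices = torch.argsort(input_lengths, descending=True)
sorted_targets = [targets[i] for i in indices.tolist()]
sorted_target_lengths = torch.tensor([len(t) for t in sorted_targets])

print(indices.tolist())        # [1, 2, 0]
print(sorted_targets)          # [[1, 5], [9, 2, 6, 5], [3, 1, 4]]
```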
Thanks for your help! I will change my code accordingly and do the experiments.
@zhu-han, thanks for sharing your interesting report. I will also take a look at this. We (@brianyan918) are also working on comparing pytorch CTC, warpCTC, k2 CTC, and gtn CTC.
After fixing the graph-order issue, K2CTCLoss works with the Transformer now. With BPE 500 as the CTC modeling unit, the loss curve looks like: [loss curve image]
And the previous results make sense now. Before batching, the training samples are sorted by input length, so with a smaller batch size, all samples in a batch are more likely to have the same length. When the lengths are the same, the sorted text can happen to match the order of the unsorted graphs. In my experiments, the BLSTM had a smaller batch size than the Transformer (20 vs. 256), so the BLSTM suffered less from this bug. That's why the BLSTM could work while the Transformer could not in the previous results.
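The masking effect can be seen with a toy permutation (illustrative numbers only):

```python
# Why the bug was hidden for the BLSTM: when every utterance in a batch has
# the same input length, the sort-by-decreasing-length permutation is the
# identity, so the "unsorted" graphs line up with the sorted batch anyway.
def length_sort(lengths):
    # Stable permutation sorting utterances by decreasing length.
    return sorted(range(len(lengths)), key=lambda i: -lengths[i])

print(length_sort([80, 80, 80, 80]))   # [0, 1, 2, 3]: identity, bug masked
print(length_sort([73, 80, 64, 77]))   # [1, 3, 0, 2]: graphs misaligned
```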
Thanks a lot!
@sw005320 My revised K2CTCLoss is in https://github.com/zhu-han/espnet-k2/blob/main/espnet/nets/pytorch_backend/ctc_graph.py. I will be glad to help on this.
I just added gradients test for k2 CTC loss. Please see https://github.com/k2-fsa/k2/pull/577
It shows that k2 CTC loss is identical to PyTorch CTC loss and warp-ctc when they are given the same input.
The gradients of k2 and PyTorch are also the same.
Thanks! But since I found that models trained with the k2 CTC loss and the PyTorch CTC loss did have some differences, I added additional test cases based on `test_random_case1` in `ctc_gradients_test.py` to check it. Here are some results:

When I change `T` and `C` to match my experiment's setup, i.e., `T = 400` (a 16 s training sample with a 4x subsampling factor) and `C = 5000` (BPE 5000 as the CTC modeling unit), the test case fails. Specifically, the gradient check `assert torch.allclose(torch_activation.grad, k2_activation.grad, atol=1e-2)` fails. When I keep `T` as the original and only change `C` to 5000, the gradient check passes. But when I keep `C` and change the sample length `T` to 400, the gradient check fails again. It seems that with longer samples, the difference is larger.
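One way to probe whether a failure like this is plain float32 roundoff (rather than a genuinely different derivative) is to compare the float32 CTC gradient against a float64 run of the identical computation. A sketch with `torch.nn.CTCLoss` and modest toy sizes; the `T = 400`, `C = 5000` setup from the experiment could be probed the same way:

```python
import torch

def ctc_grad(dtype, T=100, C=50, S=20, seed=0):
    """CTC gradient w.r.t. the pre-softmax activations for one sequence."""
    g = torch.Generator().manual_seed(seed)
    base = torch.randn(T, 1, C, generator=g, dtype=torch.float64)
    targets = torch.randint(1, C, (1, S), generator=g)  # labels 1..C-1
    act = base.to(dtype).clone().requires_grad_(True)
    log_probs = act.log_softmax(dim=-1)
    loss = torch.nn.CTCLoss(blank=0, reduction="sum")(
        log_probs, targets, torch.tensor([T]), torch.tensor([S]))
    loss.backward()
    return act.grad.double()

# If the two runs differ by much more than float32 epsilon allows, the
# problem is roundoff accumulating with sequence length, not a wrong
# derivative formula.
diff = (ctc_grad(torch.float32) - ctc_grad(torch.float64)).abs().max()
print(diff)
```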
And these are the results I got on librispeech 100h using PyTorch CTC loss and k2 CTC loss:
PyTorch CTC loss:

| Criterion | Test clean | Test other |
| --- | --- | --- |
| CTC | 17.1 | 35.9 |
| Hybrid CTC/Attention | 10.3 | 27.1 |

k2 CTC loss:

| Criterion | Test clean | Test other |
| --- | --- | --- |
| CTC | 17.3 | 36.4 |
| Hybrid CTC/Attention | 10.6 | 27.5 |
Detailed setup: in `k2.intersect_dense()`, set `output_beam = 10.0`.
Cool! Regarding the gradient-check: sometimes there can be roundoff error that causes the posteriors on some frames to sum to a number different than 1. Can you compute those sums? I.e. the sum of the grad, per frame...
Given the same input, the PyTorch CTC gradient sum per frame is:
[ 0.0000e+00, 2.3842e-07, -3.5763e-07, -2.3842e-07, -3.5763e-07,...]
and the k2 CTC gradient sum per frame is:
[-1.1921e-06, -2.3842e-07, 1.0729e-06, 8.3447e-07, 4.7684e-07,...]
That must be prior to the softmax. Can you get it after the softmax?
Those were already the after-softmax results. For example, the torch gradient for one frame is:
[ -9.4860, 2.4738, 5.9179, 4.7736, 5.5900, 2.8961, 6.4206,
4.4688, 2.8942, 4.0882, -74.9657, 4.3691, 5.7488, 6.3485,
6.4876, 2.9647, 3.2492, 4.7775, 3.5132, 2.7532, 4.7165]
Its sum is 5.2452e-06.
k2 gradient of this same frame is:
[ -9.4859, 2.4738, 5.9179, 4.7736, 5.5900, 2.8961, 6.4206,
4.4688, 2.8942, 4.0882, -74.9657, 4.3691, 5.7488, 6.3485,
6.4876, 2.9647, 3.2492, 4.7775, 3.5132, 2.7532, 4.7165]
And its sum is -8.5831e-06.
These two gradients only have one different value: -9.4860 vs -9.4859 in the first dimension.
Doesn't look right. The gradient after the softmax should sum to one; it is equal to the posterior.
Oh, I misunderstood; I thought you meant the loss was computed prior to the softmax. I will update the results.
When I set the learning rate to 1 and use the k2 CTC loss, the per-frame gradient sum of the tensor after log_softmax is -1. I'm not sure whether that is what you want to check.
Yes that sounds right. See if the same is true of PyTorch's one; the error could be there.
For PyTorch, these values are near 0, e.g., [-4.7088e-6, -4.6492e-6, ...].
Ah, I guess it does the normalization internally. It's unlikely, IMO, that there is a roundoff problem in k2, given what you say. More likely in pytorch itself and the WER differences may be tuning-dependent, most likely.
For the simplest case,

```
# blk a b c d
activation = [0.2, 0.2, 0.2, 0.2, 0.2]
log_probs = log_softmax of activation
log_probs.retain_grad()
```

And if the target label is `a`: with the k2 CTC loss, `log_probs.grad` is `[0, -1, 0, 0, 0]` and `log_probs.grad.sum()` is -1; with PyTorch, `log_probs.grad` is `[0.2, -0.8, 0.2, 0.2, 0.2]` and `log_probs.grad.sum()` is 0.

PyTorch is obviously doing the log-softmax normalization as part of the CTC computation; in k2 those things are separate.
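The PyTorch side of this simplest case is easy to reproduce numerically (one frame, one target label, uniform activations; blank index 0 and label "a" as class 1, as in the example above):

```python
import torch

# One frame, five classes (blank + a..d), uniform activations.
activation = torch.zeros(1, 1, 5, requires_grad=True)
log_probs = activation.log_softmax(dim=-1)
log_probs.retain_grad()

loss = torch.nn.CTCLoss(blank=0, reduction="sum")(
    log_probs,
    torch.tensor([[1]]),     # target label "a" = class 1
    torch.tensor([1]),       # input length
    torch.tensor([1]))       # target length
loss.backward()

# PyTorch folds the log-softmax derivative into the CTC backward, so the
# grad w.r.t. log_probs is softmax(activation) - onehot(target), not the
# bare -onehot(target) that differentiating the log-probs alone would give.
print(log_probs.grad.flatten())
```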
Do we know of any difference in speed?
> We (@brianyan918) are also working on comparing pytorch CTC, warpCTC, k2 CTC, and gtn CTC.
@sw005320 Could you share the progress with us? Does the comparison include speed differences?
I tested these different CTC modes in espnet with these results on voxforge italian eval:
| Model | CER | WER |
| --- | --- | --- |
| Conformer (warpctc) | 8.5 | 30.0 |
| Conformer (pytorch) | 8.6 | 30.6 |
| Conformer (gtnctc) | 8.5 | 30.0 |
| Conformer (k2) | 8.7 | 30.8 |
Previously I was able to compare the speeds of pytorch vs warp vs gtn, but for k2 I used a different device. I'll provide an update with speed comparisons shortly.
When training on librispeech 100h for one epoch, the results are:
| Method | Time |
| --- | --- |
| PyTorch | 15.69 min |
| k2 | 17.78 min |
OK, thanks. Was that in debug or release mode? (It can be quite different.) In debug mode, there is a speed boost from doing `export K2_DISABLE_CHECKS=1` prior to running it. We have a lot of checking code active by default right now.
I followed https://k2.readthedocs.io/en/latest/installation.html#install-k2-from-source to install k2. Is this in release mode by default?
> cmake -DCMAKE_BUILD_TYPE=Release ..

If you followed it step by step, then it is a `Release` build.
Yes, it is in release mode then.
`python3 -m k2.version` should tell you whether k2 was built in Release mode or in Debug mode.
It shows `Build type: Release`.
OK. When was the code pulled? There may have been speed improvements.
Pulled on 2021/01/06.
OK, probably no speed optimizations since then.
This pull request, https://github.com/k2-fsa/k2/pull/571#issuecomment-755888081, merged on Jan 8, made `GetTransposeReordering` 2-3x faster than before. Not sure how it would affect the training speed.
Tried with the latest k2; the training time is similar. The previous training time was 17.78 min and the latest is 17.68 min.
Has anyone compared the performance of the k2 CTC loss implementation and the CTCLoss in PyTorch?
I tried to write a K2CTCLoss with k2 to replace torch.nn.CTCLoss and did some experiments using ESPnet. It shows there is a gap between K2CTCLoss and torch.nn.CTCLoss.
The experiments are conducted on Librispeech 100h and the training criterion is CTC only. Acoustic model is BLSTM or Transformer based encoder. For CTC modeling unit, I tried char and bpe 5000. Here are some conclusions of my experiments:
- K2CTCLoss could work with a BLSTM-based acoustic model, though torch.nn.CTCLoss reduces the loss faster;
- K2CTCLoss didn't work with the Transformer. When using BPE 5000 as the CTC modeling unit, the loss curve of K2CTCLoss looks like: [loss curve image]. In comparison, torch.nn.CTCLoss with the Transformer looks like: [loss curve image];
- The above conclusions are the same whether the CTC modeling unit is char or BPE 5000.
In snowfall, the CTC implementation is (1) acoustic feature -> phone -> word. I did an experiment using the K2CTCLoss with a (2) acoustic feature -> char structure, and the WERs are (1) 12.84% and (2) 15.99%, respectively. So I think the K2CTCLoss implementation should be fine.
Could anyone give me some advice on how to make it work better? And does anyone know why it can't work well with transformer? Thanks!