Closed: zhu-han closed this issue 3 years ago.
That's interesting. It's possible that it could be a bug in k2, but there are many places it could be. I checked the documentation for torch.nn.CTCLoss but it is a little vague so it's hard to know whether they are attempting to implement the same thing as us. One thing you could do which would be helpful to us is to try to evaluate k2's version of the loss given the model trained with PyTorch. If it looks similar to PyTorch's loss, it would likely indicate a bug in computing derivatives.
Also it would be nice if someone could compute the sum of the derivative (.grad) of our CTC loss and make sure the sum on each frame of each sequence is close to 1.0. [if we can somehow access the .grad w.r.t. the nnet output].
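A minimal PyTorch sketch of this kind of per-frame check (toy sizes and random data, all names illustrative), using `torch.nn.CTCLoss` as a stand-in since the k2 wrapper isn't shown here. Note that, as comes up later in the thread, PyTorch fuses the log-softmax derivative into its CTC backward, so its per-frame sums come out near zero, while a loss that differentiates the log-probs directly gives sums of -1 per frame:

```python
import torch

# Back-propagate a CTC loss and sum the gradient w.r.t. the log-softmax
# output over the class axis, per frame.
T, N, C = 50, 4, 20                      # frames, batch size, classes (toy)
torch.manual_seed(0)
activation = torch.randn(T, N, C, requires_grad=True)
log_probs = activation.log_softmax(dim=-1)
log_probs.retain_grad()                  # keep .grad of this non-leaf tensor

targets = torch.randint(1, C, (N, 10))   # labels 1..C-1 (0 is the blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

loss = torch.nn.CTCLoss(blank=0, reduction="sum")(
    log_probs, targets, input_lengths, target_lengths)
loss.backward()

# Per-frame sums of the gradient w.r.t. log_probs, shape (T, N).
# For torch.nn.CTCLoss these come out near 0 (normalization is fused).
frame_sums = log_probs.grad.sum(dim=-1)
print(frame_sums.abs().max())
```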
Using a randomly initialized Transformer model, the losses for the first 10 iterations computed with K2CTCLoss and torch.nn.CTCLoss are:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 3379.74 | 3385.51 |
| 2 | 2644.70 | 2643.39 |
| 3 | 2760.64 | 2765.41 |
| 4 | 2593.41 | 2595.71 |
| 5 | 2360.99 | 2363.25 |
| 6 | 2351.16 | 2346.32 |
| 7 | 3471.35 | 3478.80 |
| 8 | 2540.69 | 2540.17 |
| 9 | 2953.67 | 2955.29 |
| 10 | 2190.49 | 2189.26 |
Using a Transformer model trained with torch.nn.CTCLoss for 10 epochs, the losses for the first 10 iterations computed with K2CTCLoss and torch.nn.CTCLoss are:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 862.52 | 35.42 |
| 2 | 782.07 | 35.21 |
| 3 | 870.66 | 39.65 |
| 4 | 804.01 | 27.49 |
| 5 | 821.23 | 37.25 |
| 6 | 806.56 | 36.18 |
| 7 | 717.92 | 31.09 |
| 8 | 705.32 | 32.66 |
| 9 | 829.99 | 28.65 |
| 10 | 749.98 | 27.37 |
It seems that we can get similar loss values with a randomly initialized model but not with a pretrained model.
Thanks a lot!! For the transformer model, can you clarify how you were training it? Was it with one of the two CTC losses?
And in the 2nd table, can you clarify if you were training with the same loss functions you were evaluating? What I want is for you to train with one loss and also evaluate the objectives with the other. To see if the actual loss calculation is the same (might be bug in derivative computation)
In the 2nd table, the pretrained transformer model is trained with torch.nn.CTCLoss only. And then the training and loss calculation used the same loss function.
OK, but what I want is for you to train with the torch loss and evaluate with k2 CTC loss, with the same model. So same code will evaluate 2 objectives.
With the random transformer model, what are the iterations? That is, what objective are you training with?
Sorry for the misunderstanding. When training with torch.nn.CTCLoss and also evaluating K2CTCLoss, the two loss values are the same, whether the model is randomly initialized or pretrained.
In the 1st table above (random Transformer model results), the two columns are from training with K2CTCLoss and torch.nn.CTCLoss as the objective, respectively.
So if you train with the PyTorch loss and evaluate also with the k2 one, you'll get the same value? Because in iteration 1 of your 2nd table, they're very different... if you showed iteration 0, would the k2 one be the same?
I checked the code and found a bug which accidentally made the two loss functions give the same value.
The real results are: with a randomly initialized model, the two losses are similar; with a pretrained model, the two losses are very different. I will paste the results below.
The training objective is torch.nn.CTCLoss, and the evaluation is performed with both K2CTCLoss and torch.nn.CTCLoss.
Randomly initialized model:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 3379.74 | 3385.51 |
| 2 | 2644.70 | 2643.39 |
| 3 | 2760.65 | 2765.41 |
| 4 | 2593.45 | 2595.71 |
| 5 | 2361.09 | 2363.25 |
| 6 | 2351.38 | 2346.32 |
| 7 | 3471.62 | 3478.80 |
| 8 | 2540.69 | 2540.17 |
| 9 | 2953.91 | 2955.29 |
| 10 | 2191.66 | 2189.26 |
Pretrained model:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 862.52 | 35.42 |
| 2 | 782.09 | 35.21 |
| 3 | 870.74 | 39.65 |
| 4 | 804.16 | 27.49 |
| 5 | 821.47 | 37.25 |
| 6 | 806.93 | 36.18 |
| 7 | 718.35 | 31.09 |
| 8 | 705.85 | 32.66 |
| 9 | 830.93 | 28.65 |
| 10 | 750.99 | 27.37 |
OK. Without seeing the code it will be hard to comment much further or help debug. Fanjun says he will try to debug the derivatives of the k2 loss over the weekend.
Thanks a lot for your help! If anyone is interested, my K2CTCLoss implementation is in https://github.com/zhu-han/espnet-k2/blob/main/espnet/nets/pytorch_backend/ctc_graph.py.
[Re-posting directly, mail is unreliable.] You are not using `indices` to sort the FSAs in the graphs. I'm not sure if our Fsa object has an `operator []` that can take a Tensor, but it might. Basically, your graphs are in the wrong order. You could also possibly reorder `targets` and `target_lengths` before compiling the graph.
Possibly `decoding_graph = k2.index(decoding_graph, indices)` would work (not sure though).
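The suggested fix can be sketched in plain PyTorch (variable names and values are illustrative; the real code lives in the espnet-k2 repository): compute the length-sort permutation and apply the same permutation to the per-utterance targets before compiling the CTC graphs.

```python
import torch

# Hypothetical sketch: `indices` is the permutation that sorts utterances
# by decreasing input length (as the dense-FSA supervision requires).
# Apply the SAME permutation to the targets before compiling the CTC
# graphs, so graph i really matches row i of the sorted network output.
targets = [[3, 1, 4], [1, 5], [9, 2, 6, 5]]       # per-utterance label ids
input_lengths = torch.tensor([7, 12, 9])

indices = torch.argsort(input_lengths, descending=True)
sorted_targets = [targets[i] for i in indices.tolist()]
sorted_target_lengths = torch.tensor([len(t) for t in sorted_targets])

print(indices.tolist())        # [1, 2, 0]
print(sorted_targets)          # [[1, 5], [9, 2, 6, 5], [3, 1, 4]]
```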
Thanks for your help! I will change my code accordingly and do the experiments.
@zhu-han, thanks for sharing your interesting report. I will also take a look at this. We (@brianyan918) are also working on comparing pytorch CTC, warpCTC, k2 CTC, and gtn CTC.
After fixing the graph-order issue, K2CTCLoss works with the Transformer now. With BPE 500 as the CTC modeling unit, the loss curve looks like: [loss curve image]
And the previous results make sense now. Before batching, the training samples are sorted by input length, so with a smaller batch size, all samples in a batch are more likely to have the same length. When the lengths are the same, the sorted text can happen to match the order of the unsorted graphs. In my experiments, the BLSTM had a smaller batch size than the Transformer (20 vs. 256), so the BLSTM suffered less from this bug. That's why the BLSTM could work while the Transformer could not in the previous results.
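The masking effect can be seen with a toy permutation (illustrative numbers only):

```python
# Why the bug was hidden for the BLSTM: when every utterance in a batch has
# the same input length, the sort-by-decreasing-length permutation is the
# identity, so the "unsorted" graphs line up with the sorted batch anyway.
def length_sort(lengths):
    # Stable permutation sorting utterances by decreasing length.
    return sorted(range(len(lengths)), key=lambda i: -lengths[i])

print(length_sort([80, 80, 80, 80]))   # [0, 1, 2, 3]: identity, bug masked
print(length_sort([73, 80, 64, 77]))   # [1, 3, 0, 2]: graphs misaligned
```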
Thanks a lot!
@sw005320 My revised K2CTCLoss is in https://github.com/zhu-han/espnet-k2/blob/main/espnet/nets/pytorch_backend/ctc_graph.py. I will be glad to help on this.
I just added gradients test for k2 CTC loss. Please see https://github.com/k2-fsa/k2/pull/577
It shows that k2 CTC loss is identical to PyTorch CTC loss and warp-ctc when they are given the same input.
The gradients of k2 and PyTorch are also the same.
Thanks! But since I found that models trained with the k2 CTC loss and the PyTorch CTC loss did have some differences, I added additional test cases based on `test_random_case1` in `ctc_gradients_test.py` to check it. Here are some results:

When I change `T` and `C` to match my experiment's setup, i.e., `T = 400` (a 16 s training sample with a 4x subsampling factor) and `C = 5000` (BPE 5000 as the CTC modeling unit), the test case fails. Specifically, the gradient check `assert torch.allclose(torch_activation.grad, k2_activation.grad, atol=1e-2)` fails. When I keep `T` as the original and only change `C` to 5000, the gradient check passes. But when I keep `C` and change the sample length `T` to 400, the gradient check fails again. It seems that with longer samples, the difference is larger.
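One way to probe whether a failure like this is plain float32 roundoff (rather than a genuinely different derivative) is to compare the float32 CTC gradient against a float64 run of the identical computation. A sketch with `torch.nn.CTCLoss` and modest toy sizes; the `T = 400`, `C = 5000` setup from the experiment could be probed the same way:

```python
import torch

def ctc_grad(dtype, T=100, C=50, S=20, seed=0):
    """CTC gradient w.r.t. the pre-softmax activations for one sequence."""
    g = torch.Generator().manual_seed(seed)
    base = torch.randn(T, 1, C, generator=g, dtype=torch.float64)
    targets = torch.randint(1, C, (1, S), generator=g)  # labels 1..C-1
    act = base.to(dtype).clone().requires_grad_(True)
    log_probs = act.log_softmax(dim=-1)
    loss = torch.nn.CTCLoss(blank=0, reduction="sum")(
        log_probs, targets, torch.tensor([T]), torch.tensor([S]))
    loss.backward()
    return act.grad.double()

# If the two runs differ by much more than float32 epsilon allows, the
# problem is roundoff accumulating with sequence length, not a wrong
# derivative formula.
diff = (ctc_grad(torch.float32) - ctc_grad(torch.float64)).abs().max()
print(diff)
```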
And these are the results I got on librispeech 100h using PyTorch CTC loss and k2 CTC loss:
PyTorch CTC loss:

| Criterion | Test clean | Test other |
| --- | --- | --- |
| CTC | 17.1 | 35.9 |
| Hybrid CTC/Attention | 10.3 | 27.1 |

k2 CTC loss:

| Criterion | Test clean | Test other |
| --- | --- | --- |
| CTC | 17.3 | 36.4 |
| Hybrid CTC/Attention | 10.6 | 27.5 |
Detailed setup: in `k2.intersect_dense()`, set `output_beam = 10.0`.
Cool! Regarding the gradient-check: sometimes there can be roundoff error that causes the posteriors on some frames to sum to a number different than 1. Can you compute those sums? I.e. the sum of the grad, per frame...
Given the same input, the PyTorch CTC gradient sum per frame is:
[ 0.0000e+00, 2.3842e-07, -3.5763e-07, -2.3842e-07, -3.5763e-07,...]
and the k2 CTC gradient sum per frame is:
[-1.1921e-06, -2.3842e-07, 1.0729e-06, 8.3447e-07, 4.7684e-07,...]
That must be prior to the softmax. Can you get it after the softmax?
Those were already the after-softmax results. For example, the torch gradient for one frame is:
[ -9.4860, 2.4738, 5.9179, 4.7736, 5.5900, 2.8961, 6.4206,
4.4688, 2.8942, 4.0882, -74.9657, 4.3691, 5.7488, 6.3485,
6.4876, 2.9647, 3.2492, 4.7775, 3.5132, 2.7532, 4.7165]
Its sum is 5.2452e-06.
k2 gradient of this same frame is:
[ -9.4859, 2.4738, 5.9179, 4.7736, 5.5900, 2.8961, 6.4206,
4.4688, 2.8942, 4.0882, -74.9657, 4.3691, 5.7488, 6.3485,
6.4876, 2.9647, 3.2492, 4.7775, 3.5132, 2.7532, 4.7165]
And its sum is -8.5831e-06.
These two gradients only have one different value: -9.4860 vs -9.4859 in the first dimension.
Doesn't look right. The gradient after the softmax should sum to one; it is equal to the posterior.
Oh, I misunderstood; I thought you meant the loss was computed prior to the softmax. I will update the results.
When I set the learning rate to 1 and use the k2 CTC loss, the per-frame gradient sum of the tensor after log_softmax is -1. I'm not sure whether that is what you want to check.
Yes that sounds right. See if the same is true of PyTorch's one; the error could be there.
For PyTorch, these values are near 0, e.g., [-4.7088e-6, -4.6492e-6, ...].
Ah, I guess it does the normalization internally. It's unlikely, IMO, that there is a roundoff problem in k2, given what you say. More likely in pytorch itself and the WER differences may be tuning-dependent, most likely.
For the simplest case,

```
# blk a b c d
activation = [0.2, 0.2, 0.2, 0.2, 0.2]
log_probs = log_softmax of activation
log_probs.retain_grad()
```

And if the target label is `a`: with the k2 CTC loss, `log_probs.grad` is `[0, -1, 0, 0, 0]` and `log_probs.grad.sum()` is -1; with PyTorch, `log_probs.grad` is `[0.2, -0.8, 0.2, 0.2, 0.2]` and `log_probs.grad.sum()` is 0.

PyTorch is obviously doing the log-softmax normalization as part of the CTC computation; in k2 those things are separate.
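The PyTorch side of this simplest case is easy to reproduce numerically (one frame, one target label, uniform activations; blank index 0 and label "a" as class 1, as in the example above):

```python
import torch

# One frame, five classes (blank + a..d), uniform activations.
activation = torch.zeros(1, 1, 5, requires_grad=True)
log_probs = activation.log_softmax(dim=-1)
log_probs.retain_grad()

loss = torch.nn.CTCLoss(blank=0, reduction="sum")(
    log_probs,
    torch.tensor([[1]]),     # target label "a" = class 1
    torch.tensor([1]),       # input length
    torch.tensor([1]))       # target length
loss.backward()

# PyTorch folds the log-softmax derivative into the CTC backward, so the
# grad w.r.t. log_probs is softmax(activation) - onehot(target), not the
# bare -onehot(target) that differentiating the log-probs alone would give.
print(log_probs.grad.flatten())
```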
Do we know of any difference in speed?
> We (@brianyan918) are also working on comparing pytorch CTC, warpCTC, k2 CTC, and gtn CTC.
@sw005320 Could you share the progress with us? Does the comparison include speed differences?
I tested these different CTC modes in espnet with these results on voxforge italian eval:
| Model | CER | WER |
| --- | --- | --- |
| Conformer (warpctc) | 8.5 | 30.0 |
| Conformer (pytorch) | 8.6 | 30.6 |
| Conformer (gtnctc) | 8.5 | 30.0 |
| Conformer (k2) | 8.7 | 30.8 |
Previously I was able to compare the speeds of pytorch vs warp vs gtn, but for k2 I used a different device. I'll provide an update with speed comparisons shortly.
When training on librispeech 100h for one epoch, the results are:
| Method | Time |
| --- | --- |
| PyTorch | 15.69 min |
| k2 | 17.78 min |
OK, thanks. Was that in debug or release mode? (It can be quite different.) In debug mode, there is a speed boost from doing `export K2_DISABLE_CHECKS=1` prior to running it. We have a lot of checking code active by default right now.
I followed https://k2.readthedocs.io/en/latest/installation.html#install-k2-from-source to install k2. Is this in release mode by default?
> cmake -DCMAKE_BUILD_TYPE=Release ..

If you followed it step by step, then it is a `Release` build.
Yes, it is in release mode then.
`python3 -m k2.version` should tell you whether k2 was built in Release mode or in Debug mode.
It shows `Build type: Release`.
OK. When was the code pulled? There may have been speed improvements.
Pulled on 2021/01/06.
OK, probably no speed optimizations since then.
This pull request, https://github.com/k2-fsa/k2/pull/571#issuecomment-755888081, merged on Jan 8, made `GetTransposeReordering` 2-3x faster than before. Not sure how it would affect the training speed.
Tried with the latest k2; the training time is similar. The previous training time was 17.78 min and the latest is 17.68 min.
Has anyone compared the performance of the k2 CTC loss implementation and the CTCLoss in PyTorch?
I tried to write a K2CTCLoss with k2 to replace torch.nn.CTCLoss and did some experiments using ESPnet. It shows there is a gap between K2CTCLoss and torch.nn.CTCLoss.
The experiments are conducted on Librispeech 100h and the training criterion is CTC only. Acoustic model is BLSTM or Transformer based encoder. For CTC modeling unit, I tried char and bpe 5000. Here are some conclusions of my experiments:
- K2CTCLoss could work with a BLSTM-based acoustic model, though torch.nn.CTCLoss reduces the loss faster;
- K2CTCLoss didn't work with the Transformer. When using BPE 5000 as the CTC modeling unit, the loss curve of K2CTCLoss looks like: [loss curve image]. In comparison, torch.nn.CTCLoss with the Transformer looks like: [loss curve image];
- The above conclusions are the same whether the CTC modeling unit is char or BPE 5000.
In snowfall, the CTC implementation is (1) acoustic feature -> phone -> word. I did an experiment using the K2CTCLoss with a (2) acoustic feature -> char structure, and the WERs are (1) 12.84% and (2) 15.99%, respectively. So I think the K2CTCLoss implementation should be fine.
Could anyone give me some advice on how to make it work better? And does anyone know why it can't work well with transformer? Thanks!