k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

inf pruned_loss and high simple_loss in middle of training #569

Open is2022 opened 2 years ago

is2022 commented 2 years ago

I'm training an asr model using conv_emformer_transducer_stateless2 with my own data. In the middle of training I get the following error:

simple_loss: 169.94381713867188
pruned_loss: inf

Here is some info about the batch that caused the problem:

features shape: torch.Size([9, 100, 80])
num tokens: 9
'num_frames': tensor([100, 100, 100, 100, 100, 100, 100, 100, 100])

Could you please help me figure out what is causing this?

danpovey commented 2 years ago

It would be helpful to show some minibatches up to that point, i.e. some log or screen output preceding it.

is2022 commented 2 years ago

Here is the log before the error:

2022-09-13 00:21:30,784 INFO [train_sh.py:890] Epoch 4, batch 8350, loss[loss=0.3493, simple_loss=0.3172, pruned_loss=0.1907, over 1959.00 frames. utt_duration=872 frames, utt_pad_proportion=0.06438, over 9.00 utterances.], tot_loss[loss=0.3648, simple_loss=0.3565, pruned_loss=0.1865, over 388576.36 frames. utt_duration=865.1 frames, utt_pad_proportion=0.03844, over 1800.00 utterances.], batch size: 9, lr: 6.16e-04

2022-09-13 00:21:48,620 INFO [train_sh.py:890] Epoch 4, batch 8400, loss[loss=0.2783, simple_loss=0.3049, pruned_loss=0.1259, over 1993.00 frames. utt_duration=887.1 frames, utt_pad_proportion=0.04096, over 9.00 utterances.], tot_loss[loss=0.3644, simple_loss=0.3562, pruned_loss=0.1863, over 391155.60 frames. utt_duration=870.9 frames, utt_pad_proportion=0.03962, over 1800.00 utterances.], batch size: 9, lr: 6.16e-04

2022-09-13 00:22:06,588 INFO [train_sh.py:890] Epoch 4, batch 8450, loss[loss=0.3974, simple_loss=0.3704, pruned_loss=0.2122, over 2035.00 frames. utt_duration=906.1 frames, utt_pad_proportion=0.04519, over 9.00 utterances.], tot_loss[loss=0.3623, simple_loss=0.3535, pruned_loss=0.1856, over 394041.68 frames. utt_duration=877.3 frames, utt_pad_proportion=0.04084, over 1800.00 utterances.], batch size: 9, lr: 6.16e-04

2022-09-13 00:22:25,043 INFO [train_sh.py:890] Epoch 4, batch 8500, loss[loss=0.3629, simple_loss=0.3667, pruned_loss=0.1795, over 2020.00 frames. utt_duration=899 frames, utt_pad_proportion=0.07128, over 9.00 utterances.], tot_loss[loss=0.3597, simple_loss=0.352, pruned_loss=0.1838, over 396593.07 frames. utt_duration=882.9 frames, utt_pad_proportion=0.04316, over 1800.00 utterances.], batch size: 9, lr: 6.16e-04

2022-09-13 00:22:43,355 INFO [train_sh.py:890] Epoch 4, batch 8550, loss[loss=0.3937, simple_loss=0.368, pruned_loss=0.2097, over 2027.00 frames. utt_duration=902.9 frames, utt_pad_proportion=0.0768, over 9.00 utterances.], tot_loss[loss=0.3544, simple_loss=0.3483, pruned_loss=0.1802, over 399820.99 frames. utt_duration=890.1 frames, utt_pad_proportion=0.04512, over 1800.00 utterances.], batch size: 9, lr: 6.16e-04

2022-09-13 00:23:08,916 INFO [train_sh.py:890] Epoch 4, batch 8600, loss[loss=0.384, simple_loss=0.3598, pruned_loss=0.2041, over 2102.00 frames. utt_duration=935.6 frames, utt_pad_proportion=0.05308, over 9.00 utterances.], tot_loss[loss=0.3573, simple_loss=0.3504, pruned_loss=0.1821, over 400936.57 frames. utt_duration=892.6 frames, utt_pad_proportion=0.04789, over 1800.00 utterances.], batch size: 9, lr: 6.16e-04

is2022 commented 2 years ago

Any other info is needed?

csukuangfj commented 2 years ago

I'm training an asr model using conv_emformer_transducer_stateless2 with my own data.

In the middle of training I get the following error:

**simple_loss: 169.94381713867188

pruned_loss: inf**

Here is some info about the batch that caused the problem:

**features shape: torch.Size([9, 100, 80])

num tokens: 9

'num_frames': tensor([100, 100, 100, 100, 100, 100, 100, 100, 100])**

Could you please help me figure out what is causing this?

Your batch size is 9 while your number of tokens is also 9, which means each utterance has only 1 token on average.

Could you print out the tokens of that batch?
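
For illustration, a minimal sketch of how the tokens of a suspicious batch could be dumped, assuming the usual lhotse/icefall batch layout (`batch["supervisions"]["text"]`) and a BPE model at a hypothetical path; neither the helper name nor the path is taken from this thread:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe_500/bpe.model")  # hypothetical path to the BPE model

def dump_batch_tokens(batch: dict) -> None:
    # Print the token count and token IDs for every utterance in the batch.
    texts = batch["supervisions"]["text"]
    token_ids = sp.encode(texts, out_type=int)
    for text, ids in zip(texts, token_ids):
        print(f"{len(ids):3d} tokens: {ids}  <-  {text!r}")
    print("total tokens:", sum(len(ids) for ids in token_ids))
```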

is2022 commented 2 years ago

@csukuangfj Is this what you mean: (Under supervisions) 'text': ['yeah', 'yeah', 'right', 'yes', 'yes', 'okay', 'mhm', 'really', 'yeah']

csukuangfj commented 2 years ago

@csukuangfj Is this what you mean: (Under supervisions) 'text': ['yeah', 'yeah', 'right', 'yes', 'yes', 'okay', 'mhm', 'really', 'yeah']

Are you using the latest k2?

There is a similar issue https://github.com/danpovey/fast_rnnt/issues/10#issuecomment-1176999921 but it has been fixed in the latest master of k2.

is2022 commented 2 years ago

My version is: 'env_info': {'k2-version': '1.17', ..., 'k2-git-date': 'Mon Jul 25 02:11:54 2022', ...}, so I need to update my k2 to get the fix you mentioned, right?

csukuangfj commented 2 years ago

Yes, I think so.

is2022 commented 2 years ago

@csukuangfj Thank you very much, the fix that you proposed seems to work.

I still see some peculiar behaviour while training. The loss starts at 0.8 and goes down to 0.5, but then suddenly jumps from around 0.5 to 1.2 and doesn't seem to come down (I'm in the middle of epoch 6 now). Here are parts of the log (running on 4 GPUs):

2022-10-03 21:05:24,891 INFO [train.py:908] (1/4) Epoch 1, batch 5950, loss[loss=0.4641, simple_loss=0.7553, pruned_loss=0.8645, over 2402.00 frames. utt_duration=1069 frames, utt_pad_proportion=0.04948, over 9.00 utterances.], tot_loss[loss=0.4736, simple_loss=0.7719, pruned_loss=0.8763, over 468996.04 frames. utt_duration=1044 frames, utt_pad_proportion=0.05271, over 1800.00 utterances.], batch size: 9, lr: 2.41e-03
2022-10-03 21:05:24,896 INFO [train.py:908] (0/4) Epoch 1, batch 5950, loss[loss=0.4962, simple_loss=0.8107, pruned_loss=0.909, over 2408.00 frames. utt_duration=1071 frames, utt_pad_proportion=0.04855, over 9.00 utterances.], tot_loss[loss=0.4729, simple_loss=0.7707, pruned_loss=0.8749, over 469162.56 frames. utt_duration=1044 frames, utt_pad_proportion=0.05162, over 1800.00 utterances.], batch size: 9, lr: 2.41e-03
2022-10-03 21:05:24,909 INFO [train.py:908] (3/4) Epoch 1, batch 5950, loss[loss=0.4347, simple_loss=0.7102, pruned_loss=0.7959, over 2288.00 frames. utt_duration=1018 frames, utt_pad_proportion=0.07863, over 9.00 utterances.], tot_loss[loss=0.4738, simple_loss=0.7721, pruned_loss=0.8775, over 469333.97 frames. utt_duration=1044 frames, utt_pad_proportion=0.05211, over 1800.00 utterances.], batch size: 9, lr: 2.41e-03
2022-10-03 21:05:24,936 INFO [train.py:908] (2/4) Epoch 1, batch 5950, loss[loss=0.5209, simple_loss=0.8476, pruned_loss=0.9707, over 2461.00 frames. utt_duration=1095 frames, utt_pad_proportion=0.0258, over 9.00 utterances.], tot_loss[loss=0.4745, simple_loss=0.7733, pruned_loss=0.8782, over 470359.41 frames. utt_duration=1047 frames, utt_pad_proportion=0.0498, over 1800.00 utterances.], batch size: 9, lr: 2.41e-03
2022-10-03 21:05:45,267 INFO [train.py:908] (1/4) Epoch 1, batch 6000, loss[loss=1.251, simple_loss=0.7646, pruned_loss=0.8683, over 2366.00 frames. utt_duration=1052 frames, utt_pad_proportion=0.05386, over 9.00 utterances.], tot_loss[loss=0.4789, simple_loss=0.7681, pruned_loss=0.8727, over 471619.94 frames. utt_duration=1050 frames, utt_pad_proportion=0.05169, over 1800.00 utterances.], batch size: 9, lr: 2.40e-03
2022-10-03 21:05:45,276 INFO [train.py:908] (0/4) Epoch 1, batch 6000, loss[loss=1.171, simple_loss=0.7065, pruned_loss=0.8179, over 2402.00 frames. utt_duration=1069 frames, utt_pad_proportion=0.05472, over 9.00 utterances.], tot_loss[loss=0.4827, simple_loss=0.7735, pruned_loss=0.8787, over 471513.11 frames. utt_duration=1049 frames, utt_pad_proportion=0.05129, over 1800.00 utterances.], batch size: 9, lr: 2.40e-03
2022-10-03 21:05:45,281 INFO [train.py:908] (2/4) Epoch 1, batch 6000, loss[loss=1.362, simple_loss=0.8276, pruned_loss=0.9481, over 2404.00 frames. utt_duration=1070 frames, utt_pad_proportion=0.05226, over 9.00 utterances.], tot_loss[loss=0.4817, simple_loss=0.772, pruned_loss=0.877, over 472468.66 frames. utt_duration=1051 frames, utt_pad_proportion=0.0496, over 1800.00 utterances.], batch size: 9, lr: 2.40e-03
2022-10-03 21:05:45,282 INFO [train.py:908] (3/4) Epoch 1, batch 6000, loss[loss=1.337, simple_loss=0.8068, pruned_loss=0.9335, over 2345.00 frames. utt_duration=1043 frames, utt_pad_proportion=0.07259, over 9.00 utterances.], tot_loss[loss=0.4818, simple_loss=0.772, pruned_loss=0.8775, over 471883.13 frames. utt_duration=1050 frames, utt_pad_proportion=0.05118, over 1800.00 utterances.], batch size: 9, lr: 2.40e-03
...
2022-10-04 14:07:27,325 INFO [train.py:908] (1/4) Epoch 6, batch 22900, loss[loss=1.173, simple_loss=0.7044, pruned_loss=0.8209, over 3762.00 frames. utt_duration=1673 frames, utt_pad_proportion=0.02726, over 9.00 utterances.], tot_loss[loss=1.205, simple_loss=0.732, pruned_loss=0.8392, over 735485.36 frames. utt_duration=1636 frames, utt_pad_proportion=0.0225, over 1800.00 utterances.], batch size: 9, lr: 4.97e-04
2022-10-04 14:07:27,327 INFO [train.py:908] (0/4) Epoch 6, batch 22900, loss[loss=1.262, simple_loss=0.7564, pruned_loss=0.8837, over 3712.00 frames. utt_duration=1651 frames, utt_pad_proportion=0.02423, over 9.00 utterances.], tot_loss[loss=1.203, simple_loss=0.7312, pruned_loss=0.8379, over 735500.68 frames. utt_duration=1636 frames, utt_pad_proportion=0.02201, over 1800.00 utterances.], batch size: 9, lr: 4.97e-04
2022-10-04 14:07:27,332 INFO [train.py:908] (2/4) Epoch 6, batch 22900, loss[loss=1.278, simple_loss=0.7793, pruned_loss=0.8886, over 3742.00 frames. utt_duration=1665 frames, utt_pad_proportion=0.02366, over 9.00 utterances.], tot_loss[loss=1.195, simple_loss=0.7261, pruned_loss=0.8319, over 735215.91 frames. utt_duration=1635 frames, utt_pad_proportion=0.02364, over 1800.00 utterances.], batch size: 9, lr: 4.97e-04
2022-10-04 14:07:27,337 INFO [train.py:908] (3/4) Epoch 6, batch 22900, loss[loss=1.158, simple_loss=0.7024, pruned_loss=0.8067, over 3694.00 frames. utt_duration=1643 frames, utt_pad_proportion=0.0294, over 9.00 utterances.], tot_loss[loss=1.195, simple_loss=0.7264, pruned_loss=0.8322, over 735178.99 frames. utt_duration=1635 frames, utt_pad_proportion=0.02172, over 1800.00 utterances.], batch size: 9, lr: 4.97e-04

danpovey commented 2 years ago

The jump is likely from when we add the pruned loss into the loss function. If the simple_loss never goes below around 0.5, then it has failed to learn the data alignments; there is no point training beyond about 5000 batches if this is the case. You could try reducing the learning rate a bit. In future our recipes will be more robust in this respect.
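
For context, the pruned-transducer recipes combine the two losses with a warm-up on the pruned term, which is why the logged total loss can jump when that term is switched on. A rough sketch of the idea; the scales and thresholds below are illustrative and differ between recipes:

```python
def combined_loss(simple_loss: float, pruned_loss: float, warmup: float,
                  simple_loss_scale: float = 0.5) -> float:
    # warmup grows with the number of processed batches; the pruned term is
    # (nearly) excluded at first and then phased in, which shows up as a
    # sudden increase in the logged loss.
    pruned_loss_scale = 0.0 if warmup < 1.0 else (0.1 if warmup < 2.0 else 1.0)
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss
```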

is2022 commented 1 year ago

@csukuangfj Unfortunately, I'm still getting the inf loss after a few epochs (this time epoch 3!). Here is the log:

2022-10-23 14:53:29,352 INFO [train.py:907] (0/8) Epoch 3, batch 115350, loss[loss=0.2266, simple_loss=0.245, pruned_loss=0.1042, over 493.00 frames. utt_duration=220.1 frames, utt_pad_proportion=0.3638, over 9.00 utterances.], tot_loss[loss=0.2179, simple_loss=0.2714, pruned_loss=0.08219, over 177822.05 frames. utt_duration=396.7 frames, utt_pad_proportion=0.3776, over 1800.00 utterances.], batch size: 9, lr: 3.41e-04
2022-10-23 14:53:29,354 INFO [train.py:907] (6/8) Epoch 3, batch 115350, loss[loss=0.2765, simple_loss=0.2746, pruned_loss=0.1391, over 662.00 frames. utt_duration=295.6 frames, utt_pad_proportion=0.4527, over 9.00 utterances.], tot_loss[loss=0.2232, simple_loss=0.2758, pruned_loss=0.08527, over 189134.61 frames. utt_duration=421.8 frames, utt_pad_proportion=0.3614, over 1800.00 utterances.], batch size: 9, lr: 3.41e-04
2022-10-23 14:53:29,354 INFO [train.py:907] (4/8) Epoch 3, batch 115350, loss[loss=0.2727, simple_loss=0.2978, pruned_loss=0.1238, over 421.00 frames. utt_duration=189.1 frames, utt_pad_proportion=0.2944, over 9.00 utterances.], tot_loss[loss=0.2236, simple_loss=0.2764, pruned_loss=0.08547, over 179088.71 frames. utt_duration=399.5 frames, utt_pad_proportion=0.3786, over 1800.00 utterances.], batch size: 9, lr: 3.41e-04
2022-10-23 14:53:29,373 INFO [train.py:907] (1/8) Epoch 3, batch 115350, loss[loss=0.2841, simple_loss=0.3477, pruned_loss=0.1103, over 2988.00 frames. utt_duration=1330 frames, utt_pad_proportion=0.01115, over 9.00 utterances.], tot_loss[loss=0.22, simple_loss=0.2727, pruned_loss=0.08362, over 182204.09 frames. utt_duration=406.4 frames, utt_pad_proportion=0.3742, over 1800.00 utterances.], batch size: 9, lr: 3.41e-04
2022-10-23 14:53:41,095 INFO [train.py:715] (0/8) Not all losses are finite! simple_loss: inf pruned_loss: inf
2022-10-23 14:53:41,095 INFO [utils.py:968] (0/8) Saving batch to conv_emformer_transducer_stateless2/exp/batch-b39cfd4b-8abe-ad78-8520-10116895cea8.pt
2022-10-23 14:53:41,097 INFO [utils.py:974] (0/8) features shape: torch.Size([9, 422, 80])
2022-10-23 14:53:41,098 INFO [utils.py:978] (0/8) num tokens: 88
2022-10-23 14:53:41,100 INFO [train.py:715] (3/8) Not all losses are finite! simple_loss: inf pruned_loss: inf
2022-10-23 14:53:41,100 INFO [utils.py:968] (3/8) Saving batch to conv_emformer_transducer_stateless2/exp/batch-b39cfd4b-8abe-ad78-8520-10116895cea8.pt
2022-10-23 14:53:41,104 INFO [train.py:715] (2/8) Not all losses are finite! simple_loss: inf pruned_loss: inf
2022-10-23 14:53:41,104 INFO [utils.py:968] (2/8) Saving batch to conv_emformer_transducer_stateless2/exp/batch-b39cfd4b-8abe-ad78-8520-10116895cea8.pt
2022-10-23 14:53:41,104 INFO [utils.py:974] (3/8) features shape: torch.Size([9, 497, 80])
2022-10-23 14:53:41,105 INFO [utils.py:978] (3/8) num tokens: 88
2022-10-23 14:53:41,119 INFO [train.py:715] (6/8) Not all losses are finite! simple_loss: inf pruned_loss: inf
2022-10-23 14:53:41,119 INFO [utils.py:968] (6/8) Saving batch to conv_emformer_transducer_stateless2/exp/batch-b39cfd4b-8abe-ad78-8520-10116895cea8.pt
2022-10-23 14:53:41,126 INFO [train.py:715] (5/8) Not all losses are finite! simple_loss: inf pruned_loss: inf
2022-10-23 14:53:41,126 INFO [utils.py:968] (5/8) Saving batch to conv_emformer_transducer_stateless2/exp/batch-b39cfd4b-8abe-ad78-8520-10116895cea8.pt
2022-10-23 14:53:41,146 INFO [train.py:715] (1/8) Not all losses are finite! simple_loss: inf pruned_loss: inf
2022-10-23 14:53:41,147 INFO [utils.py:968] (1/8) Saving batch to conv_emformer_transducer_stateless2/exp/batch-b39cfd4b-8abe-ad78-8520-10116895cea8.pt
2022-10-23 14:53:41,155 INFO [train.py:715] (4/8) Not all losses are finite! simple_loss: inf pruned_loss: inf
2022-10-23 14:53:41,156 INFO [utils.py:968] (4/8) Saving batch to conv_emformer_transducer_stateless2/exp/batch-b39cfd4b-8abe-ad78-8520-10116895cea8.pt
2022-10-23 14:53:41,209 INFO [train.py:715] (7/8) Not all losses are finite! simple_loss: inf pruned_loss: inf
2022-10-23 14:53:41,209 INFO [utils.py:968] (7/8) Saving batch to conv_emformer_transducer_stateless2/exp/batch-b39cfd4b-8abe-ad78-8520-10116895cea8.pt
2022-10-23 14:53:41,558 INFO [utils.py:974] (2/8) features shape: torch.Size([9, 547, 80])
2022-10-23 14:53:41,559 INFO [utils.py:974] (6/8) features shape: torch.Size([9, 457, 80])
2022-10-23 14:53:41,559 INFO [utils.py:978] (2/8) num tokens: 118
2022-10-23 14:53:41,562 INFO [utils.py:978] (6/8) num tokens: 97
2022-10-23 14:53:41,590 INFO [utils.py:974] (5/8) features shape: torch.Size([9, 514, 80])
2022-10-23 14:53:41,591 INFO [utils.py:978] (5/8) num tokens: 138
2022-10-23 14:53:41,595 INFO [utils.py:974] (4/8) features shape: torch.Size([9, 950, 80])
2022-10-23 14:53:41,595 INFO [utils.py:974] (1/8) features shape: torch.Size([9, 775, 80])
2022-10-23 14:53:41,595 INFO [utils.py:974] (7/8) features shape: torch.Size([9, 1557, 80])
2022-10-23 14:53:41,596 INFO [utils.py:978] (1/8) num tokens: 306
2022-10-23 14:53:41,597 INFO [utils.py:978] (4/8) num tokens: 439
2022-10-23 14:53:41,597 INFO [utils.py:978] (7/8) num tokens: 742

csukuangfj commented 1 year ago

Saving batch to conv_emformer_transducer_stateless2/exp/batch-b39cfd4b-8abe-ad78-8520-10116895cea8.pt

Could you upload the above file?
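
For reference, the saved batch should just be a dict written with torch.save, so it can be inspected locally along these lines, assuming the usual lhotse batch layout with "inputs" and "supervisions" keys (the filename below is only an example):

```python
import torch

batch = torch.load(
    "conv_emformer_transducer_stateless2/exp/"
    "batch-b39cfd4b-8abe-ad78-8520-10116895cea8.pt"
)
print(batch["inputs"].shape)                # features, e.g. (9, 422, 80)
print(batch["supervisions"]["text"])        # transcripts
print(batch["supervisions"]["num_frames"])  # frames per utterance
```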

is2022 commented 1 year ago

It seems that it was not produced! Here is the list of files in the "egs/librispeech/ASR/conv_emformer_transducer_stateless2/exp" folder:

-rw-r--r--. 1 root root 1208562503 Oct 24 12:11 best-train-loss.pt
-rw-r--r--. 1 root root 1208562503 Oct 24 12:11 best-valid-loss.pt
-rw-r--r--. 1 root root 1208562247 Oct 23 12:48 checkpoint-248000.pt
-rw-r--r--. 1 root root 1208562247 Oct 23 13:59 checkpoint-256000.pt
-rw-r--r--. 1 root root 1208562311 Oct 23 15:09 checkpoint-264000.pt
-rw-r--r--. 1 root root 1208562311 Oct 23 16:21 checkpoint-272000.pt
-rw-r--r--. 1 root root 1208562311 Oct 23 17:33 checkpoint-280000.pt
-rw-r--r--. 1 root root 1208562311 Oct 23 18:44 checkpoint-288000.pt
-rw-r--r--. 1 root root 1208562311 Oct 23 19:54 checkpoint-296000.pt
-rw-r--r--. 1 root root 1208562311 Oct 23 21:05 checkpoint-304000.pt
-rw-r--r--. 1 root root 1208562375 Oct 23 22:16 checkpoint-312000.pt
-rw-r--r--. 1 root root 1208562375 Oct 23 23:27 checkpoint-320000.pt
-rw-r--r--. 1 root root 1208562439 Oct 24 00:37 checkpoint-328000.pt
-rw-r--r--. 1 root root 1208562439 Oct 24 01:48 checkpoint-336000.pt
-rw-r--r--. 1 root root 1208562439 Oct 24 02:59 checkpoint-344000.pt
-rw-r--r--. 1 root root 1208562503 Oct 24 04:08 checkpoint-352000.pt
-rw-r--r--. 1 root root 1208562503 Oct 24 05:20 checkpoint-360000.pt
-rw-r--r--. 1 root root 1208562503 Oct 24 06:34 checkpoint-368000.pt
-rw-r--r--. 1 root root 1208562503 Oct 24 07:46 checkpoint-376000.pt
-rw-r--r--. 1 root root 1208562503 Oct 24 09:51 checkpoint-384000.pt
-rw-r--r--. 1 root root 1208562503 Oct 24 11:17 checkpoint-392000.pt
-rw-r--r--. 1 root root 1208562567 Oct 24 12:27 checkpoint-400000.pt
-rw-r--r--. 1 root root 1208560647 Oct 21 22:35 epoch-1.pt
-rw-r--r--. 1 root root 1208562183 Oct 23 10:46 epoch-10.pt
-rw-r--r--. 1 root root 1208562183 Oct 23 14:13 epoch-11.pt
-rw-r--r--. 1 root root 1208562247 Oct 23 17:42 epoch-12.pt
-rw-r--r--. 1 root root 1208562311 Oct 23 21:09 epoch-13.pt
-rw-r--r--. 1 root root 1208562375 Oct 24 00:36 epoch-14.pt
-rw-r--r--. 1 root root 1208562439 Oct 24 04:03 epoch-15.pt
-rw-r--r--. 1 root root 1208562439 Oct 24 07:35 epoch-16.pt
-rw-r--r--. 1 root root 1208562503 Oct 24 12:11 epoch-17.pt
-rw-r--r--. 1 root root 1208560711 Oct 22 02:09 epoch-2.pt
-rw-r--r--. 1 root root 1208561799 Oct 22 05:36 epoch-3.pt
-rw-r--r--. 1 root root 1208561863 Oct 22 09:07 epoch-4.pt
-rw-r--r--. 1 root root 1208561927 Oct 22 12:34 epoch-5.pt
-rw-r--r--. 1 root root 1208561991 Oct 22 16:02 epoch-6.pt
-rw-r--r--. 1 root root 1208561991 Oct 22 19:31 epoch-7.pt
-rw-r--r--. 1 root root 1208562055 Oct 23 03:47 epoch-8.pt
-rw-r--r--. 1 root root 1208562119 Oct 23 07:17 epoch-9.pt

csukuangfj commented 1 year ago

It seems that it was not produced!

That is strange.

The code at https://github.com/k2-fsa/icefall/blob/499ac24ecba64f687ff244c7d66baa5c222ecf0f/icefall/utils.py#L967-L969 saves it to a .pt file.

Could you change the filename by following https://github.com/k2-fsa/icefall/blob/499ac24ecba64f687ff244c7d66baa5c222ecf0f/icefall/utils.py#L126-L135?

You can embed the rank and world size into the filename.
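
A sketch of the kind of change being suggested; the function name and signature below are made up for illustration and are not the actual icefall helper:

```python
import uuid
import torch

def save_bad_batch(batch: dict, exp_dir: str, rank: int, world_size: int) -> str:
    # Embed the DDP rank and world size in the filename so each process
    # writes its own file instead of all ranks reusing one name.
    filename = f"{exp_dir}/batch-{uuid.uuid4()}-{rank}-{world_size}.pt"
    torch.save(batch, filename)
    return filename
```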

is2022 commented 1 year ago

I restarted the training from the start of epoch 3 ("--start-epoch 3") and it is in the middle of epoch 6 now. It seems very peculiar to me that it got an inf loss once, and has now gotten past that point!

is2022 commented 1 year ago

@csukuangfj This time I got the inf loss error in epoch 7.

2022-10-29 05:41:34,234 INFO [train_sh.py:907] (3/8) Epoch 7, batch 124300, loss[loss=0.1475, simple_loss=0.2006, pruned_loss=0.04714, over 2279.00 frames. utt_duration=1014 frames, utt_pad_proportion=0.0452, over 9.00 utterances.], tot_loss[loss=0.1652, simple_loss=0.2088, pruned_loss=0.06074, over 313370.06 frames. utt_duration=697.9 frames, utt_pad_proportion=0.1898, over 1800.00 utterances.], batch size: 9, lr: 1.90e-04
2022-10-29 05:41:34,244 INFO [train_sh.py:907] (4/8) Epoch 7, batch 124300, loss[loss=0.1259, simple_loss=0.1589, pruned_loss=0.04648, over 2457.00 frames. utt_duration=1093 frames, utt_pad_proportion=0.03746, over 9.00 utterances.], tot_loss[loss=0.1631, simple_loss=0.2059, pruned_loss=0.06016, over 296834.87 frames. utt_duration=661.1 frames, utt_pad_proportion=0.1935, over 1800.00 utterances.], batch size: 9, lr: 1.90e-04
2022-10-29 05:41:52,398 INFO [train_sh.py:715] (0/8) Not all losses are finite! simple_loss: inf pruned_loss: inf
2022-10-29 05:41:52,399 INFO [utils.py:978] (0/8) Saving batch to conv_emformer_transducer_stateless2/exp/batch-8e91579a-21c3-a39e-50c1-91728c541241-0-8.pt
2022-10-29 05:41:52,402 INFO [utils.py:984] (0/8) features shape: torch.Size([9, 533, 80])
2022-10-29 05:41:52,402 INFO [utils.py:988] (0/8) num tokens: 155

Here is the supervision data of the batch:

'supervisions': {'text': ['OKAY ARE YOU ABLE TO TELL ME THE PRICE DIFFERENCE BETWEEN THE LIQUID FORM AND THE PAD', 'OKAY CAUSE OKAY CAUSE I WANTED A WAY THAT I COULD GO THROUGH SLOWLY IF I NEEDED TO', 'OKAY GREAT NOW COULD I GET YOU TO VERIFY YOUR DATE OF BIRTH PLEASE', 'AWESOME OKAY AND THEN DOES THAT WEBSITE HAVE COUPONS', 'OKAY WHAT WOULD BE THE NEAREST BRANCH TO ME THEN', 'NO THANK YOU AND COULD I FURTHER ASSIST YOU WITH ANYTHING', 'YEAH SURE SURE BYE', 'PERFECT THANK YOU SO MUCH FOR YOUR HELP', 'IT IS AETNA'], 'sequence_idx': tensor([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=torch.int32), 'start_frame': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.int32), 'num_frames': tensor([533, 525, 523, 380, 351, 280, 230, 204, 102], dtype=torch.int32)}

csukuangfj commented 1 year ago

The batch that causes the inf losses looks normal. I am not sure what is happening. Maybe @danpovey has more insight into it.

is2022 commented 1 year ago

One thing that I'm confused about is that I have "--max-duration 280" but for this batch the total duration is more than 31 secs. Shouldn't it be less than 28 secs?

csukuangfj commented 1 year ago

but for this batch the total duration is more than 31 secs

--max-duration specifies the maximum value of the sum of the durations of all utterances in a batch. If there are N-1 utterances in the batch and the sum of their durations is less than --max-duration, another utterance is added to the batch; once the sum exceeds this value, it stops adding utterances. So a batch can exceed --max-duration by up to the duration of the last utterance added. A simplified sketch of that accumulation is shown below.
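
The following is only an illustrative sketch; the real logic lives in Lhotse's samplers, not in this snippet:

```python
from typing import Iterable, Iterator, List

def batch_by_duration(durations: Iterable[float],
                      max_duration: float) -> Iterator[List[float]]:
    batch: List[float] = []
    total = 0.0
    for d in durations:
        batch.append(d)
        total += d
        if total >= max_duration:  # the utterance that crosses the limit stays in
            yield batch
            batch, total = [], 0.0
    if batch:
        yield batch

# Example: with a 28 s limit, three utterances of 10, 12 and 9 seconds still
# end up in one batch whose total duration is 31 s.
print(list(batch_by_duration([10.0, 12.0, 9.0], max_duration=28.0)))
```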

danpovey commented 1 year ago

@csukuangfj in the new version of the scripts that we are working on, you may notice a new "--inf-check" option that can locate where the inf came from. That can easily be backported to other recipes.

In half-precision training (--use-fp16=True) it can be easy to get infinities; they usually come from the attention module, from scores outside the half-precision range.
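
As a rough illustration of what such an inf check can do, a generic PyTorch sketch (not the actual icefall --inf-check implementation) that flags the first layer whose output becomes non-finite:

```python
import torch
import torch.nn as nn

def register_inf_check_hooks(model: nn.Module) -> None:
    # Warn whenever a module produces a non-finite output, so the layer
    # where the inf/nan first appears can be identified.
    def make_hook(name: str):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"non-finite values in the output of {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```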

is2022 commented 1 year ago

Just a quick note: in my script I haven't enabled half-precision training.