oshindow opened this issue 2 years ago
default model: max-duration = 200
log:
...
2022-11-25 04:28:40,033 INFO [train.py:512] (3/4) Epoch 7, batch 1850, loss[ctc_loss=0.1082, att_loss=0.2322, loss=0.195, over 4796.00 frames. ], tot_loss[ctc_loss=0.1164, att_loss=0.239, loss=0.2022, over 956037.74 frames. ], batch size: 29
2022-11-25 04:29:14,264 INFO [train.py:512] (3/4) Epoch 7, batch 1900, loss[ctc_loss=0.118, att_loss=0.249, loss=0.2097, over 4843.00 frames. ], tot_loss[ctc_loss=0.1163, att_loss=0.2389, loss=0.2021, over 954733.93 frames. ], batch size: 37
2022-11-25 04:29:47,567 INFO [train.py:512] (3/4) Epoch 7, batch 1950, loss[ctc_loss=0.09974, att_loss=0.2296, loss=0.1907, over 4856.00 frames. ], tot_loss[ctc_loss=0.1157, att_loss=0.2388, loss=0.2019, over 957035.01 frames. ], batch size: 37
2022-11-25 04:30:19,513 INFO [train.py:512] (3/4) Epoch 7, batch 2000, loss[ctc_loss=0.1163, att_loss=0.2486, loss=0.2089, over 4832.00 frames. ], tot_loss[ctc_loss=0.1159, att_loss=0.2383, loss=0.2016, over 954000.42 frames. ], batch size: 51
2022-11-25 04:30:19,514 INFO [train.py:529] (3/4) Computing validation loss
2022-11-25 04:30:44,486 INFO [train.py:538] (3/4) Epoch 7, validation: ctc_loss=0.05068, att_loss=0.211, loss=0.1629, over 1622729.00 frames.
2022-11-25 04:31:15,483 INFO [train.py:512] (3/4) Epoch 7, batch 2050, loss[ctc_loss=0.0982, att_loss=0.2067, loss=0.1741, over 4744.00 frames. ], tot_loss[ctc_loss=0.1151, att_loss=0.2374, loss=0.2007, over 954119.93 frames. ], batch size: 26
2022-11-25 04:31:47,114 INFO [train.py:512] (3/4) Epoch 7, batch 2100, loss[ctc_loss=0.1601, att_loss=0.2742, loss=0.24, over 4878.00 frames. ], tot_loss[ctc_loss=0.1149, att_loss=0.2379, loss=0.201, over 954653.91 frames. ], batch size: 43
2022-11-25 04:32:44,022 INFO [train.py:512] (3/4) Epoch 8, batch 0, loss[ctc_loss=0.1279, att_loss=0.2415, loss=0.2074, over 4861.00 frames. ], tot_loss[ctc_loss=0.1279, att_loss=0.2415, loss=0.2074, over 4861.00 frames. ], batch size: 57
...
linear input layer: max-duration = 100
self.encoder_embed = nn.Linear(num_features, d_model)
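For context, a minimal self-contained sketch of such a linear frontend; the class name LinearEmbed and the shape comments are illustrative, not from the original post:

import torch
import torch.nn as nn

class LinearEmbed(nn.Module):
    """Project (N, T, num_features) to (N, T, d_model) with no subsampling."""

    def __init__(self, num_features: int, d_model: int) -> None:
        super().__init__()
        self.linear = nn.Linear(num_features, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # T is unchanged: one output frame per input frame, which is why
        # subsampling_factor must be set to 1 everywhere else.
        return self.linear(x)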
log:
...
2022-11-25 08:04:00,335 INFO [train.py:512] (3/4) Epoch 1, batch 1850, loss[ctc_loss=0.1954, att_loss=0.1571, loss=0.1686, over 9806.00 frames. ], tot_loss[ctc_loss=0.196, att_loss=0.1606, loss=0.1712, over 1894948.66 frames. ], batch size: 21
2022-11-25 08:04:42,968 INFO [train.py:512] (3/4) Epoch 1, batch 1900, loss[ctc_loss=0.1863, att_loss=0.1531, loss=0.1631, over 9688.00 frames. ], tot_loss[ctc_loss=0.1955, att_loss=0.1601, loss=0.1707, over 1896855.38 frames. ], batch size: 19
2022-11-25 08:05:27,511 INFO [train.py:512] (3/4) Epoch 1, batch 1950, loss[ctc_loss=0.2106, att_loss=0.165, loss=0.1787, over 9540.00 frames. ], tot_loss[ctc_loss=0.1948, att_loss=0.1595, loss=0.1701, over 1895997.71 frames. ], batch size: 26
2022-11-25 08:06:11,834 INFO [train.py:512] (3/4) Epoch 1, batch 2000, loss[ctc_loss=0.2044, att_loss=0.171, loss=0.181, over 9832.00 frames. ], tot_loss[ctc_loss=0.1945, att_loss=0.1591, loss=0.1697, over 1897898.38 frames. ], batch size: 21
2022-11-25 08:06:11,834 INFO [train.py:529] (3/4) Computing validation loss
2022-11-25 08:07:06,694 INFO [train.py:538] (3/4) Epoch 1, validation: ctc_loss=0.1893, att_loss=0.1604, loss=0.1691, over 6512315.00 frames.
...
2022-11-25 08:39:03,480 INFO [train.py:512] (3/4) Epoch 1, batch 4150, loss[ctc_loss=0.2197, att_loss=0.1684, loss=0.1838, over 9434.00 frames. ], tot_loss[ctc_loss=0.1898, att_loss=0.15, loss=0.1619, over 1898860.84 frames. ], batch size: 19
2022-11-25 08:39:45,326 INFO [train.py:512] (3/4) Epoch 1, batch 4200, loss[ctc_loss=0.1869, att_loss=0.1434, loss=0.1564, over 9858.00 frames. ], tot_loss[ctc_loss=0.1903, att_loss=0.1503, loss=0.1623, over 1901405.51 frames. ], batch size: 23
2022-11-25 08:40:27,532 INFO [train.py:512] (3/4) Epoch 1, batch 4250, loss[ctc_loss=0.2009, att_loss=0.1585, loss=0.1712, over 9709.00 frames. ], tot_loss[ctc_loss=0.1898, att_loss=0.1497, loss=0.1617, over 1901266.65 frames. ], batch size: 24
2022-11-25 08:41:11,178 INFO [train.py:512] (3/4) Epoch 1, batch 4300, loss[ctc_loss=0.1829, att_loss=0.143, loss=0.155, over 9773.00 frames. ], tot_loss[ctc_loss=0.1884, att_loss=0.1487, loss=0.1606, over 1887813.74 frames. ], batch size: 18
2022-11-25 08:41:56,916 INFO [train.py:512] (3/4) Epoch 2, batch 0, loss[ctc_loss=0.1938, att_loss=0.1513, loss=0.164, over 9707.00 frames. ], tot_loss[ctc_loss=0.1938, att_loss=0.1513, loss=0.164, over 9707.00 frames. ], batch size: 31
...
conv input layer: max-duration = 80
self.conv = nn.Sequential(
    nn.Conv2d(
        in_channels=1, out_channels=odim, kernel_size=3, stride=1, padding=1
    ),
    nn.ReLU(),
    nn.Conv2d(
        in_channels=odim, out_channels=odim, kernel_size=3, stride=1, padding=1
    ),
    nn.ReLU(),
)
self.out = nn.Linear(odim * idim, odim)
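For context, a sketch of how this stride-1 frontend could look as a complete module with its forward pass; the class name and shape comments are reconstructions, not the poster's exact code:

import torch
import torch.nn as nn

class NoSubsampling(nn.Module):
    """Conv frontend with stride=1, padding=1: both T and idim are preserved."""

    def __init__(self, idim: int, odim: int) -> None:
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=odim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels=odim, out_channels=odim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
        )
        # With stride=1 the frequency axis keeps all idim bins, hence
        # odim * idim below instead of odim * (((idim - 1) // 2 - 1) // 2).
        self.out = nn.Linear(odim * idim, odim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.unsqueeze(1)     # (N, T, idim) -> (N, 1, T, idim)
        x = self.conv(x)       # -> (N, odim, T, idim); T is not reduced
        n, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(n, t, c * f)  # -> (N, T, odim * idim)
        return self.out(x)     # -> (N, T, odim)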
log:
2022-11-27 10:33:11,172 INFO [train.py:512] (3/4) Epoch 2, batch 3850, loss[ctc_loss=0.1931, att_loss=0.147, loss=0.1608, over 7543.00 frames. ], tot_loss[ctc_loss=0.1802, att_loss=0.1353, loss=0.1487, over 1504628.66 frames. ], batch size: 23
2022-11-27 10:34:17,783 INFO [train.py:512] (3/4) Epoch 2, batch 3900, loss[ctc_loss=0.1929, att_loss=0.1412, loss=0.1568, over 7913.00 frames. ], tot_loss[ctc_loss=0.1799, att_loss=0.1347, loss=0.1482, over 1503819.94 frames. ], batch size: 16
2022-11-27 10:35:23,413 INFO [train.py:512] (3/4) Epoch 2, batch 3950, loss[ctc_loss=0.193, att_loss=0.1423, loss=0.1575, over 7530.00 frames. ], tot_loss[ctc_loss=0.1803, att_loss=0.135, loss=0.1486, over 1507065.92 frames. ], batch size: 23
2022-11-27 10:36:29,280 INFO [train.py:512] (3/4) Epoch 2, batch 4000, loss[ctc_loss=0.1497, att_loss=0.1091, loss=0.1213, over 7659.00 frames. ], tot_loss[ctc_loss=0.1805, att_loss=0.1352, loss=0.1488, over 1505438.38 frames. ], batch size: 11
2022-11-27 10:36:29,281 INFO [train.py:529] (3/4) Computing validation loss
2022-11-27 10:38:27,992 INFO [train.py:538] (3/4) Epoch 2, validation: ctc_loss=0.1758, att_loss=0.145, loss=0.1543, over 6512315.00 frames.
...
2022-11-27 11:07:12,591 INFO [train.py:512] (3/4) Epoch 2, batch 5300, loss[ctc_loss=0.1973, att_loss=0.1465, loss=0.1617, over 7707.00 frames. ], tot_loss[ctc_loss=0.1805, att_loss=0.1338, loss=0.1478, over 1503153.06 frames. ], batch size: 19
2022-11-27 11:08:17,623 INFO [train.py:512] (3/4) Epoch 2, batch 5350, loss[ctc_loss=0.167, att_loss=0.1244, loss=0.1372, over 7403.00 frames. ], tot_loss[ctc_loss=0.1789, att_loss=0.1325, loss=0.1464, over 1499715.62 frames. ], batch size: 12
2022-11-27 11:09:21,871 INFO [train.py:512] (3/4) Epoch 2, batch 5400, loss[ctc_loss=0.126, att_loss=0.09863, loss=0.1068, over 7099.00 frames. ], tot_loss[ctc_loss=0.1772, att_loss=0.1314, loss=0.1451, over 1490781.35 frames. ], batch size: 8
2022-11-27 11:10:54,249 INFO [train.py:512] (3/4) Epoch 3, batch 0, loss[ctc_loss=0.192, att_loss=0.1493, loss=0.1621, over 7537.00 frames. ], tot_loss[ctc_loss=0.192, att_loss=0.1493, loss=0.1621, over 7537.00 frames. ], batch size: 22
...
At epoch=1, the training total loss can be reduced to 0.14, and the validation total loss is 0.15, but the WER of decoding is 88.xx.
Do you also change the subsampling factor for decode.py?
Yes, I do. The decode params are as follows:
params = AttributeDict(
    {
        # parameters for conformer
        "subsampling_factor": 1,
        "feature_dim": 80,
        "nhead": 4,
        "attention_dim": 512,
        "num_encoder_layers": 12,
        "num_decoder_layers": 6,
        "vgg_frontend": False,
        "use_feat_batchnorm": True,
        # parameters for decoder
        "search_beam": 20,
        "output_beam": 10,
        "min_active_states": 30,
        "max_active_states": 10000,
        "use_double_scores": True,
        "env_info": get_env_info(),
    }
)
and the decode command is :
python conformer_ctc/decode.py \
    --nbest-scale 0.5 \
    --epoch 2 \
    --avg 1 \
    --method attention-decoder \
    --max-duration 20 \
    --num-paths 1 \
    --exp-dir conformer_ctc/exp_ss1_conv
> At epoch=1, the training total loss can be reduced to 0.14, and the validation total loss is 0.15, but the WER of decoding is 88.xx.

> Do you also change the subsampling factor for decode.py?
Could you please show the output of git diff so that we can see what exactly has been changed?
diff --git a/egs/aishell/ASR/conformer_ctc/decode.py b/egs/aishell/ASR/conformer_ctc/decode.py
index 751b7d5..f561d69 100755
--- a/egs/aishell/ASR/conformer_ctc/decode.py
+++ b/egs/aishell/ASR/conformer_ctc/decode.py
@@ -57,14 +57,14 @@ def get_parser():
parser.add_argument(
"--epoch",
type=int,
- default=49,
+ default=2,
help="It specifies the checkpoint to use for decoding."
"Note: Epoch counts from 0.",
)
parser.add_argument(
"--avg",
type=int,
- default=20,
+ default=1,
help="Number of checkpoints to average. Automatically select "
"consecutive checkpoints before the checkpoint specified by "
"'--epoch'. ",
@@ -115,7 +115,7 @@ def get_parser():
parser.add_argument(
"--exp-dir",
type=str,
- default="conformer_ctc/exp",
+ default="conformer_ctc/exp_ss",
help="The experiment dir",
)
@@ -142,7 +142,7 @@ def get_params() -> AttributeDict:
params = AttributeDict(
{
# parameters for conformer
- "subsampling_factor": 4,
+ "subsampling_factor": 1,
"feature_dim": 80,
"nhead": 4,
"attention_dim": 512,
@@ -151,8 +151,8 @@ def get_params() -> AttributeDict:
"vgg_frontend": False,
"use_feat_batchnorm": True,
# parameters for decoder
- "search_beam": 20,
- "output_beam": 7,
+ "search_beam": 10,
+ "output_beam": 10,
"min_active_states": 30,
"max_active_states": 10000,
"use_double_scores": True,
@@ -373,9 +373,10 @@ def decode_dataset(
results = defaultdict(list)
for batch_idx, batch in enumerate(dl):
+
texts = batch["supervisions"]["text"]
cut_ids = [cut.id for cut in batch["supervisions"]["cut"]]
-
+ #print(len(texts))
hyps_dict = decode_one_batch(
params=params,
model=model,
@@ -545,8 +546,9 @@ def main():
args.return_cuts = True
aishell = AishellAsrDataModule(args)
test_cuts = aishell.test_cuts()
+ print('create test dataloaders')
test_dl = aishell.test_dataloaders(test_cuts)
-
+ print('create test dataloaders successfully')
test_sets = ["test"]
test_dls = [test_dl]
diff --git a/egs/aishell/ASR/conformer_ctc/export.py b/egs/aishell/ASR/conformer_ctc/export.py
index 42b8c29..4a07ee3 100644
--- a/egs/aishell/ASR/conformer_ctc/export.py
+++ b/egs/aishell/ASR/conformer_ctc/export.py
@@ -85,7 +85,7 @@ def get_params() -> AttributeDict:
params = AttributeDict(
{
"feature_dim": 80,
- "subsampling_factor": 4,
+ "subsampling_factor": 1,
"use_feat_batchnorm": True,
"attention_dim": 512,
"nhead": 4,
diff --git a/egs/aishell/ASR/conformer_ctc/subsampling.py b/egs/aishell/ASR/conformer_ctc/subsampling.py
index 542fb03..b7d2d41 100644
--- a/egs/aishell/ASR/conformer_ctc/subsampling.py
+++ b/egs/aishell/ASR/conformer_ctc/subsampling.py
@@ -43,15 +43,16 @@ class Conv2dSubsampling(nn.Module):
super().__init__()
self.conv = nn.Sequential(
nn.Conv2d(
- in_channels=1, out_channels=odim, kernel_size=3, stride=2
+ in_channels=1, out_channels=odim, kernel_size=3, stride=1, padding=1
),
nn.ReLU(),
nn.Conv2d(
- in_channels=odim, out_channels=odim, kernel_size=3, stride=2
+ in_channels=odim, out_channels=odim, kernel_size=3, stride=1, padding=1
),
nn.ReLU(),
)
- self.out = nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim)
+ #self.out = nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim)
+ self.out = nn.Linear(odim * idim, odim)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Subsample x.
diff --git a/egs/aishell/ASR/conformer_ctc/train.py b/egs/aishell/ASR/conformer_ctc/train.py
index a228cc1..52c14a3 100755
--- a/egs/aishell/ASR/conformer_ctc/train.py
+++ b/egs/aishell/ASR/conformer_ctc/train.py
@@ -57,7 +57,7 @@ def get_parser():
parser.add_argument(
"--world-size",
type=int,
- default=1,
+ default=4,
help="Number of GPUs for DDP training.",
)
@@ -95,7 +95,7 @@ def get_parser():
parser.add_argument(
"--exp-dir",
type=str,
- default="conformer_ctc/exp",
+ default="conformer_ctc/exp_ss1_conv",
help="""The experiment dir.
It specifies the directory where all training related
files, e.g., checkpoints, log, etc, are saved
@@ -203,7 +203,7 @@ def get_params() -> AttributeDict:
"reduction": "sum",
"use_double_scores": True,
# parameters for conformer
- "subsampling_factor": 4,
+ "subsampling_factor": 1,
"feature_dim": 80,
"attention_dim": 512,
"nhead": 4,
@@ -485,9 +485,10 @@ def train_one_epoch(
tot_loss = MetricsTracker()
for batch_idx, batch in enumerate(train_dl):
+
params.batch_idx_train += 1
batch_size = len(batch["supervisions"]["text"])
-
+ #print(batch_size)
loss, loss_info = compute_loss(
params=params,
model=model,
@@ -602,7 +603,7 @@ def run(rank, world_size, args):
vgg_frontend=False,
use_feat_batchnorm=params.use_feat_batchnorm,
)
-
+ print(model)
checkpoints = load_checkpoint_if_available(params=params, model=model)
model.to(device)
diff --git a/egs/aishell/ASR/conformer_ctc/transformer.py b/egs/aishell/ASR/conformer_ctc/transformer.py
index f93914a..90f575a 100644
--- a/egs/aishell/ASR/conformer_ctc/transformer.py
+++ b/egs/aishell/ASR/conformer_ctc/transformer.py
@@ -81,8 +81,8 @@ class Transformer(nn.Module):
self.num_features = num_features
self.num_classes = num_classes
self.subsampling_factor = subsampling_factor
- if subsampling_factor != 4:
- raise NotImplementedError("Support only 'subsampling_factor=4'.")
+ #if subsampling_factor != 4:
+ # raise NotImplementedError("Support only 'subsampling_factor=4'.")
# self.encoder_embed converts the input of shape (N, T, num_classes)
# to the shape (N, T//subsampling_factor, d_model).
@@ -93,7 +93,7 @@ class Transformer(nn.Module):
self.encoder_embed = VggSubsampling(num_features, d_model)
else:
self.encoder_embed = Conv2dSubsampling(num_features, d_model)
-
+ #self.encoder_embed = nn.Linear(num_features, d_model)
self.encoder_pos = PositionalEncoding(d_model, dropout)
encoder_layer = TransformerEncoderLayer(
diff --git a/egs/aishell/ASR/tdnn_lstm_ctc/asr_datamodule.py b/egs/aishell/ASR/tdnn_lstm_ctc/asr_datamodule.py
index d24ba6b..8fe179c 100644
--- a/egs/aishell/ASR/tdnn_lstm_ctc/asr_datamodule.py
+++ b/egs/aishell/ASR/tdnn_lstm_ctc/asr_datamodule.py
@@ -78,7 +78,7 @@ class AishellAsrDataModule:
group.add_argument(
"--max-duration",
type=int,
- default=200.0,
+ default=80.0,
help="Maximum pooled recordings duration (seconds) in a "
"single batch. You can reduce it if it causes CUDA OOM.",
)
> Could you please show the output of git diff so that we can see what exactly has been changed?
and the CER for ctc-decoding is
ctc-decoding 87.34
> At epoch=1, the training total loss can be reduced to 0.14, and the validation total loss is 0.15, but the WER of decoding is 88.xx.
The loss is averaged over all frames. Since you are using subsampling_factor == 1, you effectively increase the number of feature frames over which the average is computed, so a lower average loss does not by itself mean a better model.
I suggest that you train the model for more epochs.
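To make the averaging point concrete, a toy calculation with made-up numbers (not taken from the logs above):

# The same summed loss looks 4x smaller per frame once 4x subsampling is
# removed, because the denominator (number of output frames) grows 4x.
summed_loss = 800.0
frames_subsampled = 1000              # output frames with subsampling_factor=4
frames_full = 4 * frames_subsampled   # output frames with subsampling_factor=1
print(summed_loss / frames_subsampled)  # 0.8
print(summed_loss / frames_full)        # 0.2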
> I suggest that you train the model for more epochs.
I also trained a model with a linear input layer for 63 epochs. The training loss is down to 0.1018, but the validation loss has stopped decreasing since epoch 9 (val loss = 0.15), and by epoch 63 it is up to 0.1668. Does this look like an overfitting problem? And still, only the first few words of each sentence can be decoded.
training log:
...
2022-11-25 00:10:59,013 INFO [train.py:529] (3/4) Computing validation loss
2022-11-25 00:12:00,241 INFO [train.py:538] (3/4) Epoch 63, validation: ctc_loss=0.169, att_loss=0.1659, loss=0.1668, over 6512315.00 frames.
...
2022-11-25 00:23:03,693 INFO [train.py:512] (3/4) Epoch 63, batch 7100, loss[ctc_loss=0.1711, att_loss=0.07278, loss=0.1023, over 5779.00 frames. ], tot_loss[ctc_loss=0.1755, att_loss=0.07464, loss=0.1049, over 1120268.00 frames. ], batch size: 10
2022-11-25 00:23:32,981 INFO [train.py:512] (3/4) Epoch 63, batch 7150, loss[ctc_loss=0.1624, att_loss=0.06555, loss=0.09462, over 5505.00 frames. ], tot_loss[ctc_loss=0.176, att_loss=0.07467, loss=0.1051, over 1117408.24 frames. ], batch size: 8
2022-11-25 00:24:02,118 INFO [train.py:512] (3/4) Epoch 63, batch 7200, loss[ctc_loss=0.1812, att_loss=0.07223, loss=0.1049, over 5441.00 frames. ], tot_loss[ctc_loss=0.1752, att_loss=0.0741, loss=0.1044, over 1110711.99 frames. ], batch size: 10
2022-11-25 00:24:31,838 INFO [train.py:512] (3/4) Epoch 63, batch 7250, loss[ctc_loss=0.185, att_loss=0.09373, loss=0.1211, over 5256.00 frames. ], tot_loss[ctc_loss=0.1743, att_loss=0.07371, loss=0.1039, over 1103355.27 frames. ], batch size: 21
2022-11-25 00:25:01,943 INFO [train.py:512] (3/4) Epoch 63, batch 7300, loss[ctc_loss=0.1681, att_loss=0.07336, loss=0.1018, over 5640.00 frames. ], tot_loss[ctc_loss=0.1738, att_loss=0.07414, loss=0.104, over 1094086.30 frames. ], batch size: 12
2022-11-25 00:25:34,936 INFO [train.py:512] (3/4) Epoch 64, batch 0, loss[ctc_loss=0.167, att_loss=0.07096, loss=0.09978, over 5454.00 frames. ], tot_loss[ctc_loss=0.167, att_loss=0.07096, loss=0.09978, over 5454.00 frames. ], batch size: 10
2022-11-25 00:26:05,136 INFO [train.py:512] (3/4) Epoch 64, batch 50, loss[ctc_loss=0.1665, att_loss=0.06461, loss=0.09519, over 5745.00 frames. ], tot_loss[ctc_loss=0.1684, att_loss=0.07287, loss=0.1015, over 251815.98 frames. ], batch size: 9
To counter the overfitting, I raised the dropout rate of the model to 0.2. In the log of the linear-input-layer run, the attention loss on the validation set increases rather than decreases after the learning rate begins to decline.
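For reference, one way the dropout change could be made; this assumes the Conformer constructor in conformer_ctc exposes a dropout argument (default 0.1 upstream), so verify against your checkout:

# In run() of conformer_ctc/train.py (sketch; kwargs as in the diff above):
model = Conformer(
    num_features=params.feature_dim,
    nhead=params.nhead,
    d_model=params.attention_dim,
    num_classes=num_classes,
    subsampling_factor=params.subsampling_factor,
    num_decoder_layers=params.num_decoder_layers,
    vgg_frontend=False,
    use_feat_batchnorm=params.use_feat_batchnorm,
    dropout=0.2,  # raised from the 0.1 default to counter overfitting
)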
I had a similar problem. Have you solved it yet?
Each time the test set of AISHELL-1 is decoded, it outputs some nearly uniform string, for example: 'zan si a da sa san', 'zan si a da sa san', 'zan si a da sa san', 'zan si a da sa san', 'zan si a da sa san', 'zan si a da sa san', 'zan si a da sa san', 'zan si a da sa san', 'zan si a da sa san', 'zan si a da sa san', 'zan si a da sa san', 'zan i a da sa san', 'zan i a da sa san', 'zan i a da sa san', 'zan si a da san san', 'zan si a da san san'
@GabrielHaoHao If you are using the conformer_ctc recipe, you shouldn't be; there are much better recipes now. Are you sure you didn't make any changes to the scripts?
No, I didn't. But this may be caused by too much future content being involved in a single encoder frame in the encoder, so do we really need these downsampling layers?
I think it works best with the downsampling layers. You can show convergence plots if you want; maybe your model did not converge. But that is not a good recipe.
@oshindow I successfully recognized AISHELL's test audio by using a linear layer instead of a downsampling layer in the wenet framework.
@danpovey It is true that there is no convergence: neither the CTC nor the attention loss decreases as the number of epochs increases.
2023-07-24 23:41:34,159 INFO [train.py:554] (1/2) Epoch 0, batch 8400, loss[ctc_loss=0.302, att_loss=0.2322, loss=0.2531, over 9505.00 frames. utt_duration=559.1 frames, utt_pad_proportion=0.0139, over 17.00 utterances.], tot_loss[ctc_loss=0.2931, att_loss=0.2315, loss=0.25, over 1900551.80 frames. utt_duration=460.4 frames, utt_pad_proportion=0.02389, over 4128.24 utterances.], batch size: 17
2023-07-24 23:41:47,338 INFO [train.py:554] (0/2) Epoch 0, batch 8450, loss[ctc_loss=0.292, att_loss=0.2407, loss=0.2561, over 9406.00 frames. utt_duration=495.1 frames, utt_pad_proportion=0.01384, over 19.00 utterances.], tot_loss[ctc_loss=0.2971, att_loss=0.2339, loss=0.2529, over 1904452.82 frames. utt_duration=450.2 frames, utt_pad_proportion=0.02336, over 4230.63 utterances.], batch size: 19
2023-07-24 23:41:47,349 INFO [train.py:554] (1/2) Epoch 0, batch 8450, loss[ctc_loss=0.2969, att_loss=0.2499, loss=0.264, over 9815.00 frames. utt_duration=467.4 frames, utt_pad_proportion=0.01396, over 21.00 utterances.], tot_loss[ctc_loss=0.2965, att_loss=0.2326, loss=0.2518, over 1901661.73 frames. utt_duration=453.1 frames, utt_pad_proportion=0.0249, over 4197.36 utterances.], batch size: 21
2023-07-24 23:42:00,848 INFO [train.py:554] (1/2) Epoch 0, batch 8500, loss[ctc_loss=0.2854, att_loss=0.2394, loss=0.2532, over 9547.00 frames. utt_duration=454.6 frames, utt_pad_proportion=0.01384, over 21.00 utterances.], tot_loss[ctc_loss=0.2959, att_loss=0.2325, loss=0.2515, over 1899738.04 frames. utt_duration=456.2 frames, utt_pad_proportion=0.02624, over 4164.40 utterances.], batch size: 21
2023-07-24 23:42:00,863 INFO [train.py:554] (0/2) Epoch 0, batch 8500, loss[ctc_loss=0.2862, att_loss=0.2374, loss=0.252, over 9610.00 frames. utt_duration=417.8 frames, utt_pad_proportion=0.01223, over 23.00 utterances.], tot_loss[ctc_loss=0.2952, att_loss=0.2333, loss=0.2519, over 1898918.12 frames. utt_duration=455.8 frames, utt_pad_proportion=0.02384, over 4165.73 utterances.], batch size: 23
2023-07-24 23:42:14,960 INFO [train.py:554] (0/2) Epoch 0, batch 8550, loss[ctc_loss=0.2797, att_loss=0.2419, loss=0.2532, over 9791.00 frames. utt_duration=543.9 frames, utt_pad_proportion=0.01101, over 18.00 utterances.], tot_loss[ctc_loss=0.2907, att_loss=0.2307, loss=0.2487, over 1892816.13 frames. utt_duration=468.8 frames, utt_pad_proportion=0.02474, over 4037.20 utterances.], batch size: 18
2023-07-24 23:42:14,962 INFO [train.py:554] (1/2) Epoch 0, batch 8550, loss[ctc_loss=0.2662, att_loss=0.2232, loss=0.2361, over 9569.00 frames. utt_duration=637.9 frames, utt_pad_proportion=0.01401, over 15.00 utterances.], tot_loss[ctc_loss=0.2952, att_loss=0.2316, loss=0.2507, over 1901428.71 frames. utt_duration=464.4 frames, utt_pad_proportion=0.02547, over 4094.69 utterances.], batch size: 15
————————————————————————————————————————————————————————
2023-07-25 10:01:24,117 INFO [train.py:554] (0/2) Epoch 15, batch 8200, loss[ctc_loss=0.2702, att_loss=0.2243, loss=0.238, over 9418.00 frames. utt_duration=495.7 frames, utt_pad_proportion=0.01258, over 19.00 utterances.], tot_loss[ctc_loss=0.2519, att_loss=0.2082, loss=0.2213, over 1894727.04 frames. utt_duration=464 frames, utt_pad_proportion=0.02453, over 4083.40 utterances.], batch size: 19
2023-07-25 10:01:38,358 INFO [train.py:554] (0/2) Epoch 15, batch 8250, loss[ctc_loss=0.2306, att_loss=0.1997, loss=0.209, over 9728.00 frames. utt_duration=694.9 frames, utt_pad_proportion=0.01856, over 14.00 utterances.], tot_loss[ctc_loss=0.2534, att_loss=0.2094, loss=0.2226, over 1893286.78 frames. utt_duration=453.7 frames, utt_pad_proportion=0.0258, over 4173.27 utterances.], batch size: 14
2023-07-25 10:01:38,366 INFO [train.py:554] (1/2) Epoch 15, batch 8250, loss[ctc_loss=0.2608, att_loss=0.212, loss=0.2266, over 9575.00 frames. utt_duration=456 frames, utt_pad_proportion=0.008799, over 21.00 utterances.], tot_loss[ctc_loss=0.2564, att_loss=0.2114, loss=0.2249, over 1905211.70 frames. utt_duration=444.6 frames, utt_pad_proportion=0.02233, over 4285.64 utterances.], batch size: 21
2023-07-25 10:01:51,360 INFO [train.py:554] (1/2) Epoch 15, batch 8300, loss[ctc_loss=0.2547, att_loss=0.2134, loss=0.2258, over 9773.00 frames. utt_duration=574.9 frames, utt_pad_proportion=0.01897, over 17.00 utterances.], tot_loss[ctc_loss=0.2575, att_loss=0.2123, loss=0.2259, over 1906473.26 frames. utt_duration=439.9 frames, utt_pad_proportion=0.02228, over 4333.47 utterances.], batch size: 17
2023-07-25 10:01:51,360 INFO [train.py:554] (0/2) Epoch 15, batch 8300, loss[ctc_loss=0.2475, att_loss=0.2067, loss=0.2189, over 9541.00 frames. utt_duration=596.3 frames, utt_pad_proportion=0.009448, over 16.00 utterances.], tot_loss[ctc_loss=0.2528, att_loss=0.2089, loss=0.2221, over 1893144.00 frames. utt_duration=461 frames, utt_pad_proportion=0.02574, over 4106.92 utterances.], batch size: 16
2023-07-25 10:02:04,495 INFO [train.py:554] (1/2) Epoch 15, batch 8350, loss[ctc_loss=0.2687, att_loss=0.2222, loss=0.2362, over 9697.00 frames. utt_duration=510.4 frames, utt_pad_proportion=0.01283, over 19.00 utterances.], tot_loss[ctc_loss=0.258, att_loss=0.2128, loss=0.2264, over 1905677.21 frames. utt_duration=445 frames, utt_pad_proportion=0.02254, over 4282.07 utterances.], batch size: 19
2023-07-25 10:02:04,513 INFO [train.py:554] (0/2) Epoch 15, batch 8350, loss[ctc_loss=0.2806, att_loss=0.2226, loss=0.24, over 9643.00 frames. utt_duration=507.5 frames, utt_pad_proportion=0.01642, over 19.00 utterances.], tot_loss[ctc_loss=0.2544, att_loss=0.2099, loss=0.2233, over 1895872.28 frames. utt_duration=458.6 frames, utt_pad_proportion=0.02538, over 4134.12 utterances.], batch size: 19
2023-07-25 10:02:17,841 INFO [train.py:554] (1/2) Epoch 15, batch 8400, loss[ctc_loss=0.2683, att_loss=0.2176, loss=0.2328, over 9719.00 frames. utt_duration=405 frames, utt_pad_proportion=0.0147, over 24.00 utterances.], tot_loss[ctc_loss=0.258, att_loss=0.2127, loss=0.2263, over 1907905.56 frames. utt_duration=447.4 frames, utt_pad_proportion=0.02173, over 4264.28 utterances.], batch size: 24
2023-07-25 10:02:17,845 INFO [train.py:554] (0/2) Epoch 15, batch 8400, loss[ctc_loss=0.2597, att_loss=0.2094, loss=0.2245, over 9821.00 frames. utt_duration=467.7 frames, utt_pad_proportion=0.01336, over 21.00 utterances.], tot_loss[ctc_loss=0.2552, att_loss=0.2104, loss=0.2239, over 1896564.32 frames. utt_duration=451.6 frames, utt_pad_proportion=0.02504, over 4199.72 utterances.], batch size: 21
2023-07-25 10:02:31,309 INFO [train.py:554] (0/2) Epoch 15, batch 8450, loss[ctc_loss=0.2676, att_loss=0.228, loss=0.2399, over 9849.00 frames. utt_duration=579.4 frames, utt_pad_proportion=0.01134, over 17.00 utterances.], tot_loss[ctc_loss=0.2555, att_loss=0.2111, loss=0.2244, over 1899608.11 frames. utt_duration=451.6 frames, utt_pad_proportion=0.02494, over 4206.08 utterances.], batch size: 17
2023-07-25 10:02:31,315 INFO [train.py:554] (1/2) Epoch 15, batch 8450, loss[ctc_loss=0.2395, att_loss=0.2017, loss=0.213, over 9233.00 frames. utt_duration=615.5 frames, utt_pad_proportion=0.01672, over 15.00 utterances.], tot_loss[ctc_loss=0.2566, att_loss=0.2124, loss=0.2257, over 1902985.92 frames. utt_duration=451.1 frames, utt_pad_proportion=0.02234, over 4218.98 utterances.], batch size: 15
2023-07-25 10:02:44,844 INFO [train.py:554] (0/2) Epoch 15, batch 8500, loss[ctc_loss=0.2764, att_loss=0.221, loss=0.2376, over 9457.00 frames. utt_duration=295.5 frames, utt_pad_proportion=0.02786, over 32.00 utterances.], tot_loss[ctc_loss=0.2542, att_loss=0.2102, loss=0.2234, over 1894295.07 frames. utt_duration=459.9 frames, utt_pad_proportion=0.02655, over 4118.75 utterances.], batch size: 32
2023-07-25 10:02:44,851 INFO [train.py:554] (1/2) Epoch 15, batch 8500, loss[ctc_loss=0.2671, att_loss=0.2237, loss=0.2367, over 9528.00 frames. utt_duration=381.1 frames, utt_pad_proportion=0.01264, over 25.00 utterances.], tot_loss[ctc_loss=0.2567, att_loss=0.2126, loss=0.2258, over 1899652.45 frames. utt_duration=447.5 frames, utt_pad_proportion=0.025, over 4244.59 utterances.], batch size: 25
2023-07-25 10:02:59,319 INFO [train.py:554] (0/2) Epoch 15, batch 8550, loss[ctc_loss=0.2712, att_loss=0.2247, loss=0.2386, over 8979.00 frames. utt_duration=256.5 frames, utt_pad_proportion=0.07718, over 35.00 utterances.], tot_loss[ctc_loss=0.2519, att_loss=0.2087, loss=0.2217, over 1878621.38 frames. utt_duration=460.6 frames, utt_pad_proportion=0.03093, over 4078.71 utterances.], batch size: 35
2023-07-25 10:02:59,326 INFO [train.py:554] (1/2) Epoch 15, batch 8550, loss[ctc_loss=0.1677, att_loss=0.1511, loss=0.1561, over 7831.00 frames. utt_duration=978.9 frames, utt_pad_proportion=0.2099, over 8.00 utterances.], tot_loss[ctc_loss=0.2537, att_loss=0.2105, loss=0.2234, over 1886784.67 frames. utt_duration=454.6 frames, utt_pad_proportion=0.03054, over 4150.05 utterances.], batch size: 8
2023-07-25 10:03:13,667 INFO [train.py:554] (0/2) Epoch 15, batch 8600, loss[ctc_loss=0.2772, att_loss=0.2206, loss=0.2376, over 8894.00 frames. utt_duration=254.1 frames, utt_pad_proportion=0.0892, over 35.00 utterances.], tot_loss[ctc_loss=0.2513, att_loss=0.208, loss=0.221, over 1864720.14 frames. utt_duration=455.3 frames, utt_pad_proportion=0.03887, over 4095.60 utterances.], batch size: 35
2023-07-25 10:03:13,677 INFO [train.py:554] (1/2) Epoch 15, batch 8600, loss[ctc_loss=0.269, att_loss=0.226, loss=0.2389, over 8736.00 frames. utt_duration=249.6 frames, utt_pad_proportion=0.1054, over 35.00 utterances.], tot_loss[ctc_loss=0.2513, att_loss=0.2088, loss=0.2216, over 1869457.40 frames. utt_duration=453.7 frames, utt_pad_proportion=0.03818, over 4120.70 utterances.], batch size: 35
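Regarding convergence plots: a throwaway sketch for pulling the running tot_loss values out of log files like the ones above (the regex and the file name are assumptions tied to the exact format shown here):

import re

# Matches "tot_loss[ctc_loss=..., att_loss=..., loss=...," in train.py logs.
PATTERN = re.compile(
    r"tot_loss\[ctc_loss=([\d.]+), att_loss=([\d.]+), loss=([\d.]+),"
)

def tot_losses(path: str):
    """Return (ctc_loss, att_loss, loss) triples in the order they appear."""
    with open(path) as f:
        return [tuple(map(float, m.groups())) for m in PATTERN.finditer(f.read())]

# Usage (hypothetical file name): plot the third column against batch index.
# triples = tot_losses("train-log.txt")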
> @GabrielHaoHao If you are using the conformer_ctc recipe, you shouldn't be; there are much better recipes now. Are you sure you didn't make any changes to the scripts?
Yes, I'm using the conformer_ctc recipe. I just used a linear layer instead of the original two convolutional layers (4x downsampling); the rest of the model keeps the conformer structure.
Hi, using the default conformer_ctc/train.py and conformer_ctc/decode.py scripts of the Aishell example, I can get a training total loss of 0.2 and a validation total loss of 0.16 at epoch=7, with a decoding WER around 8.x. But since the default model applies 4x downsampling, the timestamps have a resolution of only 0.04 s. I hoped to remove the downsampling to get more accurate timestamps, so I changed the subsampling layer to a linear layer (or modified the stride and padding of the convolutional network) and set subsampling_factor to 1. At epoch=1, the training total loss can be reduced to 0.14 and the validation total loss is 0.15, but the WER of decoding is 88.xx. The decoding result is poor even though the model seems to have converged, and only the first few words of each sentence can be decoded.
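For reference on the resolution claim: assuming the usual 10 ms fbank frame shift (not stated in the post, but the common default in these recipes), the timestamp resolution is simply frame shift times subsampling factor:

frame_shift_s = 0.01                  # assumed 10 ms per feature frame
print(frame_shift_s * 4)              # 0.04 s per encoder frame (default model)
print(frame_shift_s * 1)              # 0.01 s with subsampling_factor=1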