starrytong / SCNet

MIT License

The loss and metric suddenly became abnormal during training. #19

Open hollandloprabbit opened 2 weeks ago

hollandloprabbit commented 2 weeks ago

I was training SCNet-large (without mixed precision training), and partway through training the loss and gradient norm suddenly spiked while the metric dropped. How should I handle this situation?

2024-11-09 12:58:25,306 - INFO - ----------------------------------------------------------------------
2024-11-09 12:58:25,306 - INFO - Training Epoch 61 ...
2024-11-09 13:26:45,479 - INFO - Train Summary | Epoch 61 | Loss=0.0548 | Grad=0.0438
2024-11-09 13:26:45,479 - INFO - ----------------------------------------------------------------------
2024-11-09 13:26:45,479 - INFO - Cross validation...
2024-11-09 13:32:05,373 - INFO - Valid Summary | Epoch 61 | Loss=0.1420 | Nsdr=9.094
2024-11-09 13:32:06,420 - INFO - Learning rate adjusted to 0.0002657527142592
2024-11-09 13:32:06,421 - INFO - ----------------------------------------------------------------------
2024-11-09 13:32:06,421 - INFO - Training Epoch 62 ...
2024-11-09 14:02:04,930 - INFO - Train Summary | Epoch 62 | Loss=0.1112 | Grad=1.0411
2024-11-09 14:02:04,930 - INFO - ----------------------------------------------------------------------
2024-11-09 14:02:04,930 - INFO - Cross validation...
2024-11-09 14:07:13,337 - INFO - Valid Summary | Epoch 62 | Loss=0.1595 | Nsdr=7.740
2024-11-09 14:07:14,359 - INFO - Learning rate adjusted to 0.0002657527142592
2024-11-09 14:07:14,359 - INFO - ----------------------------------------------------------------------
2024-11-09 14:07:14,359 - INFO - Training Epoch 63 ...
2024-11-09 14:35:24,521 - INFO - Train Summary | Epoch 63 | Loss=0.1143 | Grad=0.1646
2024-11-09 14:35:24,521 - INFO - ----------------------------------------------------------------------
2024-11-09 14:35:24,521 - INFO - Cross validation...
2024-11-09 14:40:34,042 - INFO - Valid Summary | Epoch 63 | Loss=0.1615 | Nsdr=7.577
2024-11-09 14:40:35,062 - INFO - Learning rate adjusted to 0.0002657527142592
2024-11-09 14:40:35,063 - INFO - ----------------------------------------------------------------------
2024-11-09 14:40:35,063 - INFO - Training Epoch 64 ...
2024-11-09 15:07:52,362 - INFO - Train Summary | Epoch 64 | Loss=11.0862 | Grad=1651.3630
2024-11-09 15:07:52,362 - INFO - ----------------------------------------------------------------------
2024-11-09 15:07:52,362 - INFO - Cross validation...
2024-11-09 15:13:00,251 - INFO - Valid Summary | Epoch 64 | Loss=0.1682 | Nsdr=7.171
2024-11-09 15:13:01,288 - INFO - Learning rate adjusted to 0.0002657527142592
2024-11-09 15:13:01,289 - INFO - ----------------------------------------------------------------------
2024-11-09 15:13:01,289 - INFO - Training Epoch 65 ...
2024-11-09 15:15:03,930 - INFO - Total number of parameters: 42181232
2024-11-09 15:15:09,915 - INFO - train/valid set size: 43723 50
2024-11-09 15:15:10,236 - INFO - Learning rate adjusted to 0.0005
2024-11-09 15:15:10,237 - INFO - ----------------------------------------------------------------------
starrytong commented 2 weeks ago

I haven't tried training a large model without AMP. May I ask what batch size you used? Perhaps you could try further lowering the learning rate.
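Besides lowering the learning rate, a common mitigation for a gradient spike like the `Grad=1651.3630` at epoch 64 is gradient clipping before the optimizer step. A minimal PyTorch sketch (the model, threshold, and loop here are placeholders, not the SCNet training code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model and batch.
model = nn.Linear(8, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

x = torch.randn(2, 8)
loss = model(x).pow(2).mean()

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global L2 norm is at most max_norm;
# returns the pre-clipping total norm, which is useful for logging spikes.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```

The returned `total_norm` can be logged each step, so a runaway batch shows up immediately instead of only in the epoch summary.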

hollandloprabbit commented 1 week ago

I used a batch size of 4. Your paper states a learning rate of 5e-4 when training on the MUSDB18 dataset only, but the config file you provided sets it to 3e-4. What learning rate is appropriate when training the large version of the model?

starrytong commented 1 week ago

A learning rate of 5e-4 corresponds to the standard version of SCNet; the 3e-4 in the configuration file I provided corresponds to the large version. However, I trained with AMP, and without it the learning rate might need to be reduced further.
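For reference, a standard PyTorch mixed-precision training step looks roughly like the sketch below. This is a generic AMP loop with placeholder model and data, not the SCNet training code; the `GradScaler` also skips optimizer steps whose gradients contain inf/NaN, which can itself help ride out occasional spikes:

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer; autocast and scaling are only active on CUDA.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"
model = nn.Linear(8, 4).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(2, 8, device=device)
with torch.autocast(device_type=device, enabled=use_amp):
    loss = model(x).pow(2).mean()  # forward pass runs in reduced precision

optimizer.zero_grad()
scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
scaler.update()
```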