lehduong / Knowledge-Distillation-by-Replacing-Cheap-Conv

In Search of an Effective and Efficient Pipeline for Distilling Knowledge in Convolutional Neural Networks
MIT License

Loss has gone crazy after several epochs #7

Open Sundragon1993 opened 3 years ago

Sundragon1993 commented 3 years ago

First of all, thank you so much for the valuable contribution to the community,

I followed your instructions and successfully trained the network with the 51M_deeplab_all.json config; everything was fine for the first several epochs:

Train Epoch: 12 [100]/[100] Loss: 2.269440 mIoU: 0.841808 Teacher mIoU: 0.853646 Supervised Loss: 0.094498 Knowledge Distillation loss: 0.529225 Hint Loss: 2.269440 Teacher Loss: 0.085739
    epoch          : 12
    loss           : 2.269439737395485
    supervised_loss: 0.09449830075891891
    kd_loss        : 0.5292251629404502
    hint_loss      : 2.269439737395485
    teacher_loss   : 0.08573895626434005
    train_teacher_mIoU: 0.8536455017769807
    train_student_mIoU: 0.8418077072807909

After that, the student model's losses suddenly surged to absurdly large values:

Train Epoch: 13 [100]/[100] Loss: 49375739005023.046875 mIoU: 0.189571 Teacher mIoU: 0.840491 Supervised Loss: 1254511.242623 Knowledge Distillation loss: 4423746118372.598633 Hint Loss: 49375739005023.046875 Teacher Loss: 0.089963
    epoch          : 13
    loss           : 49375739005023.05
    supervised_loss: 1254511.2426230733
    kd_loss        : 4423746118372.599
    hint_loss      : 49375739005023.05
    teacher_loss   : 0.08996338581684793
    train_teacher_mIoU: 0.8404906695894594
    train_student_mIoU: 0.1895712015614291

Do you have any idea about this kind of error? I'm using torch: 1.8.0+cu111 and torchvision 0.9.0+cu111. Thank you very much!
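A common first mitigation for this kind of sudden loss explosion is to clip the gradient norm so that a single bad batch cannot blow up the weights. This is a generic PyTorch sketch, not code from this repo; `model`, `optimizer`, and the loss here are stand-ins for the project's own objects:

```python
import torch
import torch.nn as nn

# Hypothetical minimal training step (placeholder model and data).
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10)
target = torch.randn(4, 2)

loss = nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global L2 norm is at most max_norm
# before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping does not fix the underlying cause (a too-high learning rate or an unstable hint/KD loss term can still diverge), but it usually turns a hard explosion into something debuggable.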

votnhan commented 3 years ago

Can you tell us which teacher model you used?

Sundragon1993 commented 3 years ago

Thank you for your reply!

I've tried the 51M_Deeplab_all.json and 51M_Deeplab_incremental.json config files, and in addition to this error, the validation run after each val_interval also fails with: the loss (hint loss) has no attribute 'backward'.

It seems the loss becomes a plain Python float rather than a tensor after the validation pass...
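One plausible failure mode matching this symptom (a guess, not confirmed against the repo's code) is that the validation path converts the loss to a float for logging via `.item()` and the same variable is then reused by the training loop. A minimal reproduction:

```python
import torch

# A loss tensor as produced during training.
loss = torch.tensor(2.5, requires_grad=True)

# During validation it is common to convert tensors to plain floats
# for logging / averaging:
val_loss = loss.item()   # -> Python float, detached from autograd

# If the validation code overwrites the shared variable the same way,
# the next training step fails exactly as in the traceback below:
loss = loss.item()
try:
    loss.backward()
except AttributeError as e:
    print(e)             # floats have no .backward()
```

If that is the cause, keeping the logged value in a separate variable (or calling `.item()` only on a copy) should restore the tensor for `loss.backward()`.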

Sundragon1993 commented 3 years ago

Here is the full traceback:

There is no update ...
Traceback (most recent call last):
  File "train.py", line 92, in <module>
    main(config)
  File "train.py", line 73, in main
    trainer.train()
  File "/home/gvh205/src/base/base_trainer.py", line 85, in train
    result = self._train_epoch(epoch)
  File "/home/gvh205/src/trainer/layerwise_trainer.py", line 251, in _train_epoch
    loss.backward()
AttributeError: 'float' object has no attribute 'backward'

Thanks so much!