voldemortX / pytorch-auto-drive

PytorchAutoDrive: Segmentation models (ERFNet, ENet, DeepLab, FCN...) and Lane detection models (SCNN, RESA, LSTR, LaneATT, BézierLaneNet...) based on PyTorch with fast training, visualization, benchmarking & deployment help
BSD 3-Clause "New" or "Revised" License

Training losses become 0 after running test between epochs #125

Closed mengxia1994 closed 1 year ago

mengxia1994 commented 1 year ago

I made a small change in LaneDetTrainer and base.py: I commented out the fast_evaluate code ('Only segmentation based methods can be fast evaluated!') and call the LaneDetTester test_one_set function as shown below, trying to add a validation step between epochs.

if self._cfg['validation']:
    # fast_evaluate code (commented out)
    if i == len(self.dataloader) - 1 and epoch > 0 and epoch % self._cfg['val_num_steps'] == 0:
        print('start validation on epoch: ', epoch)
        LaneDetTester.test_one_set(self.model, self.device, self.validation_loader, self._cfg['mixed_precision'],
                                   [self._cfg['input_size'], self._cfg['original_size']],
                                   10, 42, None,
                                   'tusimple', False, 6, self._cfg['exp_name'], val_json_gt)
        self.model.train()
        save_checkpoint(net=self.model.module if self._cfg['distributed'] else self.model,
                        optimizer=None,
                        lr_scheduler=None,
                        filename=os.path.join(self._cfg['exp_dir'], 'model_' + str(epoch) + '.pt'))

As you can see, I call self.model.train() afterwards, just like the fast_evaluate code does. However, some losses became 0 after the test part, as shown below:

[2, 980] training loss: 29.0062 [2, 980] loss label: 0.6740 [2, 980] loss curve: 2.1232 [2, 980] loss upper: 0.4012 [2, 980] loss lower: 0.6994 [2, 980] training loss aux0: 14.1670 [2, 980] loss label aux0: 0.7017 [2, 980] loss curve aux0: 2.0173 [2, 980] loss upper aux0: 0.2924 [2, 980] loss lower aux0: 0.6955
start validation on epoch: 1
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 227/227 [00:02<00:00, 96.23it/s]
[{"name":"Accuracy","value":0.016865079365079364,"order":"desc"},{"name":"FP","value":0.019089574155653453,"order":"asc"},{"name":"FN","value":0.993208516886931,"order":"asc"}]
Epoch time: 38.95s
[3, 80] training loss: 17.1586 [3, 80] loss label: 0.6709 [3, 80] loss curve: 2.0960 [3, 80] loss upper: 0.4002 [3, 80] loss lower: 0.7055 [3, 80] training loss aux0: 2.4547 [3, 80] loss label aux0: 0.1318 [3, 80] loss curve aux0: 0.3381 [3, 80] loss upper aux0: 0.0546 [3, 80] loss lower aux0: 0.1298
[3, 179] training loss: 13.8403 [3, 179] loss label: 0.6764 [3, 179] loss curve: 1.9323 [3, 179] loss upper: 0.3870 [3, 179] loss lower: 0.6878 [3, 179] training loss aux0: 0.0000 [3, 179] loss label aux0: 0.0000 [3, 179] loss curve aux0: 0.0000 [3, 179] loss upper aux0: 0.0000 [3, 179] loss lower aux0: 0.0000
[3, 278] training loss: 14.7566 [3, 278] loss label: 0.6881 [3, 278] loss curve: 2.1140 [3, 278] loss upper: 0.3753 [3, 278] loss lower: 0.6859 [3, 278] training loss aux0: 0.0000 [3, 278] loss label aux0: 0.0000 [3, 278] loss curve aux0: 0.0000 [3, 278] loss upper aux0: 0.0000 [3, 278] loss lower aux0: 0.0000

If I use LaneDetTester after the whole training, it works fine. I need some advice~

voldemortX commented 1 year ago

@mengxia1994 It seems the 0 loss does not appear right after validation? Does this happen every time?

mengxia1994 commented 1 year ago

Yes. But after the first validation, the aux0 loss will always be 0.

voldemortX commented 1 year ago

> Yes. But after the first validation, the aux0 loss will always be 0.

That is interesting. Could you see similar behavior for other algorithms (other than LSTR)?

voldemortX commented 1 year ago

@mengxia1994 I think this could be a bug, because I rewrote eval() in line 109 of the LSTR code. I don't have my laptop at hand; could you try adding a corresponding train() method that sets everything to True?
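
To illustrate the suspected mechanism (ToyModel below is a hypothetical stand-in, not the actual LSTR code): in PyTorch, nn.Module.eval() simply calls self.train(False), so if eval() is overridden to flip extra flags but train() is not, a later model.train() restores self.training but never restores those flags.

import torch.nn as nn

class ToyModel(nn.Module):
    # hypothetical stand-in: only eval() is overridden, like lstr.py before the fix
    def __init__(self):
        super().__init__()
        self.aux_loss = True  # auxiliary losses enabled for training

    def eval(self):
        super().eval()         # nn.Module.eval() just calls self.train(False)
        self.aux_loss = False  # turned off for inference
        return self

m = ToyModel()
m.eval()           # validation pass: aux_loss -> False
m.train()          # back to training mode...
print(m.aux_loss)  # ...still False, so the aux losses collapse to 0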

mengxia1994 commented 1 year ago

> > Yes. But after the first validation, the aux0 loss will always be 0.

> That is interesting. Could you see similar behavior for other algorithms (other than LSTR)?

Not yet. After a quick trial of all the algorithms, I chose LSTR to do some more digging and fine-tuning. I tried to add the code above because there is some overfitting and I'm not comfortable using TensorBoard. Are you suggesting that your first reaction is that it may be because of LSTR rather than some torch setting?

mengxia1994 commented 1 year ago

> @mengxia1994 I think this could be a bug, because I rewrote eval() in line 109 of the LSTR code. I don't have my laptop at hand; could you try adding a corresponding train() method that sets everything to True?

In which file?

mengxia1994 commented 1 year ago

Found it, I will give it a try.

voldemortX commented 1 year ago

@mengxia1994 I have verified this bug; it affects LSTR, BézierLaneNet & RepVGG, and this should fix it. Thanks a lot for pointing this out!

mengxia1994 commented 1 year ago

> @mengxia1994 I have verified this bug; it affects LSTR, BézierLaneNet & RepVGG, and this should fix it. Thanks a lot for pointing this out!

Thank you for your quick fix, I will add the new code~ Actually, I already tried to add an overridden train function as you suggested (lol). The training is still running, but the printed information and the validation results suggest that it worked. If you have time, please check whether my modification makes sense. I added a train function in lstr.py right after the existing eval:

def eval(self):
    super().eval()
    self.aux_loss = False
    self.transformer.decoder.return_intermediate = False

and the rewritten train:

def train(self, mode=True):
    super().train(mode)
    self.aux_loss = True
    self.transformer.decoder.return_intermediate = True
voldemortX commented 1 year ago

> and the rewritten train:
>
>     def train(self, mode=True):
>         super().train(mode)
>         self.aux_loss = True
>         self.transformer.decoder.return_intermediate = True

Yes, this mod should bring the same correct behavior for LSTR.
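
As a quick sanity check (a sketch, assuming model is an already-built LSTR model with both the eval() and train() overrides shown above):

model.eval()
assert model.aux_loss is False
assert model.transformer.decoder.return_intermediate is False

model.train()
assert model.aux_loss is True
assert model.transformer.decoder.return_intermediate is True
# if these asserts pass, the aux0 losses should no longer stay 0 after validation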

voldemortX commented 1 year ago

This issue seems to be addressed, so I'll close it for now. Feel free to continue commenting to reopen it, or open a new one.