voldemortX / pytorch-auto-drive

PytorchAutoDrive: Segmentation models (ERFNet, ENet, DeepLab, FCN...) and Lane detection models (SCNN, RESA, LSTR, LaneATT, BézierLaneNet...) based on PyTorch with fast training, visualization, benchmarking & deployment help
BSD 3-Clause "New" or "Revised" License

RuntimeError: operation does not have an identity #123

Open mengxia1994 opened 2 years ago

mengxia1994 commented 2 years ago

I hit this error when I try to train LSTR:

    Loading targets into memory...
    100%|████████████████████| 2037/2037 [00:01<00:00, 1385.99it/s]
    [1, 202] training loss: 32.5229
    [1, 202] loss label: 0.7416
    [1, 202] loss curve: 2.3253
    [1, 202] loss upper: 0.1422
    [1, 202] loss lower: 0.7414
    [1, 202] training loss aux0: 16.9041
    [1, 202] loss label aux0: 0.7343
    [1, 202] loss curve aux0: 2.5827
    [1, 202] loss upper aux0: 0.1302
    [1, 202] loss lower aux0: 0.7638
    [1, 405] training loss: 18.8799
    [1, 405] loss label: 0.6991
    [1, 405] loss curve: 1.3303
    [1, 405] loss upper: 0.0926
    [1, 405] loss lower: 0.2203
    [1, 405] training loss aux0: 9.5051
    [1, 405] loss label aux0: 0.7070
    [1, 405] loss curve aux0: 1.3525
    [1, 405] loss upper aux0: 0.0942
    [1, 405] loss lower aux0: 0.2166
    [1, 608] training loss: 15.3990
    [1, 608] loss label: 0.6896
    [1, 608] loss curve: 1.0405
    [1, 608] loss upper: 0.0851
    [1, 608] loss lower: 0.1731
    [1, 608] training loss aux0: 7.6113
    [1, 608] loss label aux0: 0.6941
    [1, 608] loss curve aux0: 1.0002
    [1, 608] loss upper aux0: 0.0848
    [1, 608] loss lower aux0: 0.1792
    Traceback (most recent call last):
      File "main_landet.py", line 65, in <module>
        runner.run()
      File "/home/mengxia/pytorch-auto-drive/utils/runners/lane_det_trainer.py", line 55, in run
        self.model)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/mengxia/pytorch-auto-drive/utils/losses/hungarian_loss.py", line 124, in forward
        loss, log_dict = self.calc_full_loss(outputs=outputs, targets=targets)
      File "/home/mengxia/pytorch-auto-drive/utils/losses/hungarian_loss.py", line 136, in calc_full_loss
        indices = self.matcher(outputs=outputs, targets=targets)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
        return func(*args, **kwargs)
      File "/home/mengxia/pytorch-auto-drive/utils/losses/hungarian_loss.py", line 71, in forward
        norm_weights, valid_points = lane_normalize_in_batch(target_keypoints)  # G, G x N
      File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
        return func(*args, **kwargs)
      File "/home/mengxia/pytorch-auto-drive/utils/losses/hungarian_loss.py", line 24, in lane_normalize_in_batch
        norm_weights /= norm_weights.max()
    RuntimeError: operation does not have an identity.

mengxia1994 commented 2 years ago

Sometimes the error appears at the very beginning, and sometimes after several iterations, as above. I looked at #76; it may be the same problem. I checked the dataset but found nothing. I'm using a custom dataset organized in TuSimple format, with the input size adjusted to (540, 960). So far I have successfully trained every algorithm except LSTR. Need help~~

voldemortX commented 2 years ago

@mengxia1994 Do you have many no-lane images in your dataset?

mengxia1994 commented 2 years ago

I found the problem too. These are not actually no-lane images. A few of the lanes have only 2 or 3 points (the others are -2). However, after conversion to txt, they appear as 0 0 0 0 0 0 (which I believe is because the main direction is left-right and the lane is short). I will delete these cases and try again.
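For readers hitting the same error, here is a minimal sketch of how such near-empty lanes could be filtered out before conversion. It assumes TuSimple-style JSON-lines labels, where each record has a `lanes` list of x-coordinates and -2 marks a missing point. The function names and the `min_points` threshold are hypothetical and not part of this repo; whether to drop single lanes or whole images is a judgment call for your data.

```python
import json

MISSING = -2  # TuSimple-style marker for "no lane point at this row"


def count_valid_points(lane):
    """Count the x-coordinates in one lane that are real points."""
    return sum(1 for x in lane if x != MISSING)


def drop_short_lanes(record, min_points=4):
    """Return a copy of one annotation record without near-empty lanes.

    min_points is a hypothetical threshold, not a value from the repo.
    """
    kept = [lane for lane in record["lanes"]
            if count_valid_points(lane) >= min_points]
    return {**record, "lanes": kept}


def clean_label_file(src_path, dst_path, min_points=4):
    """Rewrite a JSON-lines label file and return how many lanes were dropped."""
    removed = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            cleaned = drop_short_lanes(record, min_points)
            removed += len(record["lanes"]) - len(cleaned["lanes"])
            dst.write(json.dumps(cleaned) + "\n")
    return removed
```

Running `clean_label_file` on a copy of the label file and checking the returned count is a quick way to see how many lanes are affected before retraining.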

voldemortX commented 2 years ago

Good luck! Do tell me if the issue persists.

mengxia1994 commented 2 years ago

It works. Thanks for your help! By the way, I have some questions:

  1. Did you use the default configs to get the best results shown in the benchmark? In my experiments, almost all the learning rates are too large and training does not converge. Since the default LSTR lr and SCNN lr are so different, I assume they were not set casually. Although I am using custom data, the dataset size is similar to TuSimple, and I think the data distribution across lane detection scenes is similar compared with other deep learning tasks. I just want some advice on training and tuning hyperparameters, since I am short on time and machines~~

  2. I haven't found examples of image augmentation in any of the configs. Is it implemented?

  3. How can I add validation during training? How can I print more information, such as accuracy/F1, during training? Can I save more checkpoints, not just the last one? Sometimes the last few models overfit very easily.

voldemortX commented 2 years ago

@mengxia1994

  1. We set the learning rate and other hyperparameters based on validation set performance. TuSimple can be a rather curious dataset, though, so the best lr for it may be off for your data. You could try one of the lower lr values that are frequently used across the configs. Remember that lr should be scaled according to batch size (usually a linear relationship: bigger bs, higher lr).

  2. Augmentations are implemented independently in this repo. You can find aug configs in `configs/datasets/`, and you can search for the corresponding code by class name.

  3. We have a `val-num-steps` arg for checkpoint selection. However, it is not supported for lane detection, since a lane detection network often performs best (on the final test set) at the end of training. We could really use a checkpointing option, though. If you would care to add it yourself, it shouldn't be very complex.
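The linear relationship between batch size and lr mentioned in point 1 can be sketched as follows. This is only an illustration of the rule of thumb; the function name and the numbers are hypothetical and not taken from any config in this repo.

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: lr changes in proportion to batch size."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Hypothetical values: a config tuned at batch size 20 is reused at batch size 10,
# so the lr is halved as a starting point for further tuning.
print(scale_lr(base_lr=0.2, base_batch_size=20, new_batch_size=10))
```

The scaled value is a starting point rather than a guarantee; on a custom dataset it is still worth trying a few lower values around it.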

mengxia1994 commented 2 years ago

OK, I will try~ Thanks for your help!

voldemortX commented 2 years ago

Online validation for lane networks currently uses segmentation IoU as the metric, which doesn't really tell you much.
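For anyone who wants to add the checkpointing option discussed earlier in the thread, one possible shape is a small helper that keeps the k best checkpoints under whatever score you trust. This is a sketch, not the repo's implementation: `TopKCheckpoints`, `save_fn`, and the file naming are all hypothetical, and the score is deliberately left generic because seg IoU may not be informative.

```python
import heapq
import os


class TopKCheckpoints:
    """Keep the k best checkpoints by a score where higher is better."""

    def __init__(self, k, save_fn, out_dir="."):
        self.k = k
        self.save_fn = save_fn  # callable taking a path, e.g. one that wraps torch.save
        self.out_dir = out_dir
        self._heap = []  # min-heap of (score, step, path); worst kept entry on top

    def update(self, score, step):
        """Save a checkpoint if it ranks in the top k; return its path or None."""
        if len(self._heap) >= self.k and score <= self._heap[0][0]:
            return None  # not better than the worst checkpoint we already keep
        path = os.path.join(self.out_dir, f"step_{step}.pt")
        self.save_fn(path)
        heapq.heappush(self._heap, (score, step, path))
        if len(self._heap) > self.k:
            _, _, worst = heapq.heappop(self._heap)
            if os.path.exists(worst):
                os.remove(worst)  # delete the evicted checkpoint file
        return path

    def best_paths(self):
        """Paths of the kept checkpoints, best score first."""
        return [path for _, _, path in sorted(self._heap, reverse=True)]
```

In a training loop, `update(score, step)` would be called after each validation pass, with `save_fn` closing over the model state to write.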